TEAR-ERC / tandem

An HPC DG method for 2D and 3D SEAS problems
BSD 3-Clause "New" or "Revised" License

Multigrid on GPU yields different results to CPU #52

Open hpc4geo opened 9 months ago

hpc4geo commented 9 months ago

Issue #50 identified some unexpected behavior when comparing CPU results with GPU results. The convergence history is different when the same PETSc option set is provided for an MG configuration.

Attached are the logs Thomas generated. tandem_GPU.log tandem_CPU.log
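If both runs were made with `-ksp_monitor`, the two residual histories can be diffed mechanically instead of by eye. A minimal sketch, assuming the standard PETSc monitor line format (`"  3 KSP Residual norm 1.23e-05"`); the helper names are mine:

```python
import re

# Matches PETSc -ksp_monitor output lines, e.g.:
#   "  3 KSP Residual norm 1.234567890123e-05"
MONITOR_RE = re.compile(r"^\s*(\d+)\s+KSP Residual norm\s+([\d.eE+-]+)")

def residual_history(log_text):
    """Extract (iteration, residual) pairs from a PETSc log."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in map(MONITOR_RE.match, log_text.splitlines()) if m]

def first_divergence(hist_a, hist_b, rtol=1e-12):
    """Return the first iteration at which two histories disagree, or None."""
    for (it_a, r_a), (it_b, r_b) in zip(hist_a, hist_b):
        if it_a != it_b or abs(r_a - r_b) > rtol * max(abs(r_a), abs(r_b)):
            return it_a
    return None
```

Applied to the attached logs, `first_divergence(residual_history(...tandem_CPU.log...), residual_history(...tandem_GPU.log...))` would report the first iteration at which the two runs part ways, which narrows down which solver component introduces the difference.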

The differences in the residual history are most likely associated with the ILU and LU solvers. I suggest confirming this by re-running the CPU and GPU variants with the following additional options (placed at the end of any existing options).

-ksp_max_it 10
-mg_levels_pc_type jacobi
-mg_coarse_mg_coarse_ksp_type cg
-mg_coarse_mg_coarse_ksp_rtol 1.0e-2
-mg_coarse_mg_coarse_ksp_max_it 100
-mg_coarse_mg_coarse_pc_type jacobi
Thomas-Ulrich commented 5 days ago

I asked the LLM Qwen2 and got some hints for tracking down the problem, which might make sense:

The differences in the convergence history between the CPU and GPU runs could be attributed to several factors, particularly those related to the linear solvers and preconditioners used in your PETSc configuration. Here are some potential causes and suggestions to troubleshoot and potentially resolve the issue:

  1. Numerical Precision: GPU computations often use lower precision (e.g., single precision) compared to the double precision typically used on CPUs. This can lead to differences in the numerical results. Ensure that both the CPU and GPU versions are using the same precision.
  2. Different Implementations of Linear Solvers and Preconditioners: The GPU log shows that the AIJCUSPARSE matrix type is used, which is specific to CUDA and might have different implementations of solvers and preconditioners compared to the CPU versions. The AIJ matrix type is used on the CPU, which might have different behavior.
  3. ILU and LU Preconditioners: The ILU and LU preconditioners can be sensitive to the order of operations and numerical precision. The GPU implementation might be using a different algorithm or precision that leads to different results.
  4. Eigenvalue Estimation: The GPU log mentions Chebyshev smoothing, which requires eigenvalue estimates. The estimation process might differ between CPU and GPU, leading to different smoothing effects.
  5. Block Jacobi Preconditioner: The block Jacobi preconditioner uses different block sizes on CPU and GPU (12 vs. 30). This can lead to different preconditioning effects.
  6. GAMG Preconditioner: The GAMG preconditioner has different behavior on CPU and GPU. The GPU version uses a different number of levels (4 vs. 5 on CPU) and might have different threshold settings for dropping small values in the graph.
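On point 1: PETSc's scalar precision is fixed at build time (`PetscScalar`), but the effect a single-precision accumulator would have on a residual history is easy to illustrate. The sketch below simulates float32 rounding with the standard `struct` module; it stands in for an actual single-precision build, which is an assumption:

```python
import struct

def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(values, single_precision=False):
    """Sum values, optionally rounding the running total to float32 at each
    step, mimicking a single-precision reduction."""
    total = 0.0
    for v in values:
        total = total + v
        if single_precision:
            total = to_f32(total)
    return total

# Adding 1e-8 repeatedly to 1.0: below float32 machine epsilon (~1.2e-7),
# each increment is rounded away entirely in single precision.
values = [1.0] + [1e-8] * 100_000
double = accumulate(values)                         # ~1.001
single = accumulate(values, single_precision=True)  # stays exactly 1.0
```

The same mechanism (different rounding, plus different reduction orderings on the GPU) is enough to make two mathematically identical solvers produce diverging residual histories.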

To address these issues, consider the following steps:

  1. Check Precision: Ensure that both CPU and GPU runs are using the same precision. You can force double precision on the GPU by setting the appropriate PETSc options.
  2. Consistent Preconditioners: Try to use preconditioners that have consistent implementations across CPU and GPU. For example, use the AIJ matrix type on both CPU and GPU with the appropriate PETSc options.
  3. Eigenvalue Estimation: Investigate if the eigenvalue estimation process can be made consistent between CPU and GPU. This might involve using the same estimation method or adjusting the Chebyshev smoothing parameters.
  4. Block Size: Use the same block size for the block Jacobi preconditioner on both CPU and GPU.
  5. GAMG Settings: Adjust the GAMG settings to be consistent between CPU and GPU. This might involve setting the same number of levels and threshold values.
  6. Convergence Criteria: Ensure that the convergence criteria are the same for both CPU and GPU runs. This includes the relative and absolute tolerances.
  7. Profiling and Debugging: Use profiling tools to compare the performance and numerical results of the solvers and preconditioners on CPU and GPU. Debugging tools can also help identify where the differences occur.
  8. Documentation and Community: Consult the PETSc documentation and community forums for information on known differences between CPU and GPU implementations. There might be specific recommendations or workarounds for your use case.
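Several of the steps above (consistent preconditioners, fixed eigenvalue bounds, same tolerances) can be collected into a shared options fragment. This is a sketch, not a tested configuration: the option names are standard PETSc, but the values are illustrative and would need tuning for tandem.

```
# Shared options for both backends (illustrative values)
-ksp_rtol 1.0e-8
-ksp_max_it 100
-ksp_monitor_true_residual
-pc_gamg_threshold 0.0                          # same graph-dropping threshold
-mg_levels_ksp_type chebyshev
-mg_levels_ksp_chebyshev_eigenvalues 0.1,1.1    # fixed bounds, no estimation
-mg_levels_pc_type jacobi                       # pointwise Jacobi is backend-independent
-mg_coarse_ksp_type preonly
-mg_coarse_pc_type jacobi                       # avoid ILU/LU, which differ across backends

# CPU run only:
-mat_type aij

# GPU run only:
-mat_type aijcusparse
-vec_type cuda
```

With ILU/LU replaced by Jacobi everywhere and the Chebyshev eigenvalue estimation disabled, any remaining divergence between the two residual histories should be down to rounding and reduction order rather than algorithmic differences.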

By systematically addressing these potential causes, you should be able to identify and mitigate the differences in the convergence history between the CPU and GPU runs.