Memory issue (`cuSPARSE_STATUS_INSUFFICIENT_RESOURCES`) when running Tandem-Static Mini-App on Leonardo HPC System

Description


System	Leonardo Booster
Branch	`dmay/petsc_dev_hip` (fixes residual convergence difference between CPU and GPU)
Commit ID	`1015c31d0f29eab4983497a3ad3f607057285388`
Backends	CUDA via PETSc
Target	`static`

I'm encountering errors when running the Tandem mini-app static on the Leonardo Booster HPC system. Specifically,

the yateto kernels test fails during execution.
A cuSPARSE_STATUS_INSUFFICIENT_RESOURCES error occurs when launching the mini-app on less than (~) 48 nodes with 4 gpus per node (4*64*48 GB). I am not sure whether it is simply that the problem size is too large or something else.

Problem setup

Get audit scenario

wget https://syncandshare.lrz.de/dl/fi34J422UiAKKnYKNBkuTR/audit-scenario.zip
unzip audit-scenario.zip

Create intermediate size mesh with gmsh (same setup as Eviden-WP3)

gmsh fault_many_wide.geo -3 -setnumber h 10.0 -setnumber h_fault 0.25 -o fault_many_wide.msh

Change mesh in ridge.toml

mesh_file = "fault_many_wide.msh"
#mesh_file = "fault_many_wide_4_025.msh"

type = "elasticity"

matrix_free = true

ref_normal = [0, -1, 0]
lib = "scenario_ridgecrest.lua"
scenario = "shaker"
#[domain_output]

Steps to reproduce errors

Attempt 1: Use system installation of Petsc@3.20.1

Click to expand

**Load Modules** ```shell module purge module load petsc/3.20.1--openmpi--4.1.6--gcc--12.2.0-cuda-12.1-mumps # <---petsc module load cuda/12.1 module load eigen/3.4.0--gcc--12.2.0-5jcagas module load spack/0.21.0-68a module load cmake/3.27.7 ``` **Spack environment for Lua and Python+Numpy dependencies** ```shell spack create -d ./spack-env-tandem spack env activate ./spack-env-tandem -p spack add py-numpy lua@5.4.4 spack concretize -f spack install ``` **Install CSV module** ```shell luarocks install csv ``` **Clone Tandem** ```shell git clone -b dmay/petsc_dev_hip https://github.com/TEAR-ERC/tandem.git tandem-petsc_dev_hip cd tandem-petsc_dev_hip && git submodule update --init cd .. ``` **Build Tandem** Note: Petsc on Leonardo has been installed without a specific value for `--with-memalign`. When running the CMake configuration step ```shell cmake -B ./build -S ./tandem-petsc_dev_hip -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DPOLYNOMIAL_DEGREE=4 -DDOMAIN_DIMENSION=3 ``` I get the following error ```shell -- Could NOT find LibxsmmGenerator (missing: LibxsmmGeneratorExecutable) CMake Error at app/CMakeLists.txt:72 (message): The memory alignment of PETSc is 16 bytes but an alignment of at least 32 bytes is required for ARCH=hsw. Please compile PETSc with --with-memalign=32. ``` and so I temporarily commented out Tandem's requirement on memory alignment for Petsc in `app/CMakeLists.txt`(just to verify whether I got the same error as for the custom installation of Petsc) ```CMakeLists.txt #[=[ if(PETSC_MEMALIGN LESS ALIGNMENT) message(SEND_ERROR "The memory alignment of PETSc is ${PETSC_MEMALIGN} bytes but an alignment of " "at least ${ALIGNMENT} bytes is required for ARCH=${ARCH}. " "Please compile PETSc with --with-memalign=${ALIGNMENT}.") endif() #]=] ``` and then I build and run the tests on a login node ```shell cmake --build ./build --parallel 4 ctest --test-dir ./build ``` where the `yateto kernels` failed ```shell ctest --test-dir ./build --rerun-failed ``` ```shell Start testing: Oct 13 11:39 CEST ---------------------------------------------------------- 3/21 Testing: yateto kernels 3/21 Test: yateto kernels Command: "/leonardo_work/cin_staff/mredenti/ChEESE/TANDEM/build/app/test-elasticity-kernel" "--test-case=yateto kernels" Directory: /leonardo_work/cin_staff/mredenti/ChEESE/TANDEM/build/app "yateto kernels" start time: Oct 13 11:39 CEST Output: ---------------------------------------------------------- [doctest] doctest version is "2.3.7" [doctest] run with "--help" for options =============================================================================== /leonardo_work/cin_staff/mredenti/ChEESE/TANDEM/build/app/kernels/elasticity/test-kernel.cpp:10: TEST CASE: yateto kernels apply_inverse_mass /leonardo_work/cin_staff/mredenti/ChEESE/TANDEM/build/app/kernels/elasticity/test-kernel.cpp:4938: ERROR: CHECK( sqrt(error/refNorm) < 2.22e-14 ) is NOT correct! values: CHECK( 0.0 < 0.0 ) =============================================================================== [doctest] test cases: 1 | 0 passed | 1 failed | 0 skipped [doctest] assertions: 65 | 64 passed | 1 failed | [doctest] Status: FAILURE! Test time = 0.05 sec ---------------------------------------------------------- Test Failed. "yateto kernels" end time: Oct 13 11:39 CEST "yateto kernels" time elapsed: 00:00:00 ---------------------------------------------------------- End testing: Oct 13 11:39 CEST ``` **Running the audit scenario test case** ```shell #!/bin/bash #SBATCH -A #SBATCH -p boost_usr_prod #SBATCH --time 00:10:00 # format: HH:MM:SS #SBATCH -N 4 # 1 node #SBATCH --ntasks-per-node=4 # 4 tasks out of 32 #SBATCH --cpus-per-task=8 #SBATCH --exclusive #SBATCH --gres=gpu:4 # 4 gpus per node out of 4 #SBATCH --job-name=my_batch_job module purge module load petsc/3.20.1--openmpi--4.1.6--gcc--12.2.0-cuda-12.1-mumps # <---petsc module load cuda/12.1 module load eigen/3.4.0--gcc--12.2.0-5jcagas module load spack/0.21.0-68a module load cmake/3.27.7 # activate spack env spack env activate $WORK/mredenti/ChEESE/TANDEM/spack-env-tandem srun bash \ -c 'export CUDA_VISIBLE_DEVICES=$((SLURM_LOCALID % 4)); \ exec ./static \ ridge.toml \ --output ridgecrest \ --mg_strategy twolevel \ --mg_coarse_level 1 \ --petsc \ -ksp_view \ -ksp_monitor \ -ksp_converged_reason \ -ksp_max_it 40 \ -pc_type mg \ -mg_levels_ksp_max_it 4 \ -mg_levels_ksp_type cg \ -mg_levels_pc_type bjacobi \ -options_left \ -ksp_rtol 1.0e-6 \ -mg_coarse_pc_type gamg \ -mg_coarse_ksp_type cg \ -mg_coarse_ksp_rtol 1.0e-1 \ -mg_coarse_ksp_converged_reason \ -ksp_type gcr \ -vec_type cuda \ -mat_type aijcusparse \ -ksp_view -log_view' ``` i get the aforementioned `cuSPARSE_STATUS_INSUFFICIENT_RESOURCES` error. See log [slurm-tandem_cusparse_error.log](https://github.com/user-attachments/files/17355258/slurm-tandem_cusparse_error.log) Note: It seems I have to go up to 48 nodes to have enough memory. See log [slurm-tandem_cusparse_success_48nodes.log](https://github.com/user-attachments/files/17355260/slurm-tandem_cusparse_success_48nodes.log)

Attempt 2: Instal Petsc@3.21.5 from source

Note: Even when I install Petsc from source the resulting outcome is not any different from the errors documented in attempt one, and therefore I will only report the installation steps of Petsc

Click to expand

**Set Petsc Version** ```shell export PETSC_VERSION=3.21.5 ``` **Clone Petsc** ```shell git clone -b v$PETSC_VERSION https://gitlab.com/petsc/petsc.git petsc-$PETSC_VERSION ``` **Petsc Installation** ```shell #!/bin/bash #SBATCH -A #SBATCH -p lrd_all_serial #SBATCH --time 00:30:00 #SBATCH -N 1 #SBATCH --ntasks-per-node=1 #SBATCH --cpus-per-task=4 #SBATCH --exclusive #SBATCH --gres=gpu:0 #SBATCH --job-name=petsc_installation_3_21_5 module load gcc/12.2.0 module load openmpi/4.1.6--gcc--12.2.0 module load cuda/12.1 module load superlu-dist/8.1.2--openmpi--4.1.6--gcc--12.2.0-cuda-12.1-zsspaca module load metis/5.1.0--gcc--12.2.0 module load mumps/5.5.1--openmpi--4.1.6--gcc--12.2.0-4hwekmx module load parmetis/4.0.3--openmpi--4.1.6--gcc--12.2.0 module load cmake/3.27.7 module load openblas/0.3.24--gcc--12.2.0 module load hypre/2.29.0--openmpi--4.1.6--gcc--12.2.0-cuda-12.1-iln2jw4 module load netlib-scalapack/2.2.0--openmpi--4.1.6--gcc--12.2.0 module load eigen/3.4.0--gcc--12.2.0-5jcagas module load fftw/3.3.10--openmpi--4.1.6--gcc--12.2.0 module load cmake/3.27.7 module load spack/0.21.0-68a spack env activate $WORK/mredenti/ChEESE/TANDEM/spack-env-tandem export PETSC_VERSION=3.21.5 cd petsc-${PETSC_VERSION} ./config/configure.py \ --prefix=$WORK/mredenti/ChEESE/TANDEM/petsc-${PETSC_VERSION}-opt \ --with-ssl=0 \ --download-c2html=0 \ --download-sowing=0 \ --download-hwloc=0 \ --with-cc=${MPICC} \ --with-cxx=${MPICXX} \ --with-fc=${MPIF90} \ --with-precision=double \ --with-scalar-type=real \ --with-shared-libraries=1 \ --with-debugging=0 \ --with-openmp=0 \ --with-64-bit-indices=0 \ --with-blaslapack-lib=${OPENBLAS_LIB}/libopenblas.so \ --with-x=0 \ --with-clanguage=C \ --with-cuda=1 \ --with-cuda-dir=${CUDA_HOME} \ --with-hip=0 \ --with-metis=1 \ --with-metis-include=${METIS_INC} \ --with-metis-lib=${METIS_LIB}/libmetis.so \ --with-hypre=1 \ --with-hypre-include=${HYPRE_INC} \ --with-hypre-lib=${HYPRE_LIB}/libHYPRE.so \ --with-parmetis=1 \ --with-parmetis-include=${PARMETIS_INC} \ --with-parmetis-lib=${PARMETIS_LIB}/libparmetis.so \ --with-kokkos=0 \ --with-kokkos-kernels=0 \ --with-superlu_dist=1 \ --with-superlu_dist-include=${SUPERLU_DIST_INC} \ --with-superlu_dist-lib=${SUPERLU_DIST_LIB}/libsuperlu_dist.so \ --with-ptscotch=0 \ --with-suitespars \ --with-zlib=1 \ --with-zlib-include=${ZLIB_INC} \ --with-zlib-lib=${ZLIB_LIB}/libz.so \ --with-mumps=1 \ --with-mumps-include=${MUMPS_INC} \ --with-mumps-lib="${MUMPS_LIB}/libcmumps.so ${MUMPS_LIB}/libsmumps.so ${MUMPS_LIB}/libdmumps.so ${MUMPS_LIB}/libzmumps.so ${MUMPS_LIB}/libmumps_common.so ${MUMPS_LIB}/libpord.so" \ --with-trilinos=0 \ --with-fftw=1 \ --with-fftw-include=${FFTW_INC} \ --with-fftw-lib="${FFTW_LIB}/libfftw3_mpi.so ${FFTW_LIB}/libfftw3.so" \ --with-valgrind=0 \ --with-gmp=0 \ --with-libpng=0 \ --with-giflib=0 \ --with-mpfr=0 \ --with-netcdf=0 \ --with-pnetcdf=0 \ --with-moab=0 \ --with-random123=0 \ --with-exodusii=0 \ --with-cgns=0 \ --with-memkind=0 \ --with-memalign=64 \ --with-p4est=0 \ --with-saws=0 \ --with-yaml=0 \ --with-hwloc=0 \ --with-libjpeg=0 \ --with-scalapack=1 \ --with-scalapack-lib=${NETLIB_SCALAPACK_LIB}/libscalapack.so \ --with-strumpack=0 \ --with-mmg=0 \ --with-parmmg=0 \ --with-tetgen=0 \ --with-cuda-arch=80 \ --FOPTFLAGS=-O3 \ --CXXOPTFLAGS=-O3 \ --COPTFLAGS=-O3 make PETSC_DIR=$WORK/mredenti/ChEESE/TANDEM/petsc-${PETSC_VERSION} PETSC_ARCH="arch-linux-c-opt" all make PETSC_DIR=$WORK/mredenti/ChEESE/TANDEM/petsc-${PETSC_VERSION} PETSC_ARCH=arch-linux-c-opt install ``` **Check Petsc installation on GPU node** ```shell #!/bin/bash #SBATCH -A cin_staff #SBATCH -p boost_usr_prod #SBATCH -q boost_qos_dbg #SBATCH --time 00:10:00 #SBATCH -N 1 #SBATCH --ntasks-per-node=1 #SBATCH --cpus-per-task=4 ##SBATCH --exclusive #SBATCH --gres=gpu:1 #SBATCH --job-name=petsc_test_installation module load gcc/12.2.0 module load openmpi/4.1.6--gcc--12.2.0 module load cuda/12.1 module load superlu-dist/8.1.2--openmpi--4.1.6--gcc--12.2.0-cuda-12.1-zsspaca module load metis/5.1.0--gcc--12.2.0 module load mumps/5.5.1--openmpi--4.1.6--gcc--12.2.0-4hwekmx module load parmetis/4.0.3--openmpi--4.1.6--gcc--12.2.0 module load openblas/0.3.24--gcc--12.2.0 module load hypre/2.29.0--openmpi--4.1.6--gcc--12.2.0-cuda-12.1-iln2jw4 module load netlib-scalapack/2.2.0--openmpi--4.1.6--gcc--12.2.0 module load eigen/3.4.0--gcc--12.2.0-5jcagas module load fftw/3.3.10--openmpi--4.1.6--gcc--12.2.0 module load spack/0.21.0-68a module load zlib/1.2.13--gcc--12.2.0-b3ocy4r module load cmake/3.27.7 # activate spack env spack env activate $WORK/mredenti/ChEESE/TANDEM/spack-env-tandem export PETSC_VERSION=3.21.5 cd petsc-${PETSC_VERSION} make PETSC_DIR=$WORK/mredenti/ChEESE/TANDEM/petsc-${PETSC_VERSION}-opt PETSC_ARCH="" check make -j 4 -f $WORK/mredenti/ChEESE/TANDEM/petsc-${PETSC_VERSION}-opt/share/petsc/examples/gmakefile.test test ```

TEAR-ERC / tandem