argonne-lcf / user-guides

ALCF Systems User Documentation
https://docs.alcf.anl.gov/

Document Kokkos >= 4.2.x, <= 4.5.x issues with Cray MPICH and CUDA async memory allocations on Polaris #489

Open felker opened 2 months ago

felker commented 2 months ago

LAMMPS, AthenaK, XGC, and other Kokkos-based applications that use Kokkos 4.2.00 (released Nov 2023) or later are affected by an incompatibility between Cray MPICH (based on an older UCX) and the new default option:

-DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=ON

A runtime error is thrown by CUDA-aware Cray MPICH if you try to use Kokkos with that option enabled:

(GTL DEBUG: 2) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 148
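
For anyone building Kokkos themselves, a minimal sketch of a configure step that turns the option off might look like the following. The function name and its two path arguments are hypothetical, and `-DKokkos_ENABLE_CUDA=ON` is assumed for a CUDA build; a real Polaris build would need additional options (compiler wrappers, GPU architecture) that are not shown here.

```shell
# Hypothetical helper: configure a Kokkos build tree with CUDA async
# allocations disabled, to avoid the cuIpcGetMemHandle error above.
# $1: Kokkos source directory, $2: build directory (both are assumptions).
configure_kokkos_without_malloc_async() {
    src_dir="$1"
    build_dir="$2"
    cmake -S "$src_dir" -B "$build_dir" \
        -DKokkos_ENABLE_CUDA=ON \
        -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF
}
```

Example call, with placeholder paths: `configure_kokkos_without_malloc_async ./kokkos ./kokkos-build`.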

I assume the 3x prebuilt Kokkos modules were all compiled with that option disabled @zippylab ?

   kokkos/4.2.01/shared/PrgEnv-gnu/8.5.0/gnu/12.3/cuda_cudatoolkit_12.2.91
   kokkos/4.2.01/shared/PrgEnv-gnu/8.5.0/gnu/12.3/cuda_cudatoolkit_12.3.2
   kokkos/4.3.01_shared_PEg8.5.0_cv12.3_ct12.2.91                          (D)

There is a discussion about potentially reverting the change to the default in 4.5.x: https://github.com/kokkos/kokkos/pull/7353

zippylab commented 2 months ago

The kokkos/4.3.01_shared_PEg8.5.0_cv12.3_ct12.2.91 module was built with the CMake flag -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF

The two kokkos/4.2.01 modules were built with the default, -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=ON. I believe those will have trouble if you enable GPU-aware MPICH.
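
For a Kokkos build tree you configured yourself, one way to check which setting was used is to read it back from `CMakeCache.txt`. This is only a sketch: the function name is hypothetical, and it assumes the build directory still contains its CMake cache, which the installed modules listed above may not ship.

```shell
# Hypothetical helper: print the cached value (e.g. ON or OFF) of
# Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC for a CMake build directory.
# Prints nothing if the option was never written to the cache.
# $1: path to the build directory (assumption: CMakeCache.txt is present).
kokkos_malloc_async_setting() {
    sed -n 's/^Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC:[A-Z]*=//p' "$1/CMakeCache.txt"
}
```

Example call, with a placeholder path: `kokkos_malloc_async_setting ./kokkos-build`.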

felker commented 1 month ago

Perhaps we should mention that in all 3x .lua modulefiles, in addition to the user guide.