Open jaelynlitz opened 2 weeks ago
Perlmutter (PM) has A100s, so it may be beneficial to switch our test pipelines to the a100 partition(s) on Deception. The p100 build of haero works on the a100s (with a warning that building for compute capability 6.0 may be less performant, since these are compute capability 8.0 nodes), but building haero with Kokkos_ARCH_AMPERE80=ON
results in the following error:
```
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(409): error: identifier "__ushort2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(411): error: identifier "__int2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(413): error: identifier "__uint2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(415): error: identifier "__ll2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(417): error: identifier "__ull2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(407): error: identifier "__short2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(409): error: identifier "__ushort2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(411): error: identifier "__int2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(413): error: identifier "__uint2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(415): error: identifier "__ll2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(417): error: identifier "__ull2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(407): error: identifier "__short2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(409): error: identifier "__ushort2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(411): error: identifier "__int2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(413): error: identifier "__uint2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(415): error: identifier "__ll2bfloat16_rn" is undefined
/qfs/projects/eagles/litz372/mam4xx/second_new_haero_gpu_double_debug/include/kokkos/Cuda/Kokkos_Cuda_Half_Conversion.hpp(417): error: identifier "__ull2bfloat16_rn" is undefined
6 errors detected in the compilation of "/qfs/projects/eagles/litz372/mam4xx/src/mam4xx/aero_modes.cpp".
```
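The log above repeats the same diagnostics several times, which makes it hard to see that only six distinct bfloat16 conversion identifiers are involved (consistent with the "6 errors detected" summary). A minimal sketch for collapsing such a log is below; the function name `unique_undefined_identifiers` and the regex are illustrative assumptions about this message format, not part of haero, mam4xx, or Kokkos.

```python
import re
import sys

# Assumed nvcc diagnostic shape, taken from the log in this issue:
#   <file>(<line>): error: identifier "<name>" is undefined
UNDEFINED_ID = re.compile(
    r'^(?P<file>.+)\((?P<line>\d+)\): error: identifier "(?P<name>[^"]+)" is undefined'
)


def unique_undefined_identifiers(log_text):
    """Map each undefined identifier to the sorted source lines reporting it."""
    seen = {}
    for raw in log_text.splitlines():
        match = UNDEFINED_ID.match(raw.strip())
        if match:
            seen.setdefault(match.group("name"), set()).add(int(match.group("line")))
    # Dicts preserve insertion order, so identifiers come back first-seen first.
    return {name: sorted(lines) for name, lines in seen.items()}


if __name__ == "__main__":
    # Hypothetical usage: pipe a saved build log into the script on stdin.
    for name, lines in unique_undefined_identifiers(sys.stdin.read()).items():
        print(name, lines)
```

Feeding it the full build output on stdin gives one line per identifier, which is easier to paste into a question for the RC team.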
The build works and all tests pass with the CPU build (gcc11.2.0 + cuda11.7). Will ask the RC team about the error above for the GPU build. (Clarification: haero builds with gcc11.2.0 + cuda11.7; mam4xx then throws the above error when it is compiled.)
Not using CI to test yet because the current haero build works, and I don't want to rebuild and break CI until I've figured out what to do.
Discussed while tracking down some build issues with haero's CPU build:
The proposal is to upgrade the build to match Perlmutter: gcc/11.2.0 + cuda/11.7. Both are available on Deception.
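Before switching the pipelines, it may help to confirm that the loaded toolchain really matches the proposed versions. The sketch below is one hedged way to do that: it parses the output of `gcc --version` and `nvcc --version`. The `EXPECTED` table and the helper names are assumptions made for illustration, and the script assumes the relevant modules have already been loaded in the environment.

```python
import re
import shutil
import subprocess

# Versions proposed in this issue (assumed targets for the check).
EXPECTED = {"gcc": "11.2.0", "nvcc": "11.7"}


def parse_gcc_version(text):
    """Return the first x.y.z version found in `gcc --version` output."""
    match = re.search(r"(\d+\.\d+\.\d+)", text)
    return match.group(1) if match else None


def parse_nvcc_release(text):
    """Return the release number from `nvcc --version` output, e.g. '11.7'."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else None


PARSERS = {"gcc": parse_gcc_version, "nvcc": parse_nvcc_release}


def check_toolchain():
    """Return {tool: (found_version_or_None, matches_expected)}."""
    report = {}
    for tool, expected in EXPECTED.items():
        if shutil.which(tool) is None:
            report[tool] = (None, False)  # tool is not on PATH
            continue
        out = subprocess.run([tool, "--version"], capture_output=True, text=True)
        found = PARSERS[tool](out.stdout)
        report[tool] = (found, found == expected)
    return report


if __name__ == "__main__":
    for tool, (found, ok) in check_toolchain().items():
        print(f"{tool}: found={found} expected={EXPECTED[tool]} ok={ok}")
```

Running this at the start of a CI job would flag a mismatched module environment before haero or mam4xx is rebuilt.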