NCAR / spack-gust

Spack production user software stack on the Gust test system

cuIpcGetMemHandle error #23

Open johnmauff opened 1 year ago

johnmauff commented 1 year ago

After setting the CUDA_VISIBLE_DEVICES environment variable (#22), I was able to launch CM1 on multiple nodes of Gust. Unfortunately, it then dies with the following error before entering the time-stepping loop. Although CM1 is an OpenACC code, it does not use OpenACC to access the MPI library. Supreeth Suresh observed a similar failure on Perlmutter, so the problem may be related to the Cray MPICH library.

(GTL DEBUG: 1) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 97

johnmauff commented 1 year ago

With the most recent update of the system software on Gust, the cuIpcGetMemHandle error appears to have been eliminated.

benkirk commented 1 year ago

Thanks for the update!

I am currently seeing the same issue with just the simple OSU benchmarks, so we are tracking it with HPE.

Glad you are unaffected now.

johnmauff commented 1 year ago

@benkirk I just realized that the error went away for me because managed memory was removed in the makefile, not because of anything in the new system software. I will discuss this with my NVIDIA contacts to see if they have any ideas.
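
For anyone who wants to check this outside of CM1 and MPI, here is a minimal sketch. It assumes the makefile change refers to CUDA managed (unified) memory, which this thread suggests but does not confirm. It uses the runtime call `cudaIpcGetMemHandle` rather than the driver call `cuIpcGetMemHandle` named in the GTL message, and it does not involve Cray MPICH at all. The file name `ipc_check.cu` and the helper `try_export` are made up for illustration.

```cuda
// ipc_check.cu (hypothetical file name): compare IPC handle export for a
// plain device allocation and a managed allocation.
// Build (assumption: nvcc is on the PATH): nvcc -o ipc_check ipc_check.cu
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: try to export an IPC handle for ptr and print
// whatever status the CUDA runtime returns.
static void try_export(const char *label, void *ptr) {
    cudaIpcMemHandle_t handle;
    cudaError_t err = cudaIpcGetMemHandle(&handle, ptr);
    std::printf("%-18s -> %s\n", label, cudaGetErrorString(err));
}

int main() {
    const size_t bytes = 1 << 20;  // 1 MiB, arbitrary size for the test
    void *device_buf = nullptr;
    void *managed_buf = nullptr;

    // Ordinary device memory.
    cudaError_t err = cudaMalloc(&device_buf, bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Managed (unified) memory.
    err = cudaMallocManaged(&managed_buf, bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        cudaFree(device_buf);
        return 1;
    }

    try_export("cudaMalloc", device_buf);
    try_export("cudaMallocManaged", managed_buf);

    cudaFree(managed_buf);
    cudaFree(device_buf);
    return 0;
}
```

If the managed allocation reports an invalid-argument style error while the plain `cudaMalloc` buffer exports cleanly, that would be consistent with the error disappearing once managed memory was removed from the build. It would not by itself show that the MPI library is handling the buffers incorrectly.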