NCAR / spack-gust

Spack production user software stack on the Gust test system

ncarenv/22.08b: MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked #16

Closed by benkirk 1 year ago

benkirk commented 1 year ago

Under ncarenv/22.08b:

$ CC -o hello_world_mpi /glade/u/home/benkirk/hello_world_mpi.C -fopenmp && ./hello_world_mpi 
MPICH ERROR [Rank 0] [job id ] [Wed Aug 31 10:43:55 2022] [gu0001] - Abort(-1) (rank 0 in comm 0): MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
 (Other MPI error)

aborting job:
MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked

Reverting to ncarenv/22.08 shows that the older stack is unaffected.

vanderwb commented 1 year ago

So this came about after Jeremy reported needing to set MPICH_GPU_SUPPORT_ENABLED=1 in his batch jobs to use GPUs with cray-mpich. I figured adding it to the cray-mpich module would be a convenience to users, since online documentation suggests it should at most add a small launch delay. Obviously, though, it can't be set for applications built without GPU hooks.

I believe we need to add this variable to our documentation. I'll take a pass at that.
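
For a binary built without GPU hooks, one possible workaround is to drop the variable for just that command. This is a minimal sketch that assumes cray-mpich only performs the GTL check when the variable is set to 1. `run_without_gpu_mpi` is a hypothetical helper name, not something provided by the module.

```shell
# Hypothetical helper: run one command with MPICH_GPU_SUPPORT_ENABLED removed
# from its environment, regardless of what the loaded module exported.
# Uses `env -u`, which is available in GNU coreutils and on BSD systems.
run_without_gpu_mpi() {
    env -u MPICH_GPU_SUPPORT_ENABLED "$@"
}

# Demonstration: the variable stays set in the shell, but the child
# process started through the helper does not see it.
export MPICH_GPU_SUPPORT_ENABLED=1
run_without_gpu_mpi sh -c 'echo "child sees: ${MPICH_GPU_SUPPORT_ENABLED:-unset}"'
```

With this in place, something like `run_without_gpu_mpi ./hello_world_mpi` would launch the CPU-only binary without changing the rest of the session.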

vanderwb commented 1 year ago

Added the following text to our draft doc:

If you are using an MPI application compiled with GPU support, you will need to set/export the environment variable MPICH_GPU_SUPPORT_ENABLED=1 before calling the MPI launcher in your job. This variable, when active, tells cray-mpich to enable CUDA functionality.
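
A job script following that text might look roughly like the sketch below. The resource request, the rank count, and the application name `./my_gpu_app` are placeholders chosen for illustration, not values taken from this thread.

```shell
#!/bin/bash
# Hypothetical PBS resource request; adjust to the actual system.
#PBS -l select=1:ncpus=4:mpiprocs=4:ngpus=4

# Tell cray-mpich to enable its GPU support. This must happen before the
# launcher runs so that every rank inherits the setting.
export MPICH_GPU_SUPPORT_ENABLED=1

# Launch only if an MPI launcher is on PATH (i.e. the MPI module is loaded).
if command -v mpiexec >/dev/null 2>&1; then
    mpiexec -n 4 ./my_gpu_app   # placeholder application name
else
    echo "mpiexec not found; load the MPI module first" >&2
fi
```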

roryck commented 1 year ago

Longer term on Derecho, do you think it's worth looking at a PBS hook or mpiexec wrapper that sets this automatically only when GPUs are being requested?

vanderwb commented 1 year ago

I think it is worth exploring. The only potential problem I see is if users try to run a non-GPU MPI app in the same job. In some cases we probably want that to break, but maybe there are legit use cases?
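
One way such a wrapper could handle the mixed case is to set the variable automatically but let an explicit user setting win. This is only a sketch: it assumes a non-empty CUDA_VISIBLE_DEVICES is a usable signal that the job was given GPUs, which may not hold on a given system, and `gpu_support_value` and `mpiexec_gpu_aware` are hypothetical names.

```shell
# Decide which value MPICH_GPU_SUPPORT_ENABLED should have for a launch.
# Assumption: a non-empty CUDA_VISIBLE_DEVICES means GPUs were requested.
gpu_support_value() {
    if [ -n "${MPICH_GPU_SUPPORT_ENABLED+x}" ]; then
        # The user set the variable explicitly (possibly to 0 for a
        # non-GPU MPI app in a GPU job), so pass that value through.
        echo "$MPICH_GPU_SUPPORT_ENABLED"
    elif [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 1
    else
        echo 0
    fi
}

# Hypothetical wrapper: apply the computed value to this launch only,
# leaving the caller's environment untouched.
mpiexec_gpu_aware() {
    MPICH_GPU_SUPPORT_ENABLED="$(gpu_support_value)" mpiexec "$@"
}
```

Under this design, a user running a CPU-only binary inside a GPU job would export MPICH_GPU_SUPPORT_ENABLED=0 before calling the wrapper, and the default behavior would still cover the common GPU case.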

vanderwb commented 1 year ago

Resolved by setting the variable in the get_local_rank helper script.