NCAR / spack-derecho

Spack production user software stack on the Derecho system

Error running nsys under mpiexec #23

Open · roryck opened this issue 1 week ago

roryck commented 1 week ago

Currently Loaded Modules:
  1) ncarenv/23.09 (S)   2) craype/2.7.23   3) intel/2023.2.1   4) ncarcompilers/1.0.0   5) cuda/12.2.1   6) cray-mpich/8.1.27

This works without MPI:

> nsys profile ./hello
----- ----- -----
Using 1 MPI Ranks and GPUs
----- ----- -----
Message before GPU computation: xxxxxxxxxxxx
Generating '/glade/derecho/scratch/rory/tmp/nsys-report-7b53.qdstrm'
[1/1] [========================100%] report4.nsys-rep
Generated:
    /glade/u/home/rory/tests/mpi_cuda_hello/report4.nsys-rep

but it gives a weird error about not being installed when run with MPI:

> mpiexec -n 4 nsys profile ./hello
Error: Nsight Systems 2023.2.3 hasn't been installed with CUDA Toolkit 12.2
Error: Nsight Systems 2023.2.3 hasn't been installed with CUDA Toolkit 12.2
deg0061.hsn.de.hpc.ucar.edu: rank 0 exited with code 1
deg0061.hsn.de.hpc.ucar.edu: rank 2 died from signal 15
benkirk commented 1 week ago

Do you think the mpiexec --no-transfer flag might help?

roryck commented 1 week ago

Indeed, that seems to do it. Good thought there!
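
For posterity, a sketch of the working invocation (rank count as in my original report). My rough understanding, hedging a bit, is that PALS by default stages the launched executable to node-local storage on the compute nodes, so nsys runs from a path where it can't locate the rest of its install tree; --no-transfer keeps it running from its original location on the shared filesystem.

# run nsys from its installed location instead of a staged copy
> mpiexec --no-transfer -n 4 nsys profile ./hello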

vanderwb commented 1 week ago

I've sometimes thought about setting PALS_TRANSFER=false globally, but am leery of side-effects. Maybe something to explore coming out of an outage?
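
If anyone wants to trial it per-job before we flip anything globally, a minimal sketch (assuming PALS_TRANSFER=false in the job environment behaves like passing --no-transfer to mpiexec):

# hypothetical per-job test of disabling binary transfer
> export PALS_TRANSFER=false
> mpiexec -n 4 nsys profile ./hello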

benkirk commented 1 week ago

I do like the idea of testing PALS_TRANSFER=false in the future. It seems like something that made sense when Cray defaulted to static linking, but it no longer does much good when all our 'executables' still have a bunch of shared-library filesystem dependencies.
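
To illustrate the point (a sketch, using the hello binary from this issue): even when PALS transfers the executable itself, its dynamic dependencies still get resolved from the shared filesystem on every node, which ldd makes visible:

# list the shared-library dependencies the staged binary would still
# pull from the shared filesystem at runtime
> ldd ./hello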

roryck commented 1 week ago

I think it's worth testing globally. We could probably do a reasonable job of screening for (performance) side effects with an AWT workload.