NCAR / spack-derecho

Spack production user software stack on the Derecho system

Error running nsys under mpiexec #23

Open roryck opened 1 day ago

roryck commented 1 day ago
Currently Loaded Modules:
  1) ncarenv/23.09 (S)   2) craype/2.7.23   3) intel/2023.2.1   4) ncarcompilers/1.0.0   5) cuda/12.2.1   6) cray-mpich/8.1.27

This works without MPI:

> nsys profile ./hello
----- ----- -----
Using 1 MPI Ranks and GPUs
----- ----- -----
Message before GPU computation: xxxxxxxxxxxx
Generating '/glade/derecho/scratch/rory/tmp/nsys-report-7b53.qdstrm'
[1/1] [========================100%] report4.nsys-rep
Generated:
    /glade/u/home/rory/tests/mpi_cuda_hello/report4.nsys-rep

but it fails with a spurious "not installed" error when launched under MPI:

> mpiexec -n 4 nsys profile ./hello
Error: Nsight Systems 2023.2.3 hasn't been installed with CUDA Toolkit 12.2
Error: Nsight Systems 2023.2.3 hasn't been installed with CUDA Toolkit 12.2
deg0061.hsn.de.hpc.ucar.edu: rank 0 exited with code 1
deg0061.hsn.de.hpc.ucar.edu: rank 2 died from signal 15
benkirk commented 1 day ago

Do you think the mpiexec --no-transfer flag might help?
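
For reference, the suggested invocation would presumably be the 4-rank run from above with the flag added (a sketch; as I understand it, --no-transfer tells PALS not to stage the launched binary, here the nsys wrapper, onto the compute nodes, so it runs from its shared-filesystem install instead):

> mpiexec --no-transfer -n 4 nsys profile ./hello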

roryck commented 1 day ago

Indeed, that seems to do it. Good thought there!

vanderwb commented 1 day ago

I've sometimes thought about setting PALS_TRANSFER=false globally, but am leery of side-effects. Maybe something to explore coming out of an outage?
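
For concreteness, the global setting under discussion would amount to something like the line below; where it would actually live (a site login profile, a module file) is an assumption of this sketch, not something settled in the thread:

  # hypothetical site-wide default for PALS
  export PALS_TRANSFER=false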

benkirk commented 1 day ago

I do like the idea of testing PALS_TRANSFER=false in the future. It seems like something that made sense when Cray defaulted to static linking, but it doesn't do much good now that all our 'executables' still carry a bunch of shared-library filesystem dependencies.
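
Those dependencies are easy to confirm with plain ldd on the test binary from above; even Cray-built executables typically resolve a long list of shared libraries from the filesystem:

> ldd ./hello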

roryck commented 1 day ago

I think it's worth testing globally. We could probably do a reasonable job of screening for (performance) side effects with an AWT workload.
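
A minimal sketch of such a trial, flipping the variable for a single job rather than globally (the workload executable name here is hypothetical):

> PALS_TRANSFER=false mpiexec -n 4 ./awt_workload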