Open garlick opened 5 years ago
My guess is this is because Spectrum MPI dlopens libpami_cudahook.so. I suspect you can avoid this error by setting LD_PRELOAD to the path to libpami_cudahook.so. With Flux's current Spectrum MPI support, this shared object won't be used, so this should be safe in theory. You should be able to find the libpami_cudahook.so path by looking at the LD_PRELOAD environment variable under jsrun with an MPI program.
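As a rough illustration of that lookup, the sketch below pulls the libpami_cudahook.so entry out of an environment mapping. The helper name `find_cudahook` is hypothetical, and it assumes that jsrun exports the hook path through LD_PRELOAD, which is what the comment above suggests but which has not been verified here.

```python
import os

# Library name taken from the discussion above.
HOOK_BASENAME = "libpami_cudahook.so"


def find_cudahook(env):
    """Return the first LD_PRELOAD entry naming libpami_cudahook.so, or None.

    `env` is any mapping of environment variables (for example os.environ).
    ld.so accepts both colons and spaces as LD_PRELOAD separators.
    """
    preload = env.get("LD_PRELOAD", "")
    for entry in preload.replace(":", " ").split():
        # startswith also matches versioned names such as libpami_cudahook.so.1
        if os.path.basename(entry).startswith(HOOK_BASENAME):
            return entry
    return None


if __name__ == "__main__":
    # Prints the hook path if the launcher set it, otherwise None.
    print(find_cudahook(os.environ))
```

Launching this script under jsrun in place of the MPI program should print the path if the assumption holds. The resulting path is what you would then assign to LD_PRELOAD before starting the job under Flux.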
Without getting into too much detail, this is an ugly optimization technique that IBM used to allow their MPI to send buffers allocated by CUDA memory allocation routines. The interception of the CUDA driver calls was achieved by wrapping dlsym in libpami_cudahook.so, which is preloaded into each MPI process. But this has had lots and lots of issues, not least of which was compatibility with both performance and debugging tools.
This will have to be revisited when @rountree is finishing up his PMIx work, as PAMI will require this to be set correctly and we want support for tools at that point as well. I remember you could get good mileage by putting libpami_cudahook.so as the last path in LD_PRELOAD.
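The reordering mentioned above could be sketched as follows. This is only an illustration: `move_hook_last` is a hypothetical helper, and `/tmp/libtool.so` in the usage note stands in for whatever tool library would also be preloaded.

```python
import os
import re

# Library name taken from the discussion above.
HOOK_BASENAME = "libpami_cudahook.so"


def move_hook_last(ld_preload, hook_basename=HOOK_BASENAME):
    """Return an LD_PRELOAD value with the hook entries moved to the end.

    Entries may be separated by colons or whitespace, as ld.so allows.
    The relative order of all other entries is preserved.
    """
    entries = [e for e in re.split(r"[:\s]+", ld_preload) if e]

    def is_hook(entry):
        return os.path.basename(entry).startswith(hook_basename)

    others = [e for e in entries if not is_hook(e)]
    hooks = [e for e in entries if is_hook(e)]
    return ":".join(others + hooks)
```

For example, `move_hook_last("/opt/ibm/spectrum_mpi/lib/libpami_cudahook.so:/tmp/libtool.so")` returns `/tmp/libtool.so:/opt/ibm/spectrum_mpi/lib/libpami_cudahook.so`, so the tool's library is resolved first and the hook comes last.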
Hmmm. This one boggles me. The spectrum.lua plugin does prepend /opt/ibm/spectrum_mpi/lib/libpami_cudahook.so to LD_PRELOAD [source code]. And that file seems to exist:
→ stat /opt/ibm/spectrum_mpi/lib/libpami_cudahook.so
File: ‘/opt/ibm/spectrum_mpi/lib/libpami_cudahook.so’ -> ‘libpami_cudahook.so.1’
Size: 21 Blocks: 0 IO Block: 65536 symbolic link
Device: 901h/2305d Inode: 6357621 Links: 1
Access: (0777/lrwxrwxrwx) Uid: ( 1/ bin) Gid: ( 1/ bin)
Access: 2019-05-14 23:42:50.842844635 -0700
Modify: 2019-02-12 13:13:21.741949852 -0800
Change: 2019-02-12 13:13:21.741949852 -0800
Birth: -
I have a .notce environment
I wonder if this has something to do with it. What happens if you run module use /usr/tcetmp/modulefiles/Core, then module load StdEnv, and then your login node flux instance + wreckrun? That should pull Spectrum MPI, the XL compiler, and most importantly CUDA into your environment:
→ module show StdEnv
<snip>
load("xl")
load("spectrum-mpi/rolling-release")
load("cuda")
Hmmm. I think we need to find who defines PAMI_CUDA_RegisterPAMIContexts. Judging from the symbol name, it looks like it comes from the PAMI library itself or one of its dependencies. Perhaps running nm on the libraries in the Spectrum MPI directory suggests something?
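One way to carry out that nm search is sketched below, assuming GNU nm is on the PATH. The helper names `defines_symbol` and `find_definers` are hypothetical, and scanning `*.so*` files is just one guess at which files matter.

```python
import glob
import os
import subprocess


def defines_symbol(nm_output, symbol):
    """Return True if nm output lists `symbol` as a defined symbol.

    Defined symbols appear as "<address> <type> <name>"; undefined ones
    have no address and type "U", so they are skipped here.
    """
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[1].upper() == "U":
            continue
        # Drop a symbol-version suffix such as "@@VERSION" if nm prints one.
        if parts[2].split("@")[0] == symbol:
            return True
    return False


def find_definers(directory, symbol):
    """List shared objects in `directory` whose dynamic symbols define `symbol`."""
    definers = []
    for path in sorted(glob.glob(os.path.join(directory, "*.so*"))):
        # -D reads the dynamic symbol table; --defined-only hides imports.
        result = subprocess.run(
            ["nm", "-D", "--defined-only", path],
            capture_output=True, text=True,
        )
        if result.returncode == 0 and defines_symbol(result.stdout, symbol):
            definers.append(path)
    return definers
```

Calling `find_definers("/opt/ibm/spectrum_mpi/lib", "PAMI_CUDA_RegisterPAMIContexts")` on the affected system would list candidate libraries. An empty result would suggest the symbol lives in a different directory or is not exported dynamically.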
When running an MPI hello world program under Flux on lassen, I get the following FATAL ERROR (the horror!) but my program still runs just fine. Also note the Lua complaint. (line breaks added for readability)
Flux was started locally on a login node (lassen708), I have a .notce environment, and this was run from source which git describes as v0.11.1.