CUDA support? - Githubissues

beckermr commented 1 year ago

Apparently, you need a particular compiler & pip invocation to get CUDA support for mpi4py.

See this page: https://docs.nersc.gov/development/languages/python/using-python-perlmutter/#mpi4py-on-perlmutter

Do we think there is a way to enable this in the conda-forge build?

cc @jakirkham @leofang

leofang commented 1 year ago

Who from NESRC wrote this? lol It makes 0 sense... mpi4py is CUDA unaware. Only the underlying MPI library matters.

leofang commented 1 year ago

Do we think there is a way to enable this in the conda-forge build?

btw it's already done. If you create a fresh conda env

conda create -n my_env python mpi4py openmpi

and follow the on-screen instruction, CUDA awareness can be kicked off. As I said, it's done through the underlying MPI (Open MPI, in this case), not by mpi4py.

beckermr commented 1 year ago

See the release notes here: https://github.com/mpi4py/mpi4py/releases/tag/3.1.0

@dalcinl is active in conda-forge and is an mpi4py maintainer.

leofang commented 1 year ago

I wrote the mpi4py support with @dalcinl, and I enabled the CUDA awareness support on conda-forge. I am not sure what you're trying to get at.

We note, in particular, that mpi4py is by itself CUDA- unaware.

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9439927

beckermr commented 1 year ago

Ah ok. I did not realize. Thank you for the help!

leofang commented 1 year ago

No problem @beckermr! btw I am reaching out to our NERSC support persons to get that doc fixed, but if you're already ahead of me, just let me know 🙂

beckermr commented 1 year ago

I have not reached out to them. AFAIK then, there is no need to rebuild mpi4py with cuda support. We can link the conda-forge package directly to the NERSC mpi libraries.

leofang commented 1 year ago

Yes, they just need to use the "external" MPI packages. I know for certain MPICH would work on NERSC.

beckermr commented 1 year ago

Right. BTW, we don't have mpich builds with cuda support in conda-forge, right?

leofang commented 1 year ago

Nope, unfortunately MPICH requires the CUDA support to be built at compile time, and last time I checked with the MPICH devs there's no launch-time/run-time protection if CUDA is absent (link). So, unless the core devs agree to special-case for MPICH (I am looking at you Matt 😉), I don't think it's appropriate to build MPICH for each different CUDA major.minor.

beckermr commented 1 year ago

Got it. I don't think I was involved much (at all?) with the previous openmpi discussions, so I won't comment on the run-time support issue. :)

jakirkham commented 1 year ago

We might be able to improve on openmpi's CUDA support. Please see issue ( https://github.com/conda-forge/openmpi-feedstock/issues/119 ) and linked PR for more context. It still needs a bit more work, but maybe with a few more people looking at it we can sort out the remaining issues 😉

dalcinl commented 1 year ago

@leofang Maybe some of the wording you used here is not appropriate? You said mpi4py is CUDA-unaware... Well, that's a bit confusing. Perhaps the best way to say it is mpi4py inherits the GPU-awareness of the MPI backend library + mpi4py fully supports for the DLPack and CAI protocols.

leofang commented 1 year ago

Well, the ship has sailed, and I can't believe I need to quote my (our) paper twice in a day 😂

We note, in particular, that mpi4py is by itself CUDA- unaware.

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9439927

dalcinl commented 1 year ago

Well, that comment in the paper within the context of the surrounding text is very clear what you meant. But outside that context, saying CUDA-unaware can be easily misunderstood as mpi4py does not support CUDA-aware MPIs.

leofang commented 1 year ago

Right, if we are to have a pedantic discussion, let me wear my CUDA hat back 🙂 When we say a package/project/whatever is CUDA aware, one or multiple of the following conditions should hold true:

It require the presence of CUDA Toolkit at build- and/or run- time
It needs to be built by nvcc (for offline compilation) or linked to NVRTC (for JIT compilation)
It links to libcuda (driver) or libcudart (runtime) and uses symbols therein
It creates or recognizes a CUDA context at run time so as to communicate with (and possibly scheduling work onto) an NVIDIA GPU
It links to CUDA libraries for domain-specific tasks

Clearly, none of these applies to mpi4py (despite we put significant efforts in supporting it!) I could have said "after a careful design and collaboration with the community, mpi4py is able to support Python GPU libraries without being aware of CUDA," but this is just awfully mouthy. From the packaging perspective, it's a lot easier if we just say mpi4py is CUDA unaware. Pedantic discussions should be left to a design issue or a paper IMHO.

heather999 commented 1 year ago

Just wanted to reach out concerning CUDA on NERSC Perlmutter GPU nodes. I see this comment which gives me hope. I create a very simple conda environment, include the "external" mpich package, install mpi4py from conda-forge (and that's it!) and the NERSC system mpich library (cray-mpich-abi) is available and seems to run correctly on Perlmutter's CPU-only compute nodes. However, repeating the same test of the conda environment on Perlmutter's GPU nodes fails with an error about the GTL library. I have tried this with both NERSC's cudatoolkit module (11.7) and the cudatoolkit (11.7.0) installed from conda-forge - as one of those needs to be available. NERSC has a variety of comments about this specific issue with the GTL library and notes some env variables that need to be set - and I have made sure they are.

Meanwhile, I can set up another simple conda environment, and follow NERSC's instructions, skip the external mpich package from conda-forge, do this special pip install of mpi4py and it works on the Perlmutter GPU nodes. Something is different, whether or not mpi4py is cuda-aware. I'd really prefer to use the conda-forge external mpich library though and avoid this pip install step of mpi4py if possible. I'm probably missing something very basic, is there an example where the conda-forge mpich external package and mpi4py are set up in a conda environment and this works on the Perlmutter GPU compute nodes?

leofang commented 1 year ago

Hi @heather999, sorry to hear your frustration. Unfortunately I've left DOE for a while and lost access to NERSC, so can't test it myself right away, but it should be working based on my (distant) past experience and other users' feedback.

Judging from this statement

I create a very simple conda environment, include the "external" mpich package, install mpi4py from conda-forge (and that's it!) and the NERSC system mpich library (cray-mpich-abi) is available and seems to run correctly on Perlmutter's CPU-only compute nodes.

It doesn't seem to be any ABI compatibility issue since the empty mpich + mpi4py from CF + Cray MPI works on CPU-only workloads. This statement alone is enough to say mpi4py from CF is not the problem.

Now,

and I have made sure they are

would you confirm that both are set but not working?

export MPICH_GPU_SUPPORT_ENABLED=1
export CRAY_ACCEL_TARGET=nvidia80

If so, I guess I might have a theory. It seems setting CRAY_ACCEL_TARGET would add a linker flag to cc so that it knows which shared library to link to, but if you use mpi4py from CF it's not linked to that. I think this is a design issue in Cray MPI: They should have linked libmpi.so to the transport library for the user, or do a dlopen to load the library at runtime if the env var is set. Otherwise, they put burdens on users and you wouldn't be able to use prebuilt binary packages (such as CF's mpi4py).

I would suggest to ask NERSC support about which shared library to load, and try LD_PRELOAD to force loading it. I believe this would fix the issue.

beckermr commented 1 year ago

One idea occurred to me here. We might build our own copy of the mpi4py package in a local channel at NERSC with the correct linkages. Then we can tuck this into a higher priority channel so it gets pulled in first.

jakirkham commented 1 year ago

This is a strategy that has worked well in different contexts

One thing to keep in mind is how channels/labels get setup. Namely if there are other channels in use for say NERSC or DOE products, would recommend keeping them separate from say modified conda-forge packages. Doing a little upfront work to set things up right can be a bit of a drag, but it beats trying to fix things later when people depend on them. Just something to keep in mind 😉

beckermr commented 1 year ago

Ahhhh pro tip! Thanks! If you all think of anything else, let me know.

dalcinl commented 1 year ago

We might build our own copy of the mpi4py ... with the correct linkages.

@beckermr What do you exactly mean with this? What is linking is incorrect? All what should be needed is for libmpi.so.12 to be found in LD_LIBRARY_PATH. Am I missing something?

beckermr commented 1 year ago

The HPC center I work at has special compiler flags for linking mpich with CUDA awareness. They link some libs directly to mpi4py instead of to libmpi. So we would try to build a package there where we have access to the libs and can link things properly.

dalcinl commented 1 year ago

The HPC center I work at has special compiler flags for linking mpich with CUDA awareness.

Is this information available somewhere?

They link some libs directly to mpi4py instead of to libmpi.

Awful.

So we would try to build a package there where we have access to the libs and can link things properly.

Maybe there is way to create a libmpi.so.12 file that links to all the other MPI+CUDA stuff (as done currently with mpi4py), and then you point LD_LIBRARY_PATH to it.

beckermr commented 1 year ago

Right. We have to do something custom. A recipe and local channel seems less painful than special libs+linking, but I am not 100% sure.

dalcinl commented 1 year ago

A recipe and local channel seems less painful than special libs+linking

IMHO, creating a specially crafted lib and a module file appending to LD_LIBRARY_PATH for users to module load is less painful (for users) than having to use special channels. Of course, I'm talking without knowing the specific details of the system.

beckermr commented 1 year ago

Hmmmmmm. I had not considered using the module system. Maybe the right thing is to ask the admins to do the extra linkage for us if it works. I think Leo mentioned this. Thanks for the input!

jakirkham commented 1 year ago

Yeah agree with Lisandro. The downside of adding a package is it now needs to be maintained indefinitely (and who maintains it?). Adding some local machine configuration (module load or otherwise) only needs to be maintained on that machine (and by people who do that maintenance). Something to consider

conda-forge / mpi4py-feedstock

CUDA support? #62