conda-forge / openmpi-feedstock

A conda-smithy repository for openmpi.
BSD 3-Clause "New" or "Revised" License

Added high-speed interconnect support by enabling UCX #87

Closed: shankar1729 closed this 3 years ago

shankar1729 commented 3 years ago


Fixes #42 Fixes #38 (missing openib support). Instead of enabling ibverbs, which is marked as deprecated within Open MPI anyway, enabling UCX support is much easier and cleaner, since the ucx package is already available on conda-forge.

Adding ucx as a dependency (at build and run time) automatically gets Open MPI to include support for a number of high-speed interconnects (InfiniBand, Omni-Path, etc.). No change to the build scripts is needed beyond the dependency in meta.yaml!
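
Roughly, the change amounts to something like this in meta.yaml (an illustrative sketch, not the literal recipe diff; pins and selectors omitted):

    requirements:
      host:
        - ucx    # headers/libs land in the build prefix, so ./configure picks up UCX
      run:
        - ucx    # the UCX PML/OSC components need libucp and friends at runtime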

conda-forge-linter commented 3 years ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

jakirkham commented 3 years ago

Edited OP to note issues fixed. Hope that is ok :)

jakirkham commented 3 years ago

Also I think we need the --with-ucx flag in build.sh’s call to ./configure

jakirkham commented 3 years ago

Thanks Ravishankar! 😀

shankar1729 commented 3 years ago

Great! Hope it's useful. Open MPI's configure seems to enable UCX automatically if it finds the headers, so I didn't need to add --with-ucx: the UCX headers and libraries get pulled into the standard paths in the build environment.

shankar1729 commented 3 years ago

Seems fine to me. I'm new to conda packaging, though, so I'm not yet familiar with how to test these properly.

jakirkham commented 3 years ago

No worries and thanks again for working on this 🙂

leofang commented 3 years ago

@shankar1729 @jakirkham Thanks. I didn't know UCX was available on conda-forge. A couple of questions:

  1. Is the ucx package smart enough to locate the necessary NIC drivers, modules, etc. when they are installed in non-standard paths?
  2. Can we list ucx in "run_constrained" instead of "run"? That would make it optional (but used if present). I am assuming ucx is dlopen'd at runtime, so we should be able to do this (a sketch of what I mean follows this list).
  3. On CUDA awareness: I see that ucx is compiled with CUDA. Does it work for all CUDA versions? Would Open MPI's CUDA awareness be affected if we go through ucx instead of smcuda? If UCX is installed, can we leave its CUDA support off by default (it would otherwise look up the pointer attribute on every call, which incurs extra overhead)? Do we still use the same MCA parameter, opal_cuda_support, to turn it on?
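
To make point 2 concrete, something like this is what I have in mind for meta.yaml (a sketch only; the version pin is hypothetical):

    requirements:
      host:
        - ucx                 # still needed at build time so the UCX components get compiled
      run_constrained:
        - ucx >=1.9,<2.0a0    # hypothetical pin: applies only if a user installs ucx next to openmpi
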
shankar1729 commented 3 years ago
  1. I don't know about this one.
  2. I checked by force-removing ucx. It causes warnings during MPI startup about mca_osc_ucx and mca_pml_ucx due to the missing libucp.so.0, but otherwise works fine with the alternate (low-speed) interfaces.
  3. Open MPI automatically uses smcuda when applicable for multiple GPUs on a single node. As far as I understand, node-to-node GPU communication doesn't work in the present conda openmpi without ucx. All my GPUs are on one node, so I can't test this easily.
leofang commented 3 years ago

I am testing this locally and can only get CUDA awareness to work by setting UCX_MEMTYPE_CACHE=n:

UCX_MEMTYPE_CACHE=n mpirun -n 4 ./mpi_hello_world_cuda

Seems to be documented here.

I am in favor of getting the ucx support in, but not enabling it by default. @shankar1729 Do you mind if I push to your branch? I'll make the necessary changes (setting default MCA parameters, adding user instructions, etc.) following the same treatment as for plain (non-UCX) CUDA awareness.

jakirkham commented 3 years ago

Responded in the issue (https://github.com/conda-forge/ucx-split-feedstock/issues/66), though it will require some more investigation.

I should add that I am in the middle of other work, so it might take a bit before I have the bandwidth to look into this.

shankar1729 commented 3 years ago

@leofang Yes, please do use my branch to stage these changes.

Also, with respect to the CUDA builds of UCX: it seems the build recipe has been modified to support CUDA, but the released versions on conda-forge are without it.

Consequently, in my local tests I need to bypass the UCX PML when using GPU memory. I have always run my GPU codes with --mca pml ob1 when using Open MPI (otherwise I get a segfault in libucp.so), so ignore my response to point 3 above. With the currently released ucx package linked against openmpi, Open MPI bypasses UCX because I specified this switch in my tests.

Best, Shankar

leofang commented 3 years ago

Pending https://github.com/conda-forge/ucx-split-feedstock/pull/89.

leofang commented 3 years ago

> Pending conda-forge/ucx-split-feedstock#89.

It's just in, but I'll likely have to resume on Friday or later.

leofang commented 3 years ago

Debugging the dependency conflict using mambabuild...

leofang commented 3 years ago

@conda-forge-admin, please restart ci

isuruf commented 3 years ago

Why does this need multiple builds? Can't this be one single build as it used to be?

leofang commented 3 years ago

> Why does this need multiple builds? Can't this be one single build as it used to be?

So for CUDA it needs two builds: 9.2 and 10.0+. It is because UCX changes the internal code for CUDA 10.0+, see the discussion around https://github.com/conda-forge/ucx-split-feedstock/issues/66#issuecomment-813944211.

As for the additional pure-CPU build, my reasoning is that we use ucx-proc =*=gpu as one of the host dependencies, and I am not sure whether that could cause problems at runtime (say, because we build with GPU-enabled UCX but run with GPU-disabled UCX), so I wanted to stay on the safe side until https://github.com/conda-forge/ucx-split-feedstock/issues/66 is fixed. But perhaps I worry too much.
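
For concreteness, the host-requirement pattern in question looks roughly like this (sketch only):

    requirements:
      host:
        - ucx
        - ucx-proc =*=gpu    # selects the CUDA-enabled UCX variant at build time

The single-build alternative would simply depend on plain ucx in host, without the ucx-proc pin.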

isuruf commented 3 years ago

> It is because UCX changes the internal code for CUDA 10.0+, see the discussion around conda-forge/ucx-split-feedstock#66 (comment).

That is internal code and the public interface is the same, so I don't see any reason to have multiple builds per CUDA version.

> As for the additional pure-CPU build, my reasoning is that we use ucx-proc =*=gpu as one of the host dependencies, and I am not sure whether that could cause problems at runtime (say, because we build with GPU-enabled UCX but run with GPU-disabled UCX), so I wanted to stay on the safe side until conda-forge/ucx-split-feedstock#66 is fixed. But perhaps I worry too much.

That just shows that the public interface is the same, so there's no need for openmpi to have a dependency on GPU-enabled or GPU-disabled UCX.

leofang commented 3 years ago

> That is internal code and the public interface is the same, so I don't see any reason to have multiple builds per CUDA version.

I don't think it is true. The CUDA runtime must have the new cudaLaunchHostFunc API supported.

isuruf commented 3 years ago

> I don't think it is true. The CUDA runtime must have the new cudaLaunchHostFunc API supported.

That's an internal detail of ucx and openmpi doesn't care.

leofang commented 3 years ago

> I don't think it is true. The CUDA runtime must have the new cudaLaunchHostFunc API supported.

> That's an internal detail of ucx and openmpi doesn't care.

@jakirkham @pentschev I think @isuruf made a convincing argument here: we are guarded by ucx handling the CUDA dependency for us. I am under the impression that the handling of CUDA GPU buffers in Open MPI is mutually exclusive, going solely through either ucx or smcuda, so it should not cause issues if at build time Open MPI sees a different CUDA version from the one UCX was built with. I will restore the single build as Isuru suggested.

Copying @jsquyres in case we missed something 😅

leofang commented 3 years ago

I confirmed that the headers are identical across all ucx builds (CPU and GPU, with different CUDA versions), so ucx doesn't do any code-generation magic that writes build-time info (say, CUDA_VERSION) into its headers. We should be good.

pentschev commented 3 years ago

Yeah, I agree this is a UCX implementation detail, therefore the OpenMPI package doesn't need to worry about that.

leofang commented 3 years ago

@conda-forge/openmpi this is ready!

jsquyres commented 3 years ago

FYI -- the long-awaited DDT fix came in for v4.1.x last night (https://github.com/open-mpi/ompi/pull/8837) -- we're making a new 4.1.1RC right now, and may well release 4.1.1 as early as tomorrow. Just in case this influences your thinking...

leofang commented 3 years ago

Thanks for sharing the good news, Jeff! Yeah we will get this rebuild (of v4.1.0) in, and then build v4.1.1 afterwards so it continues to have ucx optionally enabled.

leofang commented 3 years ago

Let's get it in and see how it goes! Thanks @shankar1729 @jakirkham @pentschev @isuruf @jsquyres!

leofang commented 3 years ago

I have verified locally with a very simple MPI code that the package is working. The following combinations were tested:

  1. CPU code, without UCX (mpirun -n 2 ./my_test)
  2. CPU code, with UCX (mpirun -n 2 --mca pml ucx --mca osc ucx ./my_test)
  3. GPU code, without UCX (mpirun -n 2 --mca opal_cuda_support 1 ./my_gpu_test)
  4. GPU code, with UCX (mpirun -n 2 --mca pml ucx --mca osc ucx ./my_gpu_test)

For combination 4, unlike a local build that I tested with, for some reason it just works out of the box without setting UCX_MEMTYPE_CACHE=n.

tl;dr: we are cool