YarShev opened this issue 1 year ago
@jsquyres Sorry, I need your input again.
I had a quick look at ompi's configure.ac, and now I realize what's going on. configure uses a try_run check for CMA availability. This is suboptimal, as the build machine is not necessarily the runtime machine.
That's the problem with the conda-forge binaries: they are built in Azure Pipelines within a container without the required support. I'm not an autotools expert, but looking at ompi's config/opal_check_cma.m4, it looks like forcing CMA with --with-cma=yes would be ineffective; configure would stop with an error.
The opposite situation can also happen: you use the openmpi package from a Linux distro that was built on bare metal with CMA enabled, but then you run apps within a container; things do not work out of the box, and CMA has to be explicitly disabled (example).
Would it be possible for Open MPI to move the try_run check to runtime, and if the syscalls cannot be used, disable CMA at runtime (maybe with a warning)? I understand other things work that way (e.g. CUDA support), so why not CMA support?
The vast majority of Open MPI's infrastructure assumes that the environment where configure is run is the same as where MPI jobs will be run. Writing it that way -- which is the same as most other Linux libraries and applications -- allows Open MPI to just -lfoo to link in dependent libraries and let the run-time linker handle all of the chasing down of libraries at run time (which is both exceedingly complicated and exactly what the run-time linker is for).
The alternative is for Open MPI to essentially duplicate the behavior of the run-time linker (via dlopen() calls and the searching of paths, replicating RPATH and RUNPATH behavior, etc.). It would be a massive undertaking to switch from a -lfoo-based architecture to an architecture based on dlopen() + dlsym() + invoke all functions via function pointers + look up all globals by name/pointer at run time. We'd also have to replicate all function prototypes of the foo library, because libraries' foo.h header files tend to declare functions, not function pointers. This is a difficult approach, especially when replicated across the dozens of libraries that Open MPI can link to / utilize at run time.
Put differently: this is simply the nature of distributing binaries. You're building binaries in environment X and hoping that they are suitable for environment Y. In many (most?) cases, it's good enough. You've unfortunately run into a corner case where there's a pretty big performance impact because X != Y.
That being said, it is true that Open MPI has one glaring exception to what I said above: CUDA. This was a bit of a debate in Open MPI at the time when it was developed, for the reasons I cited above. However, we ultimately did implement a much-simplified "load CUDA at run time" mechanism for two reasons:
I will say that it took a number of iterations before the "load CUDA at run-time" code worked in all cases. It's difficult code to write, and is even more tricky to maintain over time (as APIs are added, removed, or even -- shudder -- changed).
@jsquyres I totally understand your point. However, maybe the case of CMA is extremely simple, much simpler than CUDA. CMA can be invoked via syscalls; look at your own opal/include/opal/sys/cma.h header file. So you do not really need any -lfoo or dlsym(), just a syscall to the kernel. At component load time, you execute something very similar to what you have in configure: you try to memcopy (within the same process pid) from one small buffer to the other, and you check the syscall return code (and maybe whether the copy succeeded) to declare CMA available or not. The other required change is in configure: instead of try_run, you should just use try_compile.
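For illustration only, a minimal sketch of such a run-time probe, assuming the relevant CMA syscall is process_vm_readv invoked directly via syscall() and that copying a small buffer within the calling process is an adequate test; this is not Open MPI's actual implementation:

```c
/* Hypothetical run-time CMA availability probe: copy a small buffer from
 * this process to itself via the process_vm_readv syscall and check the
 * result.  Linux-only. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

static int cma_available(void)
{
    char src[32] = "cma-probe";
    char dst[32] = {0};
    struct iovec local  = { .iov_base = dst, .iov_len = sizeof(dst) };
    struct iovec remote = { .iov_base = src, .iov_len = sizeof(src) };

    /* Read from our own address space (pid = getpid()).  If the kernel
     * lacks CMA support or a container/seccomp policy blocks the syscall,
     * this fails (e.g. ENOSYS or EPERM) and CMA should be disabled. */
    ssize_t n = syscall(SYS_process_vm_readv, getpid(),
                        &local, 1UL, &remote, 1UL, 0UL);
    if (n < 0)
        return 0;
    return n == (ssize_t)sizeof(src) && memcmp(src, dst, sizeof(src)) == 0;
}

int main(void)
{
    printf("CMA %s\n", cma_available() ? "available" : "not available");
    return 0;
}
```

A component could run such a check once at load time and simply deselect the CMA path (perhaps with a warning) when it fails.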
Anyway, I'm doing to @jsquyres what I hate when others do it to me: asking for features without offering my time to work on them. If @YarShev is willing to offer some of his time to work on this, then I may consider working on a patch to submit upstream. Otherwise, I'm not going to push this any further; I don't have a strong personal interest in it.
@dalcinl, @jsquyres, thanks a lot for your thoughts. Unfortunately, at the moment I do not have time to help you push this further. If the issue is not that simple to resolve, we can wait until users get stuck on it and request a fix with higher priority.
FWIW there are ucx builds of openmpi. ucx does build with cma as one of the transports and has the dlopen logic that Lisandro alluded to above. So this may be one avenue to get things working.
Currently ucx is only enabled in the CUDA builds, but we could generalize that to all builds ( https://github.com/conda-forge/openmpi-feedstock/issues/119 ) (thanks to recent ucx package improvements!). Started working on this in PR ( https://github.com/conda-forge/openmpi-feedstock/pull/121 ), but running into some CI issues atm.
Still, it should be possible to use the CUDA builds of openmpi in the interim while the issue above gets sorted out.
As part of our work on unidist we took measurements for 1. Open MPI built from source, 2. ucx-enabled Open MPI from conda-forge, and 3. ucx-disabled Open MPI from conda-forge. The order in which I listed them matches their performance: Open MPI built from source is the fastest, then comes ucx-enabled Open MPI from conda-forge, and ucx-disabled Open MPI from conda-forge is slower than both of the former.
I wonder which transport the ucx-enabled Open MPI from conda-forge uses?
Solution to issue cannot be found in the documentation.
Issue
I installed openmpi from conda-forge and it doesn't have support for cross-memory attach.