YarShev opened this issue 1 year ago
@jsquyres Sorry, I need your input again.
I had a quick look at ompi's configure.ac, and now I realize what's going on. configure uses a try_run check for CMA availability. This is suboptimal, as the build machine is not necessarily the runtime machine.
That's the problem with the conda-forge binaries: they are built in Azure Pipelines within a container without the required support. I'm not an autotools expert, but looking at ompi's config/opal_check_cma.m4, it looks like forcing CMA with --with-cma=yes would be ineffective; configure would stop with an error.
The opposite situation can also happen: you use the openmpi package from a Linux distro that was built on bare metal with CMA enabled, but then you run apps within a container; things do not work out of the box, and CMA has to be explicitly disabled (example).
Would it be possible for Open MPI to move the try_run check to runtime, and if the syscalls cannot be used, disable CMA at runtime (maybe with a warning)? I understand other things work that way (e.g. CUDA support), so why not CMA support?
The vast majority of Open MPI's infrastructure assumes that the environment where configure is run is the same as where MPI jobs will be run. Writing it that way -- which is the same as most other Linux libraries and applications -- allows Open MPI to just -lfoo to link in dependent libraries and let the run-time linker handle all of the chasing down of libraries at run time (which is both exceedingly complicated and exactly what the run-time linker is for).
The alternative is for Open MPI to essentially duplicate the behavior of the run-time linker (via dlopen() calls and the searching of paths, replicating RPATH and RUNPATH behavior, etc.). It would be a massive undertaking to switch from a -lfoo-based architecture to an architecture based on dlopen() + dlsym() + invoke all functions via function pointers + look up all globals by name/pointer at run time. We'd also have to replicate all function prototypes of the foo library, because libraries' foo.h header files tend to declare functions, not function pointers. This is a difficult approach, especially when replicated across the dozens of libraries that Open MPI can link to / utilize at run time.
Put differently: this is simply the nature of distributing binaries. You're building binaries in environment X and hoping that they are suitable for environment Y. In many (most?) cases, it's good enough. You've unfortunately run into a corner case where there's a pretty big performance impact because X != Y.
That being said, it is true that Open MPI has one glaring exception to what I said above: CUDA. This was a bit of a debate in Open MPI at the time when it was developed, for the reasons I cited above. However, we ultimately did implement a much-simplified "load CUDA at run time" mechanism for two reasons:
I will say that it took a number of iterations before the "load CUDA at run-time" code worked in all cases. It's difficult code to write, and is even more tricky to maintain over time (as APIs are added, removed, or even -- shudder -- changed).
@jsquyres I totally understand your point. However, maybe the case of CMA is extremely simple, much simpler than CUDA. CMA can be invoked via syscalls; look at your own opal/include/opal/sys/cma.h header file. So you do not really need any -lfoo or dlsym(), just a syscall to the kernel. At component load time, you execute something very similar to what you have in configure: you try to memcopy (within the same process pid) from one small buffer to the other, and you check the syscall return code (and maybe whether the copy succeeded) to declare CMA available or not. The other required change is in configure: instead of try_run, you should just use try_compile.
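For illustration only, a minimal sketch of such a run-time probe, assuming the relevant CMA syscall is process_vm_readv invoked directly via syscall() and that copying a small buffer within the calling process is an adequate test; this is not Open MPI's actual implementation:

```c
/* Hypothetical run-time CMA availability probe: copy a small buffer from
 * this process to itself via the process_vm_readv syscall and check the
 * result.  Linux-only. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

static int cma_available(void)
{
    char src[32] = "cma-probe";
    char dst[32] = {0};
    struct iovec local  = { .iov_base = dst, .iov_len = sizeof(dst) };
    struct iovec remote = { .iov_base = src, .iov_len = sizeof(src) };

    /* Read from our own address space (pid = getpid()).  If the kernel
     * lacks CMA support or a container/seccomp policy blocks the syscall,
     * this fails (e.g. ENOSYS or EPERM) and CMA should be disabled. */
    ssize_t n = syscall(SYS_process_vm_readv, getpid(),
                        &local, 1UL, &remote, 1UL, 0UL);
    if (n < 0)
        return 0;
    return n == (ssize_t)sizeof(src) && memcmp(src, dst, sizeof(src)) == 0;
}

int main(void)
{
    printf("CMA %s\n", cma_available() ? "available" : "not available");
    return 0;
}
```

A component could run such a check once at load time and simply deselect the CMA path (perhaps with a warning) when it fails.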
Anyway, I'm doing to @jsquyres what I hate when others do it to me: asking for features without offering my time to work on them. If @YarShev is willing to offer some of his time to work on this, then I may consider working on a patch to submit upstream. Otherwise, I'm not going to push this any further; I don't have a strong personal interest in it.
@dalcinl, @jsquyres, thanks a lot for your thoughts. Unfortunately, at the moment I do not have time to help you push this further. If the issue is not that simple to resolve, we can wait until users get stuck on it and request a fix with higher priority.
FWIW there are ucx builds of openmpi. ucx does build with cma as one of the transports and has the dlopen logic that Lisandro alluded to above. So this may be one avenue to get things working.
Currently ucx is only enabled in the CUDA builds, but we could generalize that to all builds ( https://github.com/conda-forge/openmpi-feedstock/issues/119 ) (thanks to recent ucx package improvements!). Started working on this in PR ( https://github.com/conda-forge/openmpi-feedstock/pull/121 ), but running into some CI issues atm.
Still, it should be possible to use the CUDA builds of openmpi in the interim while the issue above gets sorted out.
As part of our work on unidist we took measurements for 1. Open MPI built from source, 2. ucx-enabled Open MPI from conda-forge, and 3. ucx-disabled Open MPI from conda-forge. The order in which I listed them matches their performance: Open MPI built from source is the fastest, then comes ucx-enabled Open MPI from conda-forge, and ucx-disabled Open MPI from conda-forge is slower than both of the former.
I wonder which transport the ucx-enabled Open MPI from conda-forge uses?
Solution to issue cannot be found in the documentation.
Issue
I installed openmpi from conda-forge and it doesn't have support for cross-memory attach.