ResearchComputing / core-software

Documentation and automation for provisioning the core software environment at University of Colorado Boulder Research Computing

openmpi/3 build does not integrate with srun #5

Open anderbubble opened 4 years ago

anderbubble commented 4 years ago

We also found that ompi/3.x will not integrate with srun (whereas ompi/2.x will). The error message when trying to run ompi/3.x executables with srun is:

An error occurred in MPI_Init ("[Slurm] version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.")

https://rmacc.slack.com/archives/CTCKASMGB/p1587676134244700

Originally:

Compile openmpi/3.1.4 with the --with-pmi flag to ensure proper integration with Slurm srun

anderbubble commented 4 years ago

I have confirmed that openmpi/3 is already compiled with pmi support (note +pmi in the spec):

openmpi@3.1.5%gcc@8.4.0~cuda+cxx_exceptions fabrics=libfabric ~gpfs~java+legacylaunchers~memchecker+pmi schedulers=slurm ~sqlite3~thread_multiple+vt arch=linux-rhel7-haswell
    ^hwloc@1.11.11%gcc@8.4.0~cairo~cuda~gl+libxml2~nvml+pci+shared arch=linux-rhel7-haswell
        ^libpciaccess@0.13.5%gcc@8.4.0 arch=linux-rhel7-haswell
        ^libxml2@2.9.9%gcc@8.4.0~python arch=linux-rhel7-haswell
            ^libiconv@1.16%gcc@8.4.0 arch=linux-rhel7-haswell
            ^xz@5.2.5%gcc@8.4.0 arch=linux-rhel7-haswell
            ^zlib@1.2.11%gcc@8.4.0+optimize+pic+shared arch=linux-rhel7-haswell
        ^numactl@2.0.12%gcc@8.4.0 arch=linux-rhel7-haswell
    ^libfabric@1.9.1%gcc@8.4.0 fabrics=sockets,tcp,udp ~kdreg arch=linux-rhel7-haswell
    ^slurm@19-05-5-1%gcc@8.4.0~gtk~hdf5~hwloc~mariadb~pmix+readline arch=linux-rhel7-haswell
        ^curl@7.68.0%gcc@8.4.0~darwinssl~gssapi~libssh~libssh2~nghttp2 arch=linux-rhel7-haswell
            ^openssl@1.1.1e%gcc@8.4.0+systemcerts arch=linux-rhel7-haswell
        ^glib@2.56.3%gcc@8.4.0~libmount patches=c325997b72a205ad1638bb3e3ba0e5b73e3d32ce63b2d0d3282f3e3a2ff4663c tracing=none arch=linux-rhel7-haswell
            ^gettext@0.20.1%gcc@8.4.0+bzip2+curses+git~libunistring+libxml2+tar+xz arch=linux-rhel7-haswell
                ^bzip2@1.0.8%gcc@8.4.0+shared arch=linux-rhel7-haswell
                ^ncurses@6.2%gcc@8.4.0~symlinks+termlib arch=linux-rhel7-haswell
                ^tar@1.32%gcc@8.4.0 arch=linux-rhel7-haswell
            ^libffi@3.2.1%gcc@8.4.0 arch=linux-rhel7-haswell
            ^pcre@8.43%gcc@8.4.0~jit+multibyte+utf arch=linux-rhel7-haswell
            ^perl@5.30.1%gcc@8.4.0+cpanm+shared+threads arch=linux-rhel7-haswell
                ^gdbm@1.18.1%gcc@8.4.0 arch=linux-rhel7-haswell
                    ^readline@8.0%gcc@8.4.0 arch=linux-rhel7-haswell
            ^python@3.7.6%gcc@8.4.0+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4~uuid+zlib arch=linux-rhel7-haswell
                ^expat@2.2.9%gcc@8.4.0+libbsd arch=linux-rhel7-haswell
                    ^libbsd@0.10.0%gcc@8.4.0 arch=linux-rhel7-haswell
                ^sqlite@3.30.1%gcc@8.4.0+column_metadata+fts~functions~rtree arch=linux-rhel7-haswell
        ^json-c@0.13.1%gcc@8.4.0 arch=linux-rhel7-haswell
        ^lz4@1.9.2%gcc@8.4.0 arch=linux-rhel7-haswell
        ^munge@0.5.14%gcc@8.4.0 localstatedir=PREFIX/var arch=linux-rhel7-haswell
            ^libgcrypt@1.8.5%gcc@8.4.0 arch=linux-rhel7-haswell
                ^libgpg-error@1.37%gcc@8.4.0 arch=linux-rhel7-haswell

So I'm pretty sure the issue is that the linked slurm doesn't have pmi support: its spec shows ~pmix.

slurm@19-05-5-1%gcc@8.4.0~gtk~hdf5~hwloc~mariadb~pmix+readline arch=linux-rhel7-haswell
    ^curl@7.68.0%gcc@8.4.0~darwinssl~gssapi~libssh~libssh2~nghttp2 arch=linux-rhel7-haswell
        ^openssl@1.1.1e%gcc@8.4.0+systemcerts arch=linux-rhel7-haswell
            ^zlib@1.2.11%gcc@8.4.0+optimize+pic+shared arch=linux-rhel7-haswell
    ^glib@2.56.3%gcc@8.4.0~libmount patches=c325997b72a205ad1638bb3e3ba0e5b73e3d32ce63b2d0d3282f3e3a2ff4663c tracing=none arch=linux-rhel7-haswell
        ^gettext@0.20.1%gcc@8.4.0+bzip2+curses+git~libunistring+libxml2+tar+xz arch=linux-rhel7-haswell
            ^bzip2@1.0.8%gcc@8.4.0+shared arch=linux-rhel7-haswell
            ^libxml2@2.9.9%gcc@8.4.0~python arch=linux-rhel7-haswell
                ^libiconv@1.16%gcc@8.4.0 arch=linux-rhel7-haswell
                ^xz@5.2.5%gcc@8.4.0 arch=linux-rhel7-haswell
            ^ncurses@6.2%gcc@8.4.0~symlinks+termlib arch=linux-rhel7-haswell
            ^tar@1.32%gcc@8.4.0 arch=linux-rhel7-haswell
        ^libffi@3.2.1%gcc@8.4.0 arch=linux-rhel7-haswell
        ^pcre@8.43%gcc@8.4.0~jit+multibyte+utf arch=linux-rhel7-haswell
        ^perl@5.30.1%gcc@8.4.0+cpanm+shared+threads arch=linux-rhel7-haswell
            ^gdbm@1.18.1%gcc@8.4.0 arch=linux-rhel7-haswell
                ^readline@8.0%gcc@8.4.0 arch=linux-rhel7-haswell
        ^python@3.7.6%gcc@8.4.0+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4~uuid+zlib arch=linux-rhel7-haswell
            ^expat@2.2.9%gcc@8.4.0+libbsd arch=linux-rhel7-haswell
                ^libbsd@0.10.0%gcc@8.4.0 arch=linux-rhel7-haswell
            ^sqlite@3.30.1%gcc@8.4.0+column_metadata+fts~functions~rtree arch=linux-rhel7-haswell
    ^json-c@0.13.1%gcc@8.4.0 arch=linux-rhel7-haswell
    ^lz4@1.9.2%gcc@8.4.0 arch=linux-rhel7-haswell
    ^munge@0.5.14%gcc@8.4.0 localstatedir=PREFIX/var arch=linux-rhel7-haswell
        ^libgcrypt@1.8.5%gcc@8.4.0 arch=linux-rhel7-haswell
            ^libgpg-error@1.37%gcc@8.4.0 arch=linux-rhel7-haswell
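
The comparison between the two specs can be checked mechanically rather than by eye. The following is a minimal sketch, assuming that boolean variants only ever appear as +name or ~name and that key=value tokens can be skipped; parse_bool_variants is a hypothetical helper written for this illustration and is not part of Spack.

```python
import re

# Matches a boolean variant flag such as "+pmi" or "~pmix".
# Assumption: variant names consist of letters, digits and underscores only.
_FLAG = re.compile(r"([+~])(\w+)")


def parse_bool_variants(spec_line):
    """Map each boolean variant in one spec line to True (+) or False (~)."""
    variants = {}
    for token in spec_line.split():
        if "=" in token:
            # key=value variants (fabrics=..., arch=...) are not boolean flags
            continue
        for sign, name in _FLAG.findall(token):
            variants[name] = (sign == "+")
    return variants


# Root spec lines copied from the two listings in this comment.
OPENMPI = (
    "openmpi@3.1.5%gcc@8.4.0~cuda+cxx_exceptions fabrics=libfabric "
    "~gpfs~java+legacylaunchers~memchecker+pmi schedulers=slurm "
    "~sqlite3~thread_multiple+vt arch=linux-rhel7-haswell"
)
SLURM = (
    "slurm@19-05-5-1%gcc@8.4.0~gtk~hdf5~hwloc~mariadb~pmix+readline "
    "arch=linux-rhel7-haswell"
)

openmpi_variants = parse_bool_variants(OPENMPI)
slurm_variants = parse_bool_variants(SLURM)

print("openmpi built with +pmi:", openmpi_variants["pmi"])   # True
print("slurm built with +pmix:", slurm_variants["pmix"])     # False
```

This only reads the spec text; it does not inspect the installed libraries, so it confirms what the specs claim and nothing more.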

anderbubble commented 4 years ago

Believed fixed in 9fc0594ee8bf4e17a44de764549be6cb91f88f31.