EESSI / software-layer

Software layer of the EESSI project
https://eessi.github.io/docs/software_layer
GNU General Public License v2.0

Regression in supported interconnects #63

Open ocaisa opened 3 years ago

ocaisa commented 3 years ago

I was looking at the UCX configuration in 2020.12 and it looks like we have a regression. According to https://github.com/EESSI/compatibility-layer/issues/49#issuecomment-706192572 we should have a UCX configuration like

configure: =========================================================
configure: UCX build configuration:
configure:       Build prefix:   /home/bob/ucx/inst
configure: Preprocessor flags:   -DCPU_FLAGS="|avx" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:         C compiler:   x86_64-pc-linux-gnu-gcc -O3 -g -Wall -Werror -mavx
configure:       C++ compiler:   x86_64-pc-linux-gnu-g++ -O3 -g -Wall -Werror -mavx
configure:       Multi-thread:   enabled
configure:          MPI tests:   disabled
configure:      Devel headers:   no
configure:           Bindings:   < >
configure:        UCT modules:   < ib rdmacm cma >
configure:       CUDA modules:   < >
configure:       ROCM modules:   < >
configure:         IB modules:   < >
configure:        UCM modules:   < >
configure:       Perf modules:   < >
configure: =========================================================

but in the build log for UCX (on Zen2) I see

configure: =========================================================
configure: UCX build configuration:
configure:       Build prefix:   /cvmfs/pilot.eessi-hpc.org/2020.12/software/x86_64/amd/zen2/software/UCX/1.8.0-GCCcore-9.3.0
configure: Preprocessor flags:   -DCPU_FLAGS="|avx" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:         C compiler:   gcc -O3 -g -Wall -Werror -mavx
configure:       C++ compiler:   g++ -O3 -g -Wall -Werror -mavx
configure:       Multi-thread:   enabled
configure:          MPI tests:   disabled
configure:      Devel headers:   no
configure:           Bindings:   < >
configure:        UCT modules:   < ib cma >
configure:       CUDA modules:   < >
configure:       ROCM modules:   < >
configure:         IB modules:   < >
configure:        UCM modules:   < >
configure:       Perf modules:   < >
configure: =========================================================

(note the missing rdmacm)

We should probably explicitly request what we expect from the final build (--with-rdmacm) so that configure fails rather than building without it regardless. UCX in particular is critical to the stack, so it could do with additional checks.

ocaisa commented 3 years ago

To get it to use rdma inside the prefix layer, I needed to explicitly provide the path:

configopts = '--enable-optimizations --enable-cma --enable-mt --with-verbs --with-rdmacm=/cvmfs/pilot.eessi-hpc.org/2020.12/compat/linux/x86_64/usr --with-sysroot=/cvmfs/pilot.eessi-hpc.org/2020.12/compat/linux/x86_64'

Just setting --with-sysroot is not enough. This is probably why you get the support on some archs and not on others: it depends on what is on the host.

ocaisa commented 3 years ago

This is not really an issue with the prefix layer, so I'm going to move it.

ocaisa commented 3 years ago

I also checked libfabric, and in its configure output I see

checking for sysroot... no

which I am also suspicious about.

bedroge commented 3 years ago

I checked our 2020.10 installation, and it has the same issue / configuration output. The output in the comment at https://github.com/EESSI/compatibility-layer/issues/49#issuecomment-706192572 is from a manual UCX installation where I did indeed explicitly pass --with-rdmacm to configure, so we should somehow pass this to our UCX installation as well (using a hook?).
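For illustration, the hook idea could look roughly like the sketch below. It is not the fix that was actually applied. It assumes EasyBuild's hook mechanism (a Python file passed via --hooks that defines pre_configure_hook) and assumes that the EPREFIX environment variable points at the root of the compatibility layer; the fallback to a bare --with-rdmacm is also an assumption.

```python
import os


def pre_configure_hook(self, *args, **kwargs):
    """Ask UCX's configure for rdmacm explicitly instead of letting it guess."""
    if self.name == 'UCX':
        # Assumption: EPREFIX is set to the compat layer root, e.g.
        # /cvmfs/pilot.eessi-hpc.org/2020.12/compat/linux/x86_64
        eprefix = os.environ.get('EPREFIX')
        if eprefix:
            rdmacm_opt = '--with-rdmacm=%s' % os.path.join(eprefix, 'usr')
        else:
            # No prefix known: still request rdmacm, without a path.
            rdmacm_opt = '--with-rdmacm'
        # Appends to whatever configopts the easyconfig already defines.
        self.cfg.update('configopts', rdmacm_opt)
```

With the flag requested explicitly, the intent from the first comment applies: a build on a host where rdmacm cannot be found should stop at configure instead of producing a UCX without it.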

bedroge commented 3 years ago

I see that the configure scripts of both libfabric and UCX accept a --with-sysroot flag:

  --with-sysroot=DIR Search for dependent libraries within DIR
                        (or the compiler's sysroot if not specified).

As the compiler has been configured with --with-sysroot set to the prefix, I assume we don't necessarily have to use this flag for these packages.

bedroge commented 3 years ago

This has been fixed in 2021.03:

configure:        UCT modules:   < ib rdmacm cma >

We should still have some (ReFrame?) test for this to make sure that UCX is always correctly configured in future versions, though, so let's leave this issue open so that we don't forget about it.
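As a starting point for such a test, here is a minimal sketch in plain Python (not a ReFrame test) that checks the "UCT modules" line of the configure summary shown in this thread. The function names and the default list of expected modules are assumptions made for illustration; how the configure output is obtained from a build log is left open.

```python
import re


def uct_modules(configure_output):
    """Return the set of UCT modules listed in UCX's configure summary."""
    match = re.search(r'UCT modules:\s*<([^>]*)>', configure_output)
    if match is None:
        raise ValueError('no "UCT modules" line found in configure output')
    return set(match.group(1).split())


def missing_uct_modules(configure_output, expected=('ib', 'rdmacm', 'cma')):
    """Return the expected modules that are absent, sorted by name."""
    return sorted(set(expected) - uct_modules(configure_output))


# The two summary lines quoted in this issue.
good = 'configure:        UCT modules:   < ib rdmacm cma >'
bad = 'configure:        UCT modules:   < ib cma >'

print(missing_uct_modules(good))  # []
print(missing_uct_modules(bad))   # ['rdmacm']
```

A check like this only looks at what configure reported, so it would catch the regression described here but says nothing about whether the interconnect works at run time.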