conda-forge / openmpi-feedstock

A conda-smithy repository for openmpi.
BSD 3-Clause "New" or "Revised" License
10 stars 25 forks source link

fix and test for improper cuda linkage #186

Closed minrk closed 1 week ago

minrk commented 1 week ago

doesn't fix anything yet, so the new test should fail

cuda is available and findable from the image with ld.so.conf, so we have to do our own checks to make sure we don't load cuda from outside the env.

One thing that's weird is that libcudart is linked in the new no-mca-dso builds, while the old builds only ever linked libcuda

closes #185

conda-forge-admin commented 1 week ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

minrk commented 1 week ago

I think this check also works to skip any customization of the ld.so.cache in the image:

> ld=/lib64/ld-linux*
> $ld --inhibit-cache $PREFIX/lib/libmpi.so
/opt/conda/lib/libmpi.so: error while loading shared libraries: libcudart.so.11.0: cannot open shared object file: No such file or directory
dalcinl commented 1 week ago

I think this check also works to skip any customization of the ld.so.cache in the image:

But what about other common system libraries in the cache?

minrk commented 1 week ago

Unfortunately, I don't think the simple ld test will work. While it does successfully raise the error for the missing file, when run on libraries that should load it segfaults after listing dependencies (even without the disabled cache). I'm not sure why that would be, so maybe i'll stick to the Python test.

minrk commented 1 week ago

Looks like we're hitting the same issue as https://github.com/open-mpi/ompi/pull/9145 which fixed it for libcuda but not libcudart where romio would link cuda even if not using it. And we got unlucky in the ucx update because the build image was updated to make cudart available when it wasn't before, and this is unrelated to those changes.

I think the reason --as-needed isn't doing what we expect is the argument order. The link command is:

libtool: link: x86_64-conda-linux-gnu-cc -shared  -fPIC -DPIC  $lots_of.o  -Wl,--whole-archive $lots_of.a -Wl,--no-whole-archive  -Wl,-rpath -Wl,$WORK/opal/.libs -Wl,-rpath -Wl,$WORK/3rd-party/openpmix/src/.libs -Wl,-rpath -Wl,$PREFIX/lib -L$WORK/3rd-party/openpmix/src/.libs -L$PREFIX/lib -lucc -lcudart $WORK/opal/.libs/libopen-pal.so -lucp -lucs -lucm -luct -lpthread -lrt $WORK/3rd-party/openpmix/src/.libs/libpmix.so -lm -lutil -ldl -levent_core -levent_pthreads -lhwloc  -march=nocona -mtune=haswell -fstack-protector-strong -O2 -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z -Wl,relro -Wl,-z -Wl,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath -Wl,$PREFIX/lib -Wl,-rpath-link -Wl,$PREFIX/lib   -pthread -Wl,-soname -Wl,libmpi.so.40 -o .libs/libmpi.so.40.40.5

where -Wl,--as-needed comes after linking cudart, ucc, etc. I think --as-needed only affects libraries that come after it, so it's not touching lcudart, etc.. I believe this means openmpi puts $LDFLAGS in the wrong place in their link commands, but I can never say things with too much confidence when it comes to linking.

dalcinl commented 1 week ago

OK, you fixed the build, but your new Python tests are not good. Can't we just use something simpler, like running ldd on libmpi.so, then grep for libcuda, and fail if the thing is listed?

dalcinl commented 1 week ago

What about something like this:

if [[ $(ldd $CONDA_PREFIX/lib/libmpi.so | grep -cE '/libcuda.*\.so') -gt 0 ]]; then
    echo "improper dependency on CUDA shared libraries"
    ldd $CONDA_PREFIX/lib/libmpi.so
fi
minrk commented 1 week ago

I can do that. It's far narrower and much less robust than the Python test, but it would happen to catch this particular failure (the Python tests cover what is actually loaded, whereas the ldd test requires that you already know precisely which .so is going to be wrong).

dalcinl commented 1 week ago

BTW, the ldd approach will not work on cross-compilation. patchelf --print-needed libmpi.so would do.

minrk commented 1 week ago

Switched to your ldd test, which I've confirmed by hand catch the broken build 104. It doesn't actually check if cuda will be loaded anymore, but it at least checks the one way in which cuda might be linked improperly that we are seeing today.

minrk commented 1 week ago

tested build 105 and it loads fine without cuda.