Closed — minrk closed this 1 week ago
Hi! This is the friendly automated conda-forge-linting service.
I just wanted to let you know that I linted all conda-recipes in your PR (`recipe/meta.yaml`) and found it was in an excellent condition.
I think this check also works to skip any customization of the ld.so.cache in the image:
```
ld=/lib64/ld-linux*
$ld --inhibit-cache $PREFIX/lib/libmpi.so
/opt/conda/lib/libmpi.so: error while loading shared libraries: libcudart.so.11.0: cannot open shared object file: No such file or directory
```
> I think this check also works to skip any customization of the ld.so.cache in the image:
But what about other common system libraries in the cache?
Unfortunately, I don't think the simple ld test will work. While it does successfully raise the error for the missing file, when run on libraries that should load, it segfaults after listing dependencies (even without the cache disabled). I'm not sure why that is, so maybe I'll stick to the Python test.
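For the record, glibc's `LD_DEBUG` facility is one shell-level way to observe what the loader actually maps at run time (closer in spirit to the Python test than to a static `ldd` scan). A minimal sketch, with `/bin/true` standing in for a real MPI program:

```shell
# LD_DEBUG=libs makes the glibc dynamic loader report every library it
# searches for and maps at run time (output goes to stderr).
# /bin/true is just a stand-in target for this demo.
LD_DEBUG=libs /bin/true 2>&1 | grep 'find library' | head -n 5
```

Unlike running the loader directly on a shared library, this executes a normal program, so it doesn't hit the segfault above.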
Looks like we're hitting the same issue as https://github.com/open-mpi/ompi/pull/9145, which fixed it for `libcuda` but not `libcudart`, where romio would link CUDA even if not using it. And we got unlucky in the ucx update: the build image was updated to make cudart available when it wasn't before, so the breakage is unrelated to the ucx changes themselves.
I think the reason `--as-needed` isn't doing what we expect is the argument order. The link command is:
```
libtool: link: x86_64-conda-linux-gnu-cc -shared -fPIC -DPIC $lots_of.o -Wl,--whole-archive $lots_of.a -Wl,--no-whole-archive -Wl,-rpath -Wl,$WORK/opal/.libs -Wl,-rpath -Wl,$WORK/3rd-party/openpmix/src/.libs -Wl,-rpath -Wl,$PREFIX/lib -L$WORK/3rd-party/openpmix/src/.libs -L$PREFIX/lib -lucc -lcudart $WORK/opal/.libs/libopen-pal.so -lucp -lucs -lucm -luct -lpthread -lrt $WORK/3rd-party/openpmix/src/.libs/libpmix.so -lm -lutil -ldl -levent_core -levent_pthreads -lhwloc -march=nocona -mtune=haswell -fstack-protector-strong -O2 -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z -Wl,relro -Wl,-z -Wl,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath -Wl,$PREFIX/lib -Wl,-rpath-link -Wl,$PREFIX/lib -pthread -Wl,-soname -Wl,libmpi.so.40 -o .libs/libmpi.so.40.40.5
```
where `-Wl,--as-needed` comes after linking cudart, ucc, etc. I think `--as-needed` only affects libraries that come after it on the command line, so it's not touching `-lcudart`, etc. I believe this means openmpi puts $LDFLAGS in the wrong place in their link commands, but I can never say things with too much confidence when it comes to linking.
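If I understand the positional behavior right, it can be reproduced with a toy link. A hedged sketch, assuming gcc and ldd on a glibc Linux box (file names here are invented for the demo):

```shell
cat > asneeded_demo.c <<'EOF'
int main(void) { return 0; }
EOF

# -lm appears BEFORE --as-needed: the flag arrives too late to drop it,
# analogous to -lcudart appearing before -Wl,--as-needed above.
# --no-as-needed is forced first so distro default specs don't interfere.
gcc asneeded_demo.c -Wl,--no-as-needed -lm -Wl,--as-needed -o demo_before

# -lm appears AFTER --as-needed: unused, so its DT_NEEDED entry is dropped.
gcc asneeded_demo.c -Wl,--as-needed -lm -o demo_after

echo "before:"; ldd demo_before | grep libm || echo "  (no libm)"
echo "after:";  ldd demo_after  | grep libm || echo "  (no libm)"
```

Presumably the same reordering, moving `-Wl,--as-needed` ahead of the `-l` flags, would let the linker drop the unused `-lcudart` here.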
OK, you fixed the build, but your new Python tests are not good. Can't we just use something simpler, like running `ldd` on `libmpi.so`, grepping for `libcuda`, and failing if it is listed?
What about something like this:
```shell
if [[ $(ldd "$CONDA_PREFIX/lib/libmpi.so" | grep -cE '/libcuda.*\.so') -gt 0 ]]; then
  echo "improper dependency on CUDA shared libraries"
  ldd "$CONDA_PREFIX/lib/libmpi.so"
  exit 1
fi
```
I can do that. It's far narrower and much less robust than the Python test, but it would happen to catch this particular failure (the Python tests cover what is actually loaded, whereas the ldd test requires that you already know precisely which .so is going to be wrong).
BTW, the ldd approach will not work under cross-compilation. `patchelf --print-needed libmpi.so` would do.
Switched to your ldd test, which I've confirmed by hand catches the broken build 104. It doesn't actually check whether CUDA will be loaded anymore, but it at least checks the one way in which CUDA might be linked improperly that we are seeing today.
Tested build 105 and it loads fine without CUDA.
doesn't fix anything yet, so the new test should fail
CUDA is available and findable from the image via ld.so.conf, so we have to do our own checks to make sure we don't load CUDA from outside the env.
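That "nothing loaded from outside the env" check could be sketched roughly like this (a hedged sketch only: in the real recipe the target would be `$CONDA_PREFIX/lib/libmpi.so` with `$CONDA_PREFIX` as the allowed prefix; `/bin/sh` and a placeholder prefix stand in so the demo is self-contained):

```shell
# List each dependency's resolved path and flag any that live outside
# the allowed prefix (i.e. pulled in via the image's ld.so.conf).
target=/bin/sh
allowed="${CONDA_PREFIX:-/nonexistent}"
outside=$(ldd "$target" | awk '/=> \// {print $3}' | grep -v "^$allowed/" || true)
if [ -n "$outside" ]; then
  echo "libraries resolved outside $allowed:"
  echo "$outside"
fi
```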
One thing that's weird is that `libcudart` is linked in the new no-mca-dso builds, while the old builds only ever linked `libcuda`.
closes #185