mrmundt opened this issue 1 day ago
Also, FWIW, we did try `conda install cudatoolkit cuda-version=11` to see if we could get past the error, and got this:
>>> from mpi4py import MPI
[hostname:1499766] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.hostname.53181/jf.0/2835742720/shared_mem_cuda_pool.hostname could be created.
[hostname:1499766] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
We aren't sure if this is worth reporting but wanted to let you know that it happens.
@minrk The last build is broken: a dependency on `libcudart.so` slipped into `libmpi.so`. The automated testing from conda-build did not catch the issue, and I'm not sure why (maybe `libcudart.so` exists within `/usr/lib64` in the Docker image).
I'm not sure how to proceed: either we try with `LDFLAGS=-Wl,--as-needed` (aren't these the default?), or we manually patchelf the MPI library to remove the dependency.
$ python -s -c 'from mpi4py import MPI'
Traceback (most recent call last):
File "<string>", line 1, in <module>
from mpi4py import MPI
ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory
$ readelf -d $CONDA_PREFIX/lib/libmpi.so.40 | grep cuda
0x0000000000000001 (NEEDED) Shared library: [libcudart.so.11.0]
$ patchelf --remove-needed libcudart.so.11.0 $CONDA_PREFIX/lib/libmpi.so.40
$ python -s -c 'from mpi4py import MPI'
[kw61149:3113005] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.kw61149.1000/jf.0/577503232/shared_mem_cuda_pool.kw61149 could be created.
[kw61149:3113005] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
Marking the latest build as broken: https://github.com/conda-forge/admin-requests/pull/1164
Maybe there's a module that needs to be added manually to the DSO list that's not on the DSO-by-default list. The simplest solution, I guess, is to go back to `--enable-mca-dso`, since we know that worked, right?
Weirdly, when I started to test, neither the arm nor ppc builds have this. It's only linux-64.
I think we should be able to compare the output of the link check in this build with the latest build to perhaps identify which bundled DSO links cudart.
We can also add a test (liefldd/readelf/etc. or something) to make sure it's not linked, I suppose.
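Something along these lines could work. This is only a minimal sketch, assuming a pytest-style test and that `readelf` is available in the test environment; the library path and name patterns are my guesses, not the feedstock's actual test:

```python
# Hedged sketch of such a check: parse readelf output and assert that
# libmpi.so.40 declares no DT_NEEDED entry on libcudart.
import os
import subprocess


def needed_entries(lib_path):
    """Return the DT_NEEDED entries of an ELF shared object via readelf -d."""
    out = subprocess.run(
        ["readelf", "-d", lib_path],
        capture_output=True, text=True, check=True,
    ).stdout
    # Dynamic section lines look like:
    #  0x0000000000000001 (NEEDED)  Shared library: [libcudart.so.11.0]
    return [
        line.split("[", 1)[1].rstrip().rstrip("]")
        for line in out.splitlines()
        if "(NEEDED)" in line and "[" in line
    ]


def test_libmpi_not_linked_against_cudart():
    lib = os.path.join(os.environ["CONDA_PREFIX"], "lib", "libmpi.so.40")
    offenders = [n for n in needed_entries(lib) if n.startswith("libcudart")]
    assert not offenders, f"libmpi.so.40 is overlinked against {offenders}"
```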
> Weirdly, when I started to test, neither the arm nor ppc builds have this. It's only linux-64.

Maybe some mishandled LDFLAGS?

> We can also add a test (liefldd/readelf/etc. or something) to make sure it's not linked, I suppose.

Definitely.
> Maybe there's a module that needs to be added manually to the DSO list that's not on the DSO-by-default list.

Unlikely; look at the test I posted above. After using patchelf to remove the dependency, things actually work. The dependency on `libcudart.so` does indeed seem to be redundant, although I did not try to run with CUDA to confirm things still work afterwards. I still believe this is just overlinking, maybe a `-Wl,--as-needed` flag that is not being passed down properly.
> The simplest solution, I guess, is to go back to `--enable-mca-dso`, since we know that worked, right?

Makes sense, although maybe there is an easier and proper fix. In any case, enhancements can be done later. I'll go offline for a couple of days; if you have the time, go for it.
FWIW, the reason our tests are passing is that `/usr/local/cuda-11.8/targets/x86_64-linux/lib` is added in `/etc/ld.so.conf`, so it gets loaded by default. If there's an easy way to ignore `ld.so.conf` for a single process, I think our tests would fail as they should, but I'm not sure how to do that. I can, however, write a test that loads libmpi and checks for suspicious DLLs.
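For the record, a rough sketch of what such a test might look like. It assumes Linux (it reads /proc/self/maps) and a pytest-style test, and the suspicious-name prefixes are my own guesses rather than anything the feedstock actually checks:

```python
# Rough sketch of the "load libmpi and check for suspicious DLLs" idea:
# importing mpi4py.MPI pulls in libmpi and everything it links, so scan
# /proc/self/maps afterwards for CUDA libraries that should not be there.
def test_no_cuda_runtime_loaded():
    from mpi4py import MPI  # noqa: F401 -- loads libmpi and its dependencies

    suspicious_prefixes = ("libcudart", "libcuda.")
    loaded = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            fields = line.split()
            # File-backed mappings carry an absolute path in the last field.
            if fields and fields[-1].startswith("/"):
                loaded.add(fields[-1].rsplit("/", 1)[-1])

    offenders = sorted(
        name for name in loaded if name.startswith(suspicious_prefixes)
    )
    assert not offenders, f"unexpected CUDA libraries loaded: {offenders}"
```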
Solution to issue cannot be found in the documentation.
Issue
We have an automated test that suddenly started failing this afternoon due to a missing import. Upon further investigation, we see that there are now changes that cause a failure when attempting to import `mpi4py` in Python. It seems like there is another unconditional dependency required by this change (which, BTW, we did not even realize had happened, because there was no indication via hash or version).
We believe this is a bug / unwanted behavior.
Installed packages
Environment info