fedora-cloud / docker-brew-fedora

mpi4py.MPI crashes on import in fedora:38 #112

Open spossann opened 11 months ago

spossann commented 11 months ago

I have been using mpi4py with the image fedora:37 for a while without issues. However, with fedora:38 (and later), importing MPI from mpi4py prints PSM3 errors and then hangs. The steps to reproduce are as follows.

I build the following image via docker build -t fedora_test -f docker/fedora.dockerfile .:

# docker/fedora.dockerfile
FROM fedora:38

# Install all packages in a single dnf transaction
RUN dnf install -y \
        python3-pip \
        gcc \
        gfortran \
        blas-devel lapack-devel \
        openmpi openmpi-devel \
        libgomp \
        git \
        environment-modules \
        python3-mpi4py-openmpi \
        python3-devel

I then run a container with docker run -it fedora_test and load the openmpi module:

module load mpi/openmpi-x86_64

I then create and activate a Python virtual environment

python3 -m venv env
source env/bin/activate

and install mpi4py:

pip install mpi4py

Upon launching python3 and importing MPI from mpi4py, I get the following error:

Python 3.11.5 (main, Aug 28 2023, 00:00:00) [GCC 13.2.1 20230728 (Red Hat 13.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from mpi4py import MPI
64ad81b28a4c:pid259.python3: Failed to get eth0 (unit 0) cpu set
64ad81b28a4c:pid259: PSM3 can't open nic unit: 0 (err=23)
PMIx Log Report:[259]: (nic/PSM)[259]: PSM3 can't open nic unit: 0 (err=23)
64ad81b28a4c:pid259.python3: Failed to get eth0 (unit 0) cpu set
64ad81b28a4c:pid259: PSM3 can't open nic unit: 0 (err=23)
PMIx Log Report:[259]: (nic/PSM)[259]: PSM3 can't open nic unit: 0 (err=23)

The program is stuck from there on. All of the above steps work fine with the image fedora:37.
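
Because the import hangs rather than exiting, it can be useful to run it in a separate interpreter with a timeout when comparing images. The following is a minimal sketch, not part of the original report: the helper name check_import and the 30-second default are hypothetical choices, and it only distinguishes a clean exit, a non-zero exit, and a timeout.

```python
import subprocess
import sys


def check_import(statement="from mpi4py import MPI", timeout=30.0):
    """Run `statement` in a fresh interpreter and classify the outcome.

    Returns "ok" on exit code 0, "error" on a non-zero exit code,
    and "hang" if the child does not finish within `timeout` seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", statement],
            capture_output=True,  # keep PSM3/PMIx messages out of the terminal
            text=True,
            timeout=timeout,      # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if proc.returncode == 0 else "error"
```

Calling check_import() inside the fedora:37 and fedora:38 containers would, under these assumptions, show whether the import completes or blocks without tying up the interactive session.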

Here is some information about my host OS:

$ uname -a
Linux ######### 6.2.0-34-generic #34~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep  7 13:12:03 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

mattdm commented 11 months ago

I just tried this on Fedora Linux 38 (with podman instead of docker) and the basic import works fine with no errors.

I notice you are installing both the Fedora-provided python3-mpi4py-openmpi and mpi4py via pip. I tried both, and both seem fine.

The errors you see seem to be something to do with accessing the NIC — or is that a red herring?

spossann commented 11 months ago

Thanks for the quick reply. To be clear: import mpi4py works for me; what does not work is

from mpi4py import MPI

I clarified this in the issue title. The namespace of mpi4py looks like this:

Python 3.11.5 (main, Aug 28 2023, 00:00:00) [GCC 13.2.1 20230728 (Red Hat 13.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mpi4py
>>> dir(mpi4py)
['Rc', '__all__', '__author__', '__builtins__', '__cached__', '__credits__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'get_config', 'get_include', 'profile', 'rc']


>>> dir(mpi4py.__all__)
['__add__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

I also tried removing python3-mpi4py-openmpi but the error persists.

mattdm commented 11 months ago

from mpi4py import MPI as in your example above is exactly what I did in attempting to duplicate your issue. Sorry that wasn't clear.

richardvanderburgh commented 8 months ago

Adding these environment variables seemed to resolve this issue for me:

OMPI_MCA_pml=ob1 OMPI_MCA_btl=tcp,self
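
Open MPI reads MCA parameters from environment variables prefixed with OMPI_MCA_, so the two settings above can also be applied from Python, provided this happens before mpi4py.MPI is imported (the import is what initializes MPI by default). The sketch below is one possible way to do that; the helper names apply_workaround and import_mpi are hypothetical, and whether these values are appropriate for a given system is an assumption based only on the comment above.

```python
import os

# Values copied from the comment above. With these, Open MPI is asked to use
# the ob1 messaging layer with only the TCP and self (loopback) transports.
WORKAROUND = {
    "OMPI_MCA_pml": "ob1",
    "OMPI_MCA_btl": "tcp,self",
}


def apply_workaround(env=None):
    """Set the workaround variables in `env` (default: os.environ).

    Variables that are already set are left untouched, so an explicit
    user setting still wins.
    """
    env = os.environ if env is None else env
    for name, value in WORKAROUND.items():
        env.setdefault(name, value)
    return env


def import_mpi():
    """Apply the workaround, then import MPI (requires mpi4py installed)."""
    apply_workaround()
    from mpi4py import MPI  # imported late so the variables are set first
    return MPI
```

A script would call MPI = import_mpi() at the top instead of importing MPI directly. Setting the variables in the shell or in the image has the same effect and avoids touching the Python code.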