conda-forge / openmpi-feedstock

A conda-smithy repository for openmpi.
BSD 3-Clause "New" or "Revised" License

Segmentation fault issue #171

Closed · yonghoonlee closed this issue 4 months ago

yonghoonlee commented 4 months ago

Solution to issue cannot be found in the documentation.

Issue

I experience a segmentation fault (signal 11) when using MPI on Linux. Originally I suspected a bug in mpi4py, but it seems that openmpi is more likely the culprit, as @dalcinl mentioned in https://github.com/mpi4py/mpi4py/issues/523. Here's how the segmentation fault can be reproduced.

module purge # I purge all modules and use conda to automatically determine dependencies
conda create -y --name test-env python mpi4py
conda activate test-env
mpirun -np 4 python test.py

The Python code in test.py is

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
print(rank)

The expected result is

0
1
2
3

The actual result is

[log02:599094:0:599094] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:599097:0:599097] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:599095:0:599095] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:599096:0:599096] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)

Installed packages

# packages in environment at /home/yhlee/miniforge3/envs/test-env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                h4bc722e_7    conda-forge
ca-certificates           2024.7.4             hbcca054_0    conda-forge
icu                       75.1                 he02047a_0    conda-forge
ld_impl_linux-64          2.40                 hf3520f5_7    conda-forge
libevent                  2.1.12               hf998b51_1    conda-forge
libexpat                  2.6.2                h59595ed_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 14.1.0               h77fa898_0    conda-forge
libgfortran-ng            14.1.0               h69a702a_0    conda-forge
libgfortran5              14.1.0               hc5f4f2c_0    conda-forge
libgomp                   14.1.0               h77fa898_0    conda-forge
libhwloc                  2.11.1          default_hecaa2ac_1000    conda-forge
libiconv                  1.17                 hd590300_2    conda-forge
libnl                     3.10.0               h4bc722e_0    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libsqlite                 3.46.0               hde9e2c9_0    conda-forge
libstdcxx-ng              14.1.0               hc0a3c3a_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libxml2                   2.12.7               he7c6b58_4    conda-forge
libzlib                   1.3.1                h4ab18f5_1    conda-forge
mpi                       1.0                     openmpi    conda-forge
mpi4py                    3.1.6           py312hae4ded5_1    conda-forge
ncurses                   6.5                  h59595ed_0    conda-forge
openmpi                   5.0.3              h9a79eee_110    conda-forge
openssl                   3.3.1                h4bc722e_2    conda-forge
pip                       24.0               pyhd8ed1ab_0    conda-forge
python                    3.12.4          h194c7f8_0_cpython    conda-forge
python_abi                3.12                    4_cp312    conda-forge
readline                  8.2                  h8228510_1    conda-forge
setuptools                71.0.4             pyhd8ed1ab_0    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
tzdata                    2024a                h0c530f3_0    conda-forge
wheel                     0.43.0             pyhd8ed1ab_1    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge

Environment info

active environment : test-env
    active env location : /home/yhlee/miniforge3/envs/test-env
            shell level : 2
       user config file : /home/yhlee/.condarc
 populated config files : /home/yhlee/miniforge3/.condarc
                          /home/yhlee/.condarc
          conda version : 24.5.0
    conda-build version : not installed
         python version : 3.10.14.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=zen4
                          __conda=24.5.0=0
                          __glibc=2.28=0
                          __linux=4.18.0=0
                          __unix=0=0
       base environment : /home/yhlee/miniforge3  (writable)
      conda av data dir : /home/yhlee/miniforge3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /home/yhlee/miniforge3/pkgs
                          /home/yhlee/.conda/pkgs
       envs directories : /home/yhlee/miniforge3/envs
                          /home/yhlee/.conda/envs
               platform : linux-64
             user-agent : conda/24.5.0 requests/2.31.0 CPython/3.10.14 Linux/4.18.0-477.10.1.el8_8.x86_64 rocky/8.8 glibc/2.28 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8
                UID:GID : 63599:100
             netrc file : None
           offline mode : False
yonghoonlee commented 4 months ago

There is another clue. If I run mpirun through a Slurm script, it shows a slightly more detailed error message:

cpu-bind=MASK - ac07, task  0  0 [2583294]: mask 0x4000000000000040040040 set
[ac07:2583382:0:2583382] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[ac07:2583380:0:2583380] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[ac07:2583381:0:2583381] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[ac07:2583383:0:2583383] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
--------------------------------------------------------------------------
prterun noticed that process rank 2 with PID 2583382 on node ac07 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The Slurm script I use is

#!/bin/bash
#SBATCH --ntasks        4
#SBATCH --cpus-per-task 1
#SBATCH --time          00:01:00
#SBATCH --mem-per-cpu   1G
#SBATCH --partition     acomputeq
#SBATCH --job-name      mpi
#SBATCH --output        mpi-%J.out
#SBATCH --error         mpi-%J.err
module purge
source ~/.bashrc
source activate test-env
mpirun python test.py
yonghoonlee commented 4 months ago

If I downgrade some of the dependencies, it works. For example, if I create the conda environment forcing mpi4py=3.1.5, as in

module purge # I purge all modules and use conda to automatically determine dependencies
conda create -y --name test-env python mpi4py=3.1.5
conda activate test-env
mpirun -np 4 python test.py

then the result comes out without the segmentation fault:

0
1
2
3

Here, the conda list is:

# packages in environment at /home/yhlee/miniforge3/envs/test-env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                h4bc722e_7    conda-forge
ca-certificates           2024.7.4             hbcca054_0    conda-forge
ld_impl_linux-64          2.40                 hf3520f5_7    conda-forge
libexpat                  2.6.2                h59595ed_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 14.1.0               h77fa898_0    conda-forge
libgfortran-ng            14.1.0               h69a702a_0    conda-forge
libgfortran5              14.1.0               hc5f4f2c_0    conda-forge
libgomp                   14.1.0               h77fa898_0    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libsqlite                 3.46.0               hde9e2c9_0    conda-forge
libstdcxx-ng              14.1.0               hc0a3c3a_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libzlib                   1.3.1                h4ab18f5_1    conda-forge
mpi                       1.0                       mpich    conda-forge
mpi4py                    3.1.5           py312h5256a87_1    conda-forge
mpich                     4.2.2              h4a7f18d_100    conda-forge
ncurses                   6.5                  h59595ed_0    conda-forge
openssl                   3.3.1                h4bc722e_2    conda-forge
pip                       24.0               pyhd8ed1ab_0    conda-forge
python                    3.12.4          h194c7f8_0_cpython    conda-forge
python_abi                3.12                    4_cp312    conda-forge
readline                  8.2                  h8228510_1    conda-forge
setuptools                71.0.4             pyhd8ed1ab_0    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
tzdata                    2024a                h0c530f3_0    conda-forge
wheel                     0.43.0             pyhd8ed1ab_1    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
dalcinl commented 4 months ago

@yonghoonlee Note that when you install mpi4py 3.1.5, this time you are not getting Open MPI but MPICH. That is because you are not explicitly telling conda which MPI you want. Please add openmpi explicitly to the list of packages in the conda create ... invocation.
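
For example, something like this (the version pin is illustrative; the important part is listing openmpi itself):

conda create -y --name test-env python mpi4py=3.1.6 openmpi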

dalcinl commented 4 months ago

Also, once you create your environments with mpi4py=3.1.6 and openmpi explicitly, can you try running the following and show us the output?

python -m mpi4py --prefix
python -m mpi4py --mpi-lib-version
mpiexec -n 4 python -m mpi4py.bench ringtest
mpiexec -n 4 python -m mpi4py.bench helloworld
yonghoonlee commented 4 months ago

Conda environment creation

conda create --name test-env python mpi4py=3.1.6 openmpi
conda activate test-env

Command entered python -m mpi4py --prefix

Response

/home/yhlee/miniforge3/envs/test-env/lib/python3.12/site-packages/mpi4py

Command entered python -m mpi4py --mpi-lib-version

Response

Open MPI v5.0.3, package: Open MPI conda@2a526d73f59c Distribution, ident: 5.0.3, repo rev: v5.0.3, Apr 08, 2024

Command entered mpiexec -n 4 python -m mpi4py.bench ringtest

Response

[log02:837208:0:837208] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:837211:0:837211] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:837209:0:837209] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:837210:0:837210] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 837208 on node log02 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Command entered mpiexec -n 4 python -m mpi4py.bench helloworld

Response

[log02:838070:0:838070] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:838073:0:838073] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:838071:0:838071] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:838072:0:838072] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 838070 on node log02 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Conda environment package list (conda list)

Response

# packages in environment at /home/yhlee/miniforge3/envs/test-env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                h4bc722e_7    conda-forge
ca-certificates           2024.7.4             hbcca054_0    conda-forge
icu                       75.1                 he02047a_0    conda-forge
ld_impl_linux-64          2.40                 hf3520f5_7    conda-forge
libevent                  2.1.12               hf998b51_1    conda-forge
libexpat                  2.6.2                h59595ed_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 14.1.0               h77fa898_0    conda-forge
libgfortran-ng            14.1.0               h69a702a_0    conda-forge
libgfortran5              14.1.0               hc5f4f2c_0    conda-forge
libgomp                   14.1.0               h77fa898_0    conda-forge
libhwloc                  2.11.1          default_hecaa2ac_1000    conda-forge
libiconv                  1.17                 hd590300_2    conda-forge
libnl                     3.10.0               h4bc722e_0    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libsqlite                 3.46.0               hde9e2c9_0    conda-forge
libstdcxx-ng              14.1.0               hc0a3c3a_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libxml2                   2.12.7               he7c6b58_4    conda-forge
libzlib                   1.3.1                h4ab18f5_1    conda-forge
mpi                       1.0                     openmpi    conda-forge
mpi4py                    3.1.6           py312hae4ded5_1    conda-forge
ncurses                   6.5                  h59595ed_0    conda-forge
openmpi                   5.0.3              h9a79eee_110    conda-forge
openssl                   3.3.1                h4bc722e_2    conda-forge
pip                       24.0               pyhd8ed1ab_0    conda-forge
python                    3.12.4          h194c7f8_0_cpython    conda-forge
python_abi                3.12                    4_cp312    conda-forge
readline                  8.2                  h8228510_1    conda-forge
setuptools                71.0.4             pyhd8ed1ab_0    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
tzdata                    2024a                h0c530f3_0    conda-forge
wheel                     0.43.0             pyhd8ed1ab_1    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge

Conda environment info (conda info)

Response

     active environment : test-env
    active env location : /home/yhlee/miniforge3/envs/test-env
            shell level : 2
       user config file : /home/yhlee/.condarc
 populated config files : /home/yhlee/miniforge3/.condarc
                          /home/yhlee/.condarc
          conda version : 24.5.0
    conda-build version : not installed
         python version : 3.10.14.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=zen4
                          __conda=24.5.0=0
                          __glibc=2.28=0
                          __linux=4.18.0=0
                          __unix=0=0
       base environment : /home/yhlee/miniforge3  (writable)
      conda av data dir : /home/yhlee/miniforge3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /home/yhlee/miniforge3/pkgs
                          /home/yhlee/.conda/pkgs
       envs directories : /home/yhlee/miniforge3/envs
                          /home/yhlee/.conda/envs
               platform : linux-64
             user-agent : conda/24.5.0 requests/2.31.0 CPython/3.10.14 Linux/4.18.0-477.10.1.el8_8.x86_64 rocky/8.8 glibc/2.28 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8
                UID:GID : 63599:100
             netrc file : None
           offline mode : False

Thank you.

yonghoonlee commented 4 months ago

@yonghoonlee Note that when you install mpi4py 3.1.5, this time you are not getting Open MPI but MPICH. That is because you are not explicitly telling conda which MPI you want. Please add openmpi explicitly to the list of packages in the conda create ... invocation.

Yes, that's right. I will test both mpi4py=3.1.5 and mpi4py=3.1.6 with openmpi.

yonghoonlee commented 4 months ago

@dalcinl I performed a few more tests.

conda create --name test-env python mpi4py=3.1.5 openmpi  -> the resulting environment contains openmpi=4.1.6
conda create --name test-env python mpi4py=3.1.6 openmpi  -> the resulting environment contains openmpi=5.0.3

They behave differently.

For the environment with mpi4py=3.1.5 (which installs openmpi=4.1.6)

mpiexec -n 4 python -m mpi4py.bench ringtest

gives

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   log02
  Local device: mlx5_0
--------------------------------------------------------------------------
time for 1 loops = 6.2891e-05 seconds (4 processes, 1 bytes)
[log02:869941] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[log02:869941] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

For the environment with mpi4py=3.1.6 (which installs openmpi=5.0.3)

mpiexec -n 4 python -m mpi4py.bench ringtest

gives

[log02:837208:0:837208] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:837211:0:837211] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:837209:0:837209] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:837210:0:837210] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 837208 on node log02 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

MPICH instead of Open MPI

If I specify an mpich build, there is no problem regardless of the mpi4py version.

Testing with the following environment

conda create --name test-env python mpi4py=3.1.5 mpi=1.0=mpich
conda activate test-env
mpiexec -n 4 python -m mpi4py.bench ringtest
mpiexec -n 4 python -m mpi4py.bench helloworld

gives

time for 1 loops = 0.000136181 seconds (4 processes, 1 bytes)

and

Hello, World! I am process 0 of 4 on log02.
Hello, World! I am process 1 of 4 on log02.
Hello, World! I am process 2 of 4 on log02.
Hello, World! I am process 3 of 4 on log02.

Testing with the following environment

conda create --name test-env python mpi4py=3.1.6 mpi=1.0=mpich
conda activate test-env
mpiexec -n 4 python -m mpi4py.bench ringtest
mpiexec -n 4 python -m mpi4py.bench helloworld

gives

time for 1 loops = 0.000106991 seconds (4 processes, 1 bytes)

and

Hello, World! I am process 0 of 4 on log02.
Hello, World! I am process 1 of 4 on log02.
Hello, World! I am process 2 of 4 on log02.
Hello, World! I am process 3 of 4 on log02.
dalcinl commented 4 months ago

I'm not able to reproduce the segfault with the conda-forge packages installed with micromamba, either on a Fedora 30 host or under Ubuntu 22.04 in Docker. I have no idea what's going on. You may have to ask the Open MPI community for further help on how to properly debug the issue.

yonghoonlee commented 4 months ago

Thank you @dalcinl for your help. I will seek further help from the Open MPI community. In the meantime, our earlier discussion showed me that MPICH works fine, so I can still run my tasks with MPICH and the Open MPI issue is not too disruptive for my work at the moment.

piyueh commented 4 months ago

@yonghoonlee @dalcinl I could reproduce the same segmentation fault on a shared HPC cluster with openmpi=5.0.3=h9a79eee_110 and mpi4py=3.1.6=py312hae4ded5_1. However, on my local machine, which runs Arch Linux, I did not have this issue. The last openmpi conda package that worked fine on that cluster is openmpi=4.1.5.

UPDATE: problem resolved by installing ucx with openmpi=5.0.3. See the comment below for more details.

piyueh commented 4 months ago

@yonghoonlee @dalcinl I finally resolved my issue by installing ucx. Not sure why, but I needed ucx to make openmpi=5.0.3 work. Hopefully this information can help others.

Note that one must explicitly specify the version with openmpi=5.0.3 when creating the environment. Otherwise, with the latest ucx, the solver seems to automatically choose openmpi=4.1.6.

For example, to do a quick test:

  1. mamba create -n test python=3.12 openmpi=5.0.3 ucx mpi4py=3.1.6
  2. mamba activate test
  3. mpiexec -n 4 python -m mpi4py.bench ringtest

With this environment, I did not get the segmentation fault.

dalcinl commented 4 months ago

@yonghoonlee When you get the segfault, is UCX available externally in the system?

Is there any chance you can run under valgrind to try to see where exactly it segfaults? For example:

curl -O https://raw.githubusercontent.com/mpi4py/mpi4py/master/demo/helloworld.c
mpicc helloworld.c -o helloworld.exe
mamba install valgrind
mpiexec -n 1 valgrind ./helloworld.exe

@leofang This may mean that our way of disabling UCX by default is broken. Could it be that the configuration in $PREFIX/etc/openmpi-mca-params.conf is not being honored? Or maybe Open MPI has a bug, and if the UCX libraries are not found, it segfaults rather than bailing out and using another component?
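
As a rough sanity check (just a sketch using the standard MCA mechanisms, not a definitive diagnosis), one could inspect what the packaged defaults say and what values Open MPI actually ends up with:

# show the MCA defaults shipped in the conda package (the $PREFIX file mentioned above)
cat $CONDA_PREFIX/etc/openmpi-mca-params.conf
# list the btl framework parameters and their current values
ompi_info --param btl all --level 9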

piyueh commented 4 months ago

@yonghoonlee When you get the segfault, is UCX available externally in the system?

Is there any chance you can run under valgrind to try to see where exactly it segfaults? For example:

curl -O https://raw.githubusercontent.com/mpi4py/mpi4py/master/demo/helloworld.c
mpicc helloworld.c -o helloworld.exe
mamba install valgrind
mpiexec -n 1 valgrind ./helloworld.exe

@leofang This may mean that our way of disabling UCX by default is broken. Could it be that the configuration in $PREFIX/etc/openmpi-mca-params.conf is not being honored? Or maybe Open MPI has a bug, and if the UCX libraries are not found, it segfaults rather than bailing out and using another component?

Hi @dalcinl, I'm not @yonghoonlee, but I took a look on my side as well. And yes, when I got the segfault on the shared cluster, the cluster did have an old version of UCX (v1.8) in the system's runtime search path (libuct.so, libucs.so, libucp.so, and libucm.so). On the other hand, my local machine (which did not give the segfault) does not have UCX at all.

I tested the helloworld example with both GDB and Valgrind. Here's the GDB backtrace (paths truncated):

#0  0x000003ff00000001 in ?? ()
#1  0x00002aaaaea3c37a in mca_btl_uct_tl_progress.part () from .../envs/test/lib/openmpi/mca_btl_uct.so
#2  0x00002aaaaea3c667 in mca_btl_uct_component_progress () from .../envs/test/lib/openmpi/mca_btl_uct.so
#3  0x00002aaaab2d6bd3 in opal_progress () from .../envs/test/lib/./libopen-pal.so.80
#4  0x00002aaaaab3a69a in ompi_mpi_instance_init_common () from .../envs/test/lib/libmpi.so.40
#5  0x00002aaaaab3a785 in ompi_mpi_instance_init () from .../envs/test/lib/libmpi.so.40
#6  0x00002aaaaab2e360 in ompi_mpi_init () from .../envs/test/lib/libmpi.so.40
#7  0x00002aaaaab61451 in PMPI_Init_thread () from .../envs/test/lib/libmpi.so.40
#8  0x00005555555551ae in main (argc=1, argv=0x7fffffff79b8) at helloworld.c:11

Valgrind first showed the same thing as above, but then an extra block:

==57655==    at 0x843C229: x86_64_fallback_frame_state (md-unwind-support.h:63)
==57655==    by 0x843C229: uw_frame_state_for (unwind-dw2.c:1013)
==57655==    by 0x843D5C3: _Unwind_Backtrace (unwind.inc:303)
==57655==    by 0x4D54CD5: backtrace (in /usr/lib64/libc-2.17.so)
==57655==    by 0x96E102A: ??? (in /usr/lib64/libucs.so.0.0.0)
==57655==    by 0x96E130B: ucs_debug_backtrace_create (in /usr/lib64/libucs.so.0.0.0)
==57655==    by 0x96E1883: ??? (in /usr/lib64/libucs.so.0.0.0)
==57655==    by 0x96E3C7F: ucs_handle_error (in /usr/lib64/libucs.so.0.0.0)
==57655==    by 0x96E400B: ??? (in /usr/lib64/libucs.so.0.0.0)
==57655==    by 0x96E41C1: ??? (in /usr/lib64/libucs.so.0.0.0)
==57655==    by 0x4A3362F: ??? (in /usr/lib64/libpthread-2.17.so)
==57655==    by 0x3FF00000000: ???
==57655==  Address 0x3ff00000001 is not stack'd, malloc'd or (recently) free'd

Valgrind showed the involvement of the system's UCX libraries (e.g., /usr/lib64/libucs.so.0.0.0).

To summarize: I think the segfault happens when an old (or otherwise incompatible) version of UCX exists in the runtime search path, even though that UCX is not managed by the conda environment.
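
For anyone who wants to verify this on their own machine, a quick non-authoritative sketch (the mca_btl_uct.so path is taken from the valgrind output above):

# is a system-wide UCX visible to the dynamic loader?
ldconfig -p | grep -E 'libuc[stpm]'
# which UCX libraries does the conda-packaged uct BTL actually resolve against?
ldd $CONDA_PREFIX/lib/openmpi/mca_btl_uct.so | grep -E 'libuc[stpm]'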

dalcinl commented 4 months ago

@jsquyres Is there anything Open MPI could do on their side to detect incompatible compile-time vs run-time UCX versions?

yonghoonlee commented 4 months ago

@dalcinl

I do not know how to interpret the valgrind output, but here's what I got:

mpiexec -n 1 valgrind ./helloworld

gives

==1970663== Memcheck, a memory error detector
==1970663== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==1970663== Using Valgrind-3.23.0 and LibVEX; rerun with -h for copyright info
==1970663== Command: ./helloworld
==1970663== 
==1970663== WARNING: valgrind ignores shmget(shmflg) SHM_HUGETLB
==1970663== Conditional jump or move depends on uninitialised value(s)
==1970663==    at 0xAA36382: ??? (in /usr/lib64/libibverbs.so.1.14.43.0)
==1970663==    by 0xAA36B63: ibv_cmd_create_srq (in /usr/lib64/libibverbs.so.1.14.43.0)
==1970663==    by 0xAC9E43E: ??? (in /usr/lib64/libmlx5.so.1.24.43.0)
==1970663==    by 0xAA3F0DA: ibv_create_srq (in /usr/lib64/libibverbs.so.1.14.43.0)
==1970663==    by 0xA7CA596: uct_rc_iface_init_rx (in /usr/lib64/ucx/libuct_ib.so.0.0.0)
==1970663==    by 0xA7CAB00: uct_rc_iface_t_init (in /usr/lib64/ucx/libuct_ib.so.0.0.0)
==1970663==    by 0xA7CE8AF: ??? (in /usr/lib64/ucx/libuct_ib.so.0.0.0)
==1970663==    by 0xA7CEED9: ??? (in /usr/lib64/ucx/libuct_ib.so.0.0.0)
==1970663==    by 0x9E2CE6A: uct_iface_open (in /usr/lib64/libuct.so.0.0.0)
==1970663==    by 0x95705BA: mca_btl_uct_context_create (in /home/yhlee/miniforge3/envs/test-env/lib/openmpi/mca_btl_uct.so)
==1970663==    by 0x9570977: mca_btl_uct_query_tls (in /home/yhlee/miniforge3/envs/test-env/lib/openmpi/mca_btl_uct.so)
==1970663==    by 0x956AF10: mca_btl_uct_component_init (in /home/yhlee/miniforge3/envs/test-env/lib/openmpi/mca_btl_uct.so)
==1970663== 
==1970663== Thread 2:
==1970663== Syscall param writev(vector[1]) points to uninitialised byte(s)
==1970663==    at 0x4D76F4F: writev (in /usr/lib64/libc-2.28.so)
==1970663==    by 0x540D3C3: pmix_ptl_base_send_handler (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663==    by 0x41FC94D: event_process_active_single_queue (in /home/yhlee/miniforge3/envs/test-env/lib/libevent_core-2.1.so.7.0.1)
==1970663==    by 0x41FD266: event_base_loop (in /home/yhlee/miniforge3/envs/test-env/lib/libevent_core-2.1.so.7.0.1)
==1970663==    by 0x538CD19: progress_engine (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663==    by 0x4A391C9: start_thread (pthread_create.c:479)
==1970663==    by 0x4C8AE72: clone (in /usr/lib64/libc-2.28.so)
==1970663==  Address 0xf1b1e8d is 29 bytes inside a block of size 512 alloc'd
==1970663==    at 0x4042DDC: realloc (vg_replace_malloc.c:1800)
==1970663==    by 0x53D99EA: pmix_bfrop_buffer_extend (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663==    by 0x53E2513: pmix_bfrops_base_pack_byte (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663==    by 0x53E2E80: pmix_bfrops_base_pack_buf (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663==    by 0x53E2218: pmix_bfrops_base_pack (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663==    by 0x532BE3C: _commitfn (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663==    by 0x41FCA75: event_process_active_single_queue (in /home/yhlee/miniforge3/envs/test-env/lib/libevent_core-2.1.so.7.0.1)
==1970663==    by 0x41FD266: event_base_loop (in /home/yhlee/miniforge3/envs/test-env/lib/libevent_core-2.1.so.7.0.1)
==1970663==    by 0x538CD19: progress_engine (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663==    by 0x4A391C9: start_thread (pthread_create.c:479)
==1970663==    by 0x4C8AE72: clone (in /usr/lib64/libc-2.28.so)
==1970663== 
==1970663== Thread 1:
==1970663== Jump to the invalid address stated on the next line
==1970663==    at 0x3FF00000001: ???
==1970663==    by 0x956A379: mca_btl_uct_tl_progress.part.0 (in /home/yhlee/miniforge3/envs/test-env/lib/openmpi/mca_btl_uct.so)
==1970663==    by 0x956A666: mca_btl_uct_component_progress (in /home/yhlee/miniforge3/envs/test-env/lib/openmpi/mca_btl_uct.so)
==1970663==    by 0x5033BD2: opal_progress (in /home/yhlee/miniforge3/envs/test-env/lib/libopen-pal.so.80.0.3)
==1970663==    by 0x40BC699: ompi_mpi_instance_init_common (in /home/yhlee/miniforge3/envs/test-env/lib/libmpi.so.40.40.3)
==1970663==    by 0x40BC784: ompi_mpi_instance_init (in /home/yhlee/miniforge3/envs/test-env/lib/libmpi.so.40.40.3)
==1970663==    by 0x40B035F: ompi_mpi_init (in /home/yhlee/miniforge3/envs/test-env/lib/libmpi.so.40.40.3)
==1970663==    by 0x40E3450: PMPI_Init_thread (in /home/yhlee/miniforge3/envs/test-env/lib/libmpi.so.40.40.3)
==1970663==    by 0x1091AD: main (in /home/yhlee/helloworld)
==1970663==  Address 0x3ff00000001 is not stack'd, malloc'd or (recently) free'd
==1970663== 
[log02:1970663:0:1970663] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x3ff00000001)
==1970663== Invalid read of size 1
==1970663==    at 0x83E1229: x86_64_fallback_frame_state (md-unwind-support.h:63)
==1970663==    by 0x83E1229: uw_frame_state_for (unwind-dw2.c:1013)
==1970663==    by 0x83E25C3: _Unwind_Backtrace (unwind.inc:303)
==1970663==    by 0x4D8C4A5: backtrace (in /usr/lib64/libc-2.28.so)
==1970663==    by 0x98BA9F8: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663==    by 0x98BACDF: ucs_debug_backtrace_create (in /usr/lib64/libucs.so.0.0.0)
==1970663==    by 0x98BB243: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663==    by 0x98BD97F: ucs_handle_error (in /usr/lib64/libucs.so.0.0.0)
==1970663==    by 0x98BDB6B: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663==    by 0x98BDD39: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663==    by 0x4A43CEF: ??? (in /usr/lib64/libpthread-2.28.so)
==1970663==    by 0x3FF00000000: ???
==1970663==  Address 0x3ff00000001 is not stack'd, malloc'd or (recently) free'd
==1970663== 
==1970663== 
==1970663== Process terminating with default action of signal 11 (SIGSEGV)
==1970663==  Access not within mapped region at address 0x3FF00000001
==1970663==    at 0x83E1229: x86_64_fallback_frame_state (md-unwind-support.h:63)
==1970663==    by 0x83E1229: uw_frame_state_for (unwind-dw2.c:1013)
==1970663==    by 0x83E25C3: _Unwind_Backtrace (unwind.inc:303)
==1970663==    by 0x4D8C4A5: backtrace (in /usr/lib64/libc-2.28.so)
==1970663==    by 0x98BA9F8: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663==    by 0x98BACDF: ucs_debug_backtrace_create (in /usr/lib64/libucs.so.0.0.0)
==1970663==    by 0x98BB243: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663==    by 0x98BD97F: ucs_handle_error (in /usr/lib64/libucs.so.0.0.0)
==1970663==    by 0x98BDB6B: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663==    by 0x98BDD39: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663==    by 0x4A43CEF: ??? (in /usr/lib64/libpthread-2.28.so)
==1970663==    by 0x3FF00000000: ???
==1970663==  If you believe this happened as a result of a stack
==1970663==  overflow in your program's main thread (unlikely but
==1970663==  possible), you can try to increase the size of the
==1970663==  main thread stack using the --main-stacksize= flag.
==1970663==  The main thread stack size used in this run was 8388608.
==1970663== 
==1970663== HEAP SUMMARY:
==1970663==     in use at exit: 6,186,949 bytes in 18,884 blocks
==1970663==   total heap usage: 57,148 allocs, 38,264 frees, 15,172,886 bytes allocated
==1970663== 
==1970663== LEAK SUMMARY:
==1970663==    definitely lost: 767 bytes in 29 blocks
==1970663==    indirectly lost: 51,862 bytes in 13 blocks
==1970663==      possibly lost: 277,266 bytes in 45 blocks
==1970663==    still reachable: 5,856,966 bytes in 18,794 blocks
==1970663==         suppressed: 88 bytes in 3 blocks
==1970663== Rerun with --leak-check=full to see details of leaked memory
==1970663== 
==1970663== Use --track-origins=yes to see where uninitialised values come from
==1970663== For lists of detected and suppressed errors, rerun with: -s
==1970663== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 1970663 on node log02 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
yonghoonlee commented 4 months ago

@piyueh

I don't think there was a UCX loaded by default. I relied on the dependency chains that conda resolves automatically. As you suggested, if I specify openmpi=5.0.3 and install ucx along with it, it works fine. Thank you for sharing your resolution, though it might still need to be properly addressed on the openmpi side. Thank you!

dalcinl commented 4 months ago

I don't think there was a UCX loaded by default.

Well, I do see uct_* symbols in your valgrind output, therefore it is somehow being used.

EDIT: Maybe this is actually an issue/bug in the older UCX installed in your system.
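
If the system UCX turns out to be the problem, one way to test that hypothesis (untested here; this is just the standard MCA component-exclusion mechanism) is to run with the uct BTL disabled:

export OMPI_MCA_btl=^uct
mpiexec -n 4 python -m mpi4py.bench ringtest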

yonghoonlee commented 3 months ago

EDIT: Maybe this is actually an issue/bug in the older UCX installed in your system.

Thanks. I will check the UCX version installed on my system.