There is another clue to provide. If I run mpirun via a Slurm script, it shows a slightly more detailed error message:
cpu-bind=MASK - ac07, task 0 0 [2583294]: mask 0x4000000000000040040040 set
[ac07:2583382:0:2583382] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[ac07:2583380:0:2583380] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[ac07:2583381:0:2583381] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[ac07:2583383:0:2583383] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
--------------------------------------------------------------------------
prterun noticed that process rank 2 with PID 2583382 on node ac07 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The Slurm script I use is:
#!/bin/bash
#SBATCH --ntasks 4
#SBATCH --cpus-per-task 1
#SBATCH --time 00:01:00
#SBATCH --mem-per-cpu 1G
#SBATCH --partition acomputeq
#SBATCH --job-name mpi
#SBATCH --output mpi-%J.out
#SBATCH --error mpi-%J.err
module purge
source ~/.bashrc
source activate test-env
mpirun python test.py
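For reference, test.py itself is not shown in this thread; judging from the expected output below, it is presumably a minimal rank-printing script along these lines (a hypothetical reconstruction, not the original file):
# test.py (hypothetical reconstruction): every rank prints its rank number,
# so 4 processes should print 0, 1, 2, 3 in some order
from mpi4py import MPI
print(MPI.COMM_WORLD.Get_rank())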
If I downgrade the dependencies, it works. For example, if I create the conda environment while forcing mpi4py=3.1.5:
module purge # I purge all modules and use conda to automatically determine dependencies
conda create -y --name test-env python mpi4py=3.1.5
conda activate test-env
mpirun -np 4 python test.py
Then the result comes out without a segmentation fault:
0
1
2
3
Here, the conda list is:
# packages in environment at /home/yhlee/miniforge3/envs/test-env:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
bzip2 1.0.8 h4bc722e_7 conda-forge
ca-certificates 2024.7.4 hbcca054_0 conda-forge
ld_impl_linux-64 2.40 hf3520f5_7 conda-forge
libexpat 2.6.2 h59595ed_0 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 14.1.0 h77fa898_0 conda-forge
libgfortran-ng 14.1.0 h69a702a_0 conda-forge
libgfortran5 14.1.0 hc5f4f2c_0 conda-forge
libgomp 14.1.0 h77fa898_0 conda-forge
libnsl 2.0.1 hd590300_0 conda-forge
libsqlite 3.46.0 hde9e2c9_0 conda-forge
libstdcxx-ng 14.1.0 hc0a3c3a_0 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libxcrypt 4.4.36 hd590300_1 conda-forge
libzlib 1.3.1 h4ab18f5_1 conda-forge
mpi 1.0 mpich conda-forge
mpi4py 3.1.5 py312h5256a87_1 conda-forge
mpich 4.2.2 h4a7f18d_100 conda-forge
ncurses 6.5 h59595ed_0 conda-forge
openssl 3.3.1 h4bc722e_2 conda-forge
pip 24.0 pyhd8ed1ab_0 conda-forge
python 3.12.4 h194c7f8_0_cpython conda-forge
python_abi 3.12 4_cp312 conda-forge
readline 8.2 h8228510_1 conda-forge
setuptools 71.0.4 pyhd8ed1ab_0 conda-forge
tk 8.6.13 noxft_h4845f30_101 conda-forge
tzdata 2024a h0c530f3_0 conda-forge
wheel 0.43.0 pyhd8ed1ab_1 conda-forge
xz 5.2.6 h166bdaf_0 conda-forge
@yonghoonlee Note that when you install mpi4py 3.1.5, this time you are not getting Open MPI but MPICH. That is because you are not explicitly telling conda which MPI you want. Please add openmpi explicitly to the list of packages in the conda create ... invocation.
Also, once you create your environment with mpi4py=3.1.6 and openmpi explicitly, can you try running the following and show us the output?
python -m mpi4py --prefix
python -m mpi4py --mpi-lib-version
mpiexec -n 4 python -m mpi4py.bench ringtest
mpiexec -n 4 python -m mpi4py.bench helloworld
Conda environment creation
conda create --name test-env python mpi4py=3.1.6 openmpi
conda activate test-env
Command entered
python -m mpi4py --prefix
Response
/home/yhlee/miniforge3/envs/test-env/lib/python3.12/site-packages/mpi4py
Command entered
python -m mpi4py --mpi-lib-version
Response
Open MPI v5.0.3, package: Open MPI conda@2a526d73f59c Distribution, ident: 5.0.3, repo rev: v5.0.3, Apr 08, 2024
Command entered
mpiexec -n 4 python -m mpi4py.bench ringtest
Response
[log02:837208:0:837208] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:837211:0:837211] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:837209:0:837209] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:837210:0:837210] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 837208 on node log02 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Command entered
mpiexec -n 4 python -m mpi4py.bench helloworld
Response
[log02:838070:0:838070] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:838073:0:838073] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:838071:0:838071] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:838072:0:838072] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 838070 on node log02 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Conda environment package list
conda list
Response
# packages in environment at /home/yhlee/miniforge3/envs/test-env:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
bzip2 1.0.8 h4bc722e_7 conda-forge
ca-certificates 2024.7.4 hbcca054_0 conda-forge
icu 75.1 he02047a_0 conda-forge
ld_impl_linux-64 2.40 hf3520f5_7 conda-forge
libevent 2.1.12 hf998b51_1 conda-forge
libexpat 2.6.2 h59595ed_0 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 14.1.0 h77fa898_0 conda-forge
libgfortran-ng 14.1.0 h69a702a_0 conda-forge
libgfortran5 14.1.0 hc5f4f2c_0 conda-forge
libgomp 14.1.0 h77fa898_0 conda-forge
libhwloc 2.11.1 default_hecaa2ac_1000 conda-forge
libiconv 1.17 hd590300_2 conda-forge
libnl 3.10.0 h4bc722e_0 conda-forge
libnsl 2.0.1 hd590300_0 conda-forge
libsqlite 3.46.0 hde9e2c9_0 conda-forge
libstdcxx-ng 14.1.0 hc0a3c3a_0 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libxcrypt 4.4.36 hd590300_1 conda-forge
libxml2 2.12.7 he7c6b58_4 conda-forge
libzlib 1.3.1 h4ab18f5_1 conda-forge
mpi 1.0 openmpi conda-forge
mpi4py 3.1.6 py312hae4ded5_1 conda-forge
ncurses 6.5 h59595ed_0 conda-forge
openmpi 5.0.3 h9a79eee_110 conda-forge
openssl 3.3.1 h4bc722e_2 conda-forge
pip 24.0 pyhd8ed1ab_0 conda-forge
python 3.12.4 h194c7f8_0_cpython conda-forge
python_abi 3.12 4_cp312 conda-forge
readline 8.2 h8228510_1 conda-forge
setuptools 71.0.4 pyhd8ed1ab_0 conda-forge
tk 8.6.13 noxft_h4845f30_101 conda-forge
tzdata 2024a h0c530f3_0 conda-forge
wheel 0.43.0 pyhd8ed1ab_1 conda-forge
xz 5.2.6 h166bdaf_0 conda-forge
Conda environment info
conda info
Response
active environment : test-env
active env location : /home/yhlee/miniforge3/envs/test-env
shell level : 2
user config file : /home/yhlee/.condarc
populated config files : /home/yhlee/miniforge3/.condarc
/home/yhlee/.condarc
conda version : 24.5.0
conda-build version : not installed
python version : 3.10.14.final.0
solver : libmamba (default)
virtual packages : __archspec=1=zen4
__conda=24.5.0=0
__glibc=2.28=0
__linux=4.18.0=0
__unix=0=0
base environment : /home/yhlee/miniforge3 (writable)
conda av data dir : /home/yhlee/miniforge3/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
package cache : /home/yhlee/miniforge3/pkgs
/home/yhlee/.conda/pkgs
envs directories : /home/yhlee/miniforge3/envs
/home/yhlee/.conda/envs
platform : linux-64
user-agent : conda/24.5.0 requests/2.31.0 CPython/3.10.14 Linux/4.18.0-477.10.1.el8_8.x86_64 rocky/8.8 glibc/2.28 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8
UID:GID : 63599:100
netrc file : None
offline mode : False
Thank you.
@yonghoonlee Note that when you install mpi4py 3.1.5, this time you are not getting Open MPI but MPICH. That is because you are not explicitly telling conda which MPI you want. Please add openmpi explicitly to the list of packages in the conda create ... invocation.
Yes, that looks right. I will test both mpi4py=3.1.5 and 3.1.6 with openmpi.
@dalcinl I performed a few more tests.
conda create --name test-env python mpi4py=3.1.5 openmpi
installs openmpi=4.1.6, while
conda create --name test-env python mpi4py=3.1.6 openmpi
installs openmpi=5.0.3. The two environments behave differently.
With mpi4py=3.1.5 (which installs openmpi=4.1.6),
mpiexec -n 4 python -m mpi4py.bench ringtest
gives
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: log02
Local device: mlx5_0
--------------------------------------------------------------------------
time for 1 loops = 6.2891e-05 seconds (4 processes, 1 bytes)
[log02:869941] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[log02:869941] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
With mpi4py=3.1.6 (which installs openmpi=5.0.3),
mpiexec -n 4 python -m mpi4py.bench ringtest
gives
[log02:837208:0:837208] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:837211:0:837211] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:837209:0:837209] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:837210:0:837210] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 837208 on node log02 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
If I specify the MPICH build, there is no problem regardless of the mpi4py version.
Testing with the following environment
conda create --name test-env python mpi4py=3.1.5 mpi=1.0=mpich
conda activate test-env
mpiexec -n 4 python -m mpi4py.bench ringtest
mpiexec -n 4 python -m mpi4py.bench helloworld
gives
time for 1 loops = 0.000136181 seconds (4 processes, 1 bytes)
and
Hello, World! I am process 0 of 4 on log02.
Hello, World! I am process 1 of 4 on log02.
Hello, World! I am process 2 of 4 on log02.
Hello, World! I am process 3 of 4 on log02.
Testing with the following environment
conda create --name test-env python mpi4py=3.1.6 mpi=1.0=mpich
conda activate test-env
mpiexec -n 4 python -m mpi4py.bench ringtest
mpiexec -n 4 python -m mpi4py.bench helloworld
gives
time for 1 loops = 0.000106991 seconds (4 processes, 1 bytes)
and
Hello, World! I am process 0 of 4 on log02.
Hello, World! I am process 1 of 4 on log02.
Hello, World! I am process 2 of 4 on log02.
Hello, World! I am process 3 of 4 on log02.
I'm not able to reproduce the segfault with the conda-forge packages installed via micromamba, neither on a Fedora 30 host nor in Ubuntu 22.04 under Docker. I have no idea what's going on. You may have to ask the Open MPI community for further help on how to properly debug the issue.
Thank you @dalcinl for your help. I will seek further help from the Open MPI community. In the meantime, I have learned from our earlier discussion that MPICH works fine, so I can still run my tasks with MPICH, and the Open MPI issue is not too disruptive for my work at the moment.
@yonghoonlee @dalcinl I could reproduce the same segmentation fault on a shared HPC cluster with openmpi=5.0.3=h9a79eee_110 and mpi4py=3.1.6=py312hae4ded5_1. However, on my local machine, which runs Arch Linux, I did not have this issue. The last openmpi conda package that worked fine on that cluster is openmpi=4.1.5.
UPDATE: problem resolved by installing ucx with openmpi=5.0.3. See the comment below for more details.
@yonghoonlee @dalcinl I finally resolved my issue by installing ucx. Not sure why, but I need ucx to make openmpi=5.0.3 work. Hopefully this information can help others.
Note that one must explicitly specify the version with openmpi=5.0.3 when creating the environment. Otherwise, the latest ucx seems to automatically choose openmpi=4.1.6.
For example, to do a quick test:
mamba create -n test python=3.12 openmpi=5.0.3 ucx mpi4py=3.1.6
mamba activate test
mpiexec -n 4 python -m mpi4py.bench ringtest
Now I did not get the segmentation fault.
@yonghoonlee When you get the segfault, is UCX available externally in the system?
Is there any chance you can run under valgrind to try seeing where exactly the thing segfaults? For example:
curl -O https://raw.githubusercontent.com/mpi4py/mpi4py/master/demo/helloworld.c
mpicc helloworld.c -o helloworld.exe
mamba install valgrind
mpiexec -n 1 valgrind ./helloworld.exe
@leofang This may mean that our way of disabling UCX by default is broken.
Could it be that the configuration in $PREFIX/etc/openmpi-mca-params.conf is not being honored?
Or maybe Open MPI has a bug, and if the UCX libraries are not found, it segfaults rather than bailing out and using another component?
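One quick way to probe both hypotheses (suggested commands, not from this thread) is to inspect the packaged MCA defaults and to exclude the uct btl explicitly for one run:
# Show the MCA defaults shipped inside the conda environment
cat "$CONDA_PREFIX/etc/openmpi-mca-params.conf"
# Exclude the uct btl component; if the segfault disappears, the crash
# comes from btl/uct picking up an incompatible system UCX
mpiexec --mca btl ^uct -n 4 python -m mpi4py.bench ringtest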
Hi @dalcinl, I'm not @yonghoonlee, but I took a look on my side as well. And yes, when I got the segfault on the shared cluster, the cluster did have an old version of UCX (v1.8) in the system's runtime search path (libuct.so, libucs.so, libucp.so, and libucm.so). On the other hand, my local machine (which did not give the segfault) does not have UCX at all.
I tested the helloworld with both GDB and Valgrind. Here's the GDB backtrace (paths truncated):
#0 0x000003ff00000001 in ?? ()
#1 0x00002aaaaea3c37a in mca_btl_uct_tl_progress.part () from .../envs/test/lib/openmpi/mca_btl_uct.so
#2 0x00002aaaaea3c667 in mca_btl_uct_component_progress () from .../envs/test/lib/openmpi/mca_btl_uct.so
#3 0x00002aaaab2d6bd3 in opal_progress () from .../envs/test/lib/./libopen-pal.so.80
#4 0x00002aaaaab3a69a in ompi_mpi_instance_init_common () from .../envs/test/lib/libmpi.so.40
#5 0x00002aaaaab3a785 in ompi_mpi_instance_init () from .../envs/test/lib/libmpi.so.40
#6 0x00002aaaaab2e360 in ompi_mpi_init () from .../envs/test/lib/libmpi.so.40
#7 0x00002aaaaab61451 in PMPI_Init_thread () from .../envs/test/lib/libmpi.so.40
#8 0x00005555555551ae in main (argc=1, argv=0x7fffffff79b8) at helloworld.c:11
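For anyone reproducing this, a backtrace like the one above can be captured non-interactively with something along these lines (not the exact command used here):
mpiexec -n 1 gdb -batch -ex run -ex bt ./helloworld.exe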
Valgrind first showed the same backtrace as above, but then showed an extra block:
==57655== at 0x843C229: x86_64_fallback_frame_state (md-unwind-support.h:63)
==57655== by 0x843C229: uw_frame_state_for (unwind-dw2.c:1013)
==57655== by 0x843D5C3: _Unwind_Backtrace (unwind.inc:303)
==57655== by 0x4D54CD5: backtrace (in /usr/lib64/libc-2.17.so)
==57655== by 0x96E102A: ??? (in /usr/lib64/libucs.so.0.0.0)
==57655== by 0x96E130B: ucs_debug_backtrace_create (in /usr/lib64/libucs.so.0.0.0)
==57655== by 0x96E1883: ??? (in /usr/lib64/libucs.so.0.0.0)
==57655== by 0x96E3C7F: ucs_handle_error (in /usr/lib64/libucs.so.0.0.0)
==57655== by 0x96E400B: ??? (in /usr/lib64/libucs.so.0.0.0)
==57655== by 0x96E41C1: ??? (in /usr/lib64/libucs.so.0.0.0)
==57655== by 0x4A3362F: ??? (in /usr/lib64/libpthread-2.17.so)
==57655== by 0x3FF00000000: ???
==57655== Address 0x3ff00000001 is not stack'd, malloc'd or (recently) free'd
Valgrind showed the involvement of the system's UCX libraries (e.g., /usr/lib64/libucs.so.0.0.0).
To summarize: I think the segfault happens when an old (or otherwise incompatible) version of UCX exists in the runtime search path, even though that UCX is not managed by the conda environment.
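A quick way to check this on a given machine (suggested commands, not part of the original report):
# Which UCX libraries does the conda-packaged btl/uct component resolve to at run time?
ldd "$CONDA_PREFIX/lib/openmpi/mca_btl_uct.so" | grep -i libuc
# If a system-wide UCX is installed, query its version
ucx_info -v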
@jsquyres Is there anything Open MPI could do on their side to detect incompatible compile-time vs run-time UCX versions?
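For what it's worth, UCX exposes its version both at compile time and at run time, so such a check is at least possible in principle. A minimal sketch (hypothetical, not Open MPI code) using the public UCP API:
/* ucxver.c: compare the UCX version this file was compiled against with the
 * version of the library actually loaded at run time.
 * Build with: cc ucxver.c -lucp -o ucxver */
#include <stdio.h>
#include <ucp/api/ucp.h>

int main(void) {
    unsigned major, minor, release;
    ucp_get_version(&major, &minor, &release); /* run-time library version */
    printf("compiled against UCX %d.%d, loaded UCX %u.%u.%u\n",
           UCP_API_MAJOR, UCP_API_MINOR, major, minor, release);
    return 0;
}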
@dalcinl I do not know how to interpret the valgrind output, but here's what I got:
mpiexec -n 1 valgrind ./helloworld
gives
==1970663== Memcheck, a memory error detector
==1970663== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==1970663== Using Valgrind-3.23.0 and LibVEX; rerun with -h for copyright info
==1970663== Command: ./helloworld
==1970663==
==1970663== WARNING: valgrind ignores shmget(shmflg) SHM_HUGETLB
==1970663== Conditional jump or move depends on uninitialised value(s)
==1970663== at 0xAA36382: ??? (in /usr/lib64/libibverbs.so.1.14.43.0)
==1970663== by 0xAA36B63: ibv_cmd_create_srq (in /usr/lib64/libibverbs.so.1.14.43.0)
==1970663== by 0xAC9E43E: ??? (in /usr/lib64/libmlx5.so.1.24.43.0)
==1970663== by 0xAA3F0DA: ibv_create_srq (in /usr/lib64/libibverbs.so.1.14.43.0)
==1970663== by 0xA7CA596: uct_rc_iface_init_rx (in /usr/lib64/ucx/libuct_ib.so.0.0.0)
==1970663== by 0xA7CAB00: uct_rc_iface_t_init (in /usr/lib64/ucx/libuct_ib.so.0.0.0)
==1970663== by 0xA7CE8AF: ??? (in /usr/lib64/ucx/libuct_ib.so.0.0.0)
==1970663== by 0xA7CEED9: ??? (in /usr/lib64/ucx/libuct_ib.so.0.0.0)
==1970663== by 0x9E2CE6A: uct_iface_open (in /usr/lib64/libuct.so.0.0.0)
==1970663== by 0x95705BA: mca_btl_uct_context_create (in /home/yhlee/miniforge3/envs/test-env/lib/openmpi/mca_btl_uct.so)
==1970663== by 0x9570977: mca_btl_uct_query_tls (in /home/yhlee/miniforge3/envs/test-env/lib/openmpi/mca_btl_uct.so)
==1970663== by 0x956AF10: mca_btl_uct_component_init (in /home/yhlee/miniforge3/envs/test-env/lib/openmpi/mca_btl_uct.so)
==1970663==
==1970663== Thread 2:
==1970663== Syscall param writev(vector[1]) points to uninitialised byte(s)
==1970663== at 0x4D76F4F: writev (in /usr/lib64/libc-2.28.so)
==1970663== by 0x540D3C3: pmix_ptl_base_send_handler (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663== by 0x41FC94D: event_process_active_single_queue (in /home/yhlee/miniforge3/envs/test-env/lib/libevent_core-2.1.so.7.0.1)
==1970663== by 0x41FD266: event_base_loop (in /home/yhlee/miniforge3/envs/test-env/lib/libevent_core-2.1.so.7.0.1)
==1970663== by 0x538CD19: progress_engine (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663== by 0x4A391C9: start_thread (pthread_create.c:479)
==1970663== by 0x4C8AE72: clone (in /usr/lib64/libc-2.28.so)
==1970663== Address 0xf1b1e8d is 29 bytes inside a block of size 512 alloc'd
==1970663== at 0x4042DDC: realloc (vg_replace_malloc.c:1800)
==1970663== by 0x53D99EA: pmix_bfrop_buffer_extend (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663== by 0x53E2513: pmix_bfrops_base_pack_byte (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663== by 0x53E2E80: pmix_bfrops_base_pack_buf (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663== by 0x53E2218: pmix_bfrops_base_pack (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663== by 0x532BE3C: _commitfn (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663== by 0x41FCA75: event_process_active_single_queue (in /home/yhlee/miniforge3/envs/test-env/lib/libevent_core-2.1.so.7.0.1)
==1970663== by 0x41FD266: event_base_loop (in /home/yhlee/miniforge3/envs/test-env/lib/libevent_core-2.1.so.7.0.1)
==1970663== by 0x538CD19: progress_engine (in /home/yhlee/miniforge3/envs/test-env/lib/libpmix.so.2.13.2)
==1970663== by 0x4A391C9: start_thread (pthread_create.c:479)
==1970663== by 0x4C8AE72: clone (in /usr/lib64/libc-2.28.so)
==1970663==
==1970663== Thread 1:
==1970663== Jump to the invalid address stated on the next line
==1970663== at 0x3FF00000001: ???
==1970663== by 0x956A379: mca_btl_uct_tl_progress.part.0 (in /home/yhlee/miniforge3/envs/test-env/lib/openmpi/mca_btl_uct.so)
==1970663== by 0x956A666: mca_btl_uct_component_progress (in /home/yhlee/miniforge3/envs/test-env/lib/openmpi/mca_btl_uct.so)
==1970663== by 0x5033BD2: opal_progress (in /home/yhlee/miniforge3/envs/test-env/lib/libopen-pal.so.80.0.3)
==1970663== by 0x40BC699: ompi_mpi_instance_init_common (in /home/yhlee/miniforge3/envs/test-env/lib/libmpi.so.40.40.3)
==1970663== by 0x40BC784: ompi_mpi_instance_init (in /home/yhlee/miniforge3/envs/test-env/lib/libmpi.so.40.40.3)
==1970663== by 0x40B035F: ompi_mpi_init (in /home/yhlee/miniforge3/envs/test-env/lib/libmpi.so.40.40.3)
==1970663== by 0x40E3450: PMPI_Init_thread (in /home/yhlee/miniforge3/envs/test-env/lib/libmpi.so.40.40.3)
==1970663== by 0x1091AD: main (in /home/yhlee/helloworld)
==1970663== Address 0x3ff00000001 is not stack'd, malloc'd or (recently) free'd
==1970663==
[log02:1970663:0:1970663] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x3ff00000001)
==1970663== Invalid read of size 1
==1970663== at 0x83E1229: x86_64_fallback_frame_state (md-unwind-support.h:63)
==1970663== by 0x83E1229: uw_frame_state_for (unwind-dw2.c:1013)
==1970663== by 0x83E25C3: _Unwind_Backtrace (unwind.inc:303)
==1970663== by 0x4D8C4A5: backtrace (in /usr/lib64/libc-2.28.so)
==1970663== by 0x98BA9F8: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663== by 0x98BACDF: ucs_debug_backtrace_create (in /usr/lib64/libucs.so.0.0.0)
==1970663== by 0x98BB243: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663== by 0x98BD97F: ucs_handle_error (in /usr/lib64/libucs.so.0.0.0)
==1970663== by 0x98BDB6B: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663== by 0x98BDD39: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663== by 0x4A43CEF: ??? (in /usr/lib64/libpthread-2.28.so)
==1970663== by 0x3FF00000000: ???
==1970663== Address 0x3ff00000001 is not stack'd, malloc'd or (recently) free'd
==1970663==
==1970663==
==1970663== Process terminating with default action of signal 11 (SIGSEGV)
==1970663== Access not within mapped region at address 0x3FF00000001
==1970663== at 0x83E1229: x86_64_fallback_frame_state (md-unwind-support.h:63)
==1970663== by 0x83E1229: uw_frame_state_for (unwind-dw2.c:1013)
==1970663== by 0x83E25C3: _Unwind_Backtrace (unwind.inc:303)
==1970663== by 0x4D8C4A5: backtrace (in /usr/lib64/libc-2.28.so)
==1970663== by 0x98BA9F8: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663== by 0x98BACDF: ucs_debug_backtrace_create (in /usr/lib64/libucs.so.0.0.0)
==1970663== by 0x98BB243: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663== by 0x98BD97F: ucs_handle_error (in /usr/lib64/libucs.so.0.0.0)
==1970663== by 0x98BDB6B: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663== by 0x98BDD39: ??? (in /usr/lib64/libucs.so.0.0.0)
==1970663== by 0x4A43CEF: ??? (in /usr/lib64/libpthread-2.28.so)
==1970663== by 0x3FF00000000: ???
==1970663== If you believe this happened as a result of a stack
==1970663== overflow in your program's main thread (unlikely but
==1970663== possible), you can try to increase the size of the
==1970663== main thread stack using the --main-stacksize= flag.
==1970663== The main thread stack size used in this run was 8388608.
==1970663==
==1970663== HEAP SUMMARY:
==1970663== in use at exit: 6,186,949 bytes in 18,884 blocks
==1970663== total heap usage: 57,148 allocs, 38,264 frees, 15,172,886 bytes allocated
==1970663==
==1970663== LEAK SUMMARY:
==1970663== definitely lost: 767 bytes in 29 blocks
==1970663== indirectly lost: 51,862 bytes in 13 blocks
==1970663== possibly lost: 277,266 bytes in 45 blocks
==1970663== still reachable: 5,856,966 bytes in 18,794 blocks
==1970663== suppressed: 88 bytes in 3 blocks
==1970663== Rerun with --leak-check=full to see details of leaked memory
==1970663==
==1970663== Use --track-origins=yes to see where uninitialised values come from
==1970663== For lists of detected and suppressed errors, rerun with: -s
==1970663== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 1970663 on node log02 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
@piyueh I don't think there was a UCX loaded by default; I tried to rely on the dependency chains that conda resolves automatically. As you suggested, if I specify openmpi=5.0.3 and install ucx along with it, it works fine. Thank you for sharing your resolution, though it probably needs to be properly addressed on the Open MPI side.
I don't think there was a UCX loaded by default.
Well, I do see uct_* symbols in your valgrind output, therefore it is somehow being used.
EDIT: Maybe this is actually an issue/bug in the older UCX installed in your system.
EDIT: Maybe this is actually an issue/bug in the older UCX installed in your system.
Thanks. I will check the UCX version installed on my system.
Solution to issue cannot be found in the documentation.
Issue
I experience a segmentation fault (signal 11) when using MPI on Linux. Originally I suspected a bug in mpi4py, but it seems that openmpi is more relevant to this error, as @dalcinl mentioned in https://github.com/mpi4py/mpi4py/issues/523. Here's how the segmentation fault is replicated.
The python code is
The expected result is
The actual result is
Installed packages
Environment info