conda-forge / openmpi-feedstock

A conda-smithy repository for openmpi.
BSD 3-Clause "New" or "Revised" License

CUDA awareness is not built in 4.1.0 #83

Closed: leofang closed this issue 3 years ago

leofang commented 3 years ago

Something may have gone wrong in the recent recipe updates. I can't get mpi4py's GPU example to pass:

$ OMPI_MCA_opal_cuda_support=true  mpirun -n 2 python demo/cuda-aware-mpi/use_cupy.py 
--------------------------------------------------------------------------
The user requested CUDA support with the --mca mpi_cuda_support 1 flag
but the library was not compiled with any support.
--------------------------------------------------------------------------
[WCS-164320:29458] *** Process received signal ***
[WCS-164320:29458] Signal: Segmentation fault (11)
[WCS-164320:29458] Signal code: Address not mapped (1)
[WCS-164320:29458] Failing at address: (nil)
[WCS-164320:29458] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7f596b2dd8a0]
[WCS-164320:29458] *** End of error message ***
--------------------------------------------------------------------------
The user requested CUDA support with the --mca mpi_cuda_support 1 flag
but the library was not compiled with any support.
--------------------------------------------------------------------------
[WCS-164320:29459] *** Process received signal ***
[WCS-164320:29459] Signal: Segmentation fault (11)
[WCS-164320:29459] Signal code: Address not mapped (1)
[WCS-164320:29459] Failing at address: (nil)
[WCS-164320:29459] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7f84ec4898a0]
[WCS-164320:29459] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node WCS-164320 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
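
The error above says the library was not compiled with CUDA support. One way to confirm this independently of mpi4py is to ask `ompi_info` for the build-time flag, as the Open MPI FAQ suggests (`ompi_info --parsable --all | grep mpi_built_with_cuda_support:value`). The following is a minimal sketch of that check; the helper names `parse_cuda_support` and `query_cuda_support` are made up for illustration, and it assumes `ompi_info` from the active environment is on `PATH`.

```python
import shutil
import subprocess

# Prefix of the line that `ompi_info --parsable --all` prints for the
# build-time CUDA flag; the value after it is "true" or "false".
CUDA_KEY = "mca:mpi:base:param:mpi_built_with_cuda_support:value:"


def parse_cuda_support(parsable_output):
    """Return True/False from ompi_info's parsable output, or None if the key is absent."""
    for line in parsable_output.splitlines():
        line = line.strip()
        if line.startswith(CUDA_KEY):
            return line[len(CUDA_KEY):].lower() == "true"
    return None


def query_cuda_support():
    """Run ompi_info found on PATH; return None if it is missing or prints no flag."""
    exe = shutil.which("ompi_info")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--parsable", "--all"],
        capture_output=True, text=True, check=False,
    )
    return parse_cuda_support(result.stdout)


if __name__ == "__main__":
    print("built with CUDA support:", query_cuda_support())
```

A result of `False` would match the message in the log above, while `None` only means the flag could not be read.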

Environment:

$ conda list
# packages in environment at /home/leofang/miniconda3/envs/mpi4py_dev:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
binutils_impl_linux-64    2.35.1               h193b22a_2    conda-forge
binutils_linux-64         2.35                h67ddf6f_30    conda-forge
ca-certificates           2020.12.5            ha878542_0    conda-forge
certifi                   2020.12.5        py38h578d9bd_1    conda-forge
cudatoolkit               11.2.1               h8204236_8    conda-forge
cudnn                     8.1.0.77             h90431f1_0    conda-forge
cupy                      8.5.0            py38h69dedff_1    conda-forge
cutensor                  1.2.2.5              h96e36e3_3    conda-forge
cython                    0.29.22          py38h709712a_0    conda-forge
fastrlock                 0.6              py38h709712a_0    conda-forge
gcc_impl_linux-64         9.3.0               h70c0ae5_18    conda-forge
gcc_linux-64              9.3.0               hf25ea35_30    conda-forge
kernel-headers_linux-64   2.6.32              h77966d4_13    conda-forge
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
libblas                   3.9.0                8_openblas    conda-forge
libcblas                  3.9.0                8_openblas    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-devel_linux-64     9.3.0               h7864c58_18    conda-forge
libgcc-ng                 9.3.0               h2828fa1_18    conda-forge
libgfortran-ng            9.3.0               hff62375_18    conda-forge
libgfortran5              9.3.0               hff62375_18    conda-forge
libgomp                   9.3.0               h2828fa1_18    conda-forge
liblapack                 3.9.0                8_openblas    conda-forge
libopenblas               0.3.12          pthreads_h4812303_1    conda-forge
libstdcxx-ng              9.3.0               h6de172a_18    conda-forge
mpi                       1.0                     openmpi    conda-forge
mpi4py                    3.1.0a0                  pypi_0    pypi
nccl                      2.8.4.1              hdc17891_3    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
numpy                     1.20.1           py38h18fd61f_0    conda-forge
openmpi                   4.1.0                hbfc84c5_3    conda-forge
openmpi-mpicc             4.1.0                h7f98852_3    conda-forge
openssl                   1.1.1j               h7f98852_0    conda-forge
pip                       21.0.1             pyhd8ed1ab_0    conda-forge
python                    3.8.8           hffdb5ce_0_cpython    conda-forge
python_abi                3.8                      1_cp38    conda-forge
readline                  8.0                  he28a2e2_2    conda-forge
setuptools                49.6.0           py38h578d9bd_3    conda-forge
sqlite                    3.35.2               h74cdb3f_0    conda-forge
sysroot_linux-64          2.12                h77966d4_13    conda-forge
tk                        8.6.10               h21135ba_1    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge

leofang commented 3 years ago

OK, we're lucky here: Downgrading from Open MPI 4.1.0 to 4.0.5 makes it work. This helps narrow down the search...

leofang commented 3 years ago

4.1.0 build 0 is OK too (#71).

leofang commented 3 years ago

This failure starts from build 2 (#80).
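
The narrowing down above amounts to installing one build of the same version at a time and rerunning the demo. A rough sketch of how the install commands could be generated is shown below. It relies on the conda match spec form `name=version=build_string` and on the build strings in the listing above ending in `_<build number>` (for example `hbfc84c5_3`); the helper names are hypothetical, and the commands are only printed, not executed.

```python
def build_spec(version, build_number, name="openmpi"):
    """Match spec selecting any build string that ends in _<build_number>."""
    return f"{name}={version}=*_{build_number}"


def install_commands(version, build_numbers, channel="conda-forge"):
    """One `conda install` command per build; the spec is quoted so the shell keeps the `*`."""
    return [
        f"conda install -c {channel} '{build_spec(version, n)}'"
        for n in build_numbers
    ]


if __name__ == "__main__":
    # Print a command for each of builds 0 through 3 of Open MPI 4.1.0.
    for cmd in install_commands("4.1.0", range(4)):
        print(cmd)
```

After each install, the mpirun command from the first comment can be rerun to see whether that build passes.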

leofang commented 3 years ago

> This failure starts from build 2 (#80).

Not sure if it's relevant, but I also noticed that the post-link message has stopped appearing since this build.

jakirkham commented 3 years ago

CUDA 9.2 was dropped from conda-forge recently, so that might explain the issue seen in PR https://github.com/conda-forge/openmpi-feedstock/pull/80.

leofang commented 3 years ago

@jakirkham I forgot to update the version check in the build script 😂 See #84.
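
The thread does not show the build script, so the following is only a guess at the general failure mode, not the feedstock's actual code. It illustrates how a check pinned to one CUDA version can silently stop enabling CUDA once that version leaves the build matrix. The function names, the pinned value, and the `"None"` placeholder for CPU-only variants are all assumptions made for this sketch.

```python
# Hypothetical illustration only; this is not the feedstock's build script.

def cuda_enabled_exact(cuda_version, pinned="9.2"):
    """Brittle check: CUDA is enabled only for one hard-coded toolkit version."""
    return cuda_version == pinned


def cuda_enabled_any(cuda_version):
    """More tolerant check: enable CUDA for any variant that names a toolkit."""
    return cuda_version not in (None, "", "None")


if __name__ == "__main__":
    # Once "9.2" is no longer built, the exact check is False for every variant.
    for version in ["10.2", "11.2", "None"]:
        print(version, cuda_enabled_exact(version), cuda_enabled_any(version))
```

With the pinned check, no remaining variant enables CUDA, which would be consistent with the "not compiled with any support" message reported at the top of this issue.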

jakirkham commented 3 years ago

All good 🙂

leofang commented 3 years ago

Thanks for jumping in, John!

leofang commented 3 years ago

@jakirkham not complaining, but it'd be great to have a GPU CI (https://github.com/conda-forge/conda-forge.github.io/issues/1272) sooner...

leofang commented 3 years ago

Verified locally that #84 (and #85 for rc) fixed the problem.