easybuilders / easybuild-easyconfigs

A collection of easyconfig files that describe which software to build using which build options with EasyBuild.
https://easybuild.io

FFTW3 tests issues and potential OpenMPI 3.1.1 bug related to PMIx #8563

Open jordiblasco opened 5 years ago

jordiblasco commented 5 years ago

The tests of FFTW-3.3.8-gompic-2018b.eb cannot find libcuda when FFTW is compiled with CUDA-aware OpenMPI. Tested on CentOS Linux release 7.6.1810.

It seems related to the following two issues:

I guess that mpirun needs -x LD_LIBRARY_PATH, since otherwise it only looks for the CUDA libraries in default locations such as /usr/lib64.
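A minimal sketch of the workaround I have in mind (the driver library location /usr/lib64/nvidia is only an example; use wherever libcuda.so.1 actually lives on the node):

# make the NVIDIA driver library visible and export the variable to the MPI ranks
export LD_LIBRARY_PATH=/usr/lib64/nvidia:$LD_LIBRARY_PATH
mpirun -np 3 -x LD_LIBRARY_PATH /dev/shm/FFTW/3.3.8/gompic-2018b/fftw-3.3.8/mpi/mpi-bench --verify 'obr6x9x4x6'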

Executing "mpirun -np 3 /dev/shm/FFTW/3.3.8/gompic-2018b/fftw-3.3.8/mpi/mpi-bench -o nthreads=2 --verbose=1   --verify 'obr6x9x4x6' --verify 'ibr6x9x4x6' --verify 'ofr6x9x4x6' --verify 'ifr6x9x4x6' --verify 'obc6x9x4x6' --verify 'ibc6x9x4x6' --verify 'ofc6x9x4x6' --verify 'ifc6x9x4x6'"
--------------------------------------------------------------------------
The library attempted to open the following supporting CUDA libraries,
but each of them failed.  CUDA-aware support is disabled.
libcuda.so.1: cannot open shared object file: No such file or directory
libcuda.dylib: cannot open shared object file: No such file or directory
/usr/lib64/libcuda.so.1: cannot open shared object file: No such file or directory
/usr/lib64/libcuda.dylib: cannot open shared object file: No such file or directory
If you are not interested in CUDA-aware support, then run with
--mca mpi_cuda_support 0 to suppress this message.  If you are interested
in CUDA-aware support, then try setting LD_LIBRARY_PATH to the location
of libcuda.so.1 to get passed this issue.
--------------------------------------------------------------------------

Also, this particular release of OpenMPI is affected by this bug: https://github.com/open-mpi/ompi/issues/5336

[hpcnow01:95851] OPAL ERROR: Error in file pmix2x.c at line 326
[hpcnow01:95851] OPAL ERROR: Error in file pmix2x.c at line 326
[hpcnow01:95851] OPAL ERROR: Error in file pmix2x.c at line 326
[hpcnow01:95851] *** Process received signal ***
[hpcnow01:95851] Signal: Segmentation fault (11)
[hpcnow01:95851] Signal code: Invalid permissions (2)
[hpcnow01:95851] Failing at address: 0x7fecb4021098
[hpcnow01:95851] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7fecc42625d0]
[hpcnow01:95851] [ 1] /hpc/easybuild/broadwell/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_pmix_pmix2x.so(pmix2x_value_unload+0x0)[0x7fecc2ee6630]
[hpcnow01:95851] [ 2] /hpc/easybuild/broadwell/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_pmix_pmix2x.so(pmix2x_event_hdlr+0x2e4)[0x7fecc2ee7364]
[hpcnow01:95851] [ 3] /hpc/easybuild/broadwell/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_pmix_pmix2x.so(pmix_invoke_local_event_hdlr+0x325)[0x7fecc2efe675]
[hpcnow01:95851] [ 4] /hpc/easybuild/broadwell/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_pmix_pmix2x.so(+0x3b10d)[0x7fecc2f0310d]
[hpcnow01:95851] [ 5] /hpc/easybuild/broadwell/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_pmix_pmix2x.so(+0x3c9e2)[0x7fecc2f049e2]
[hpcnow01:95851] [ 6] /hpc/easybuild/broadwell/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_pmix_pmix2x.so(pmix_ptl_base_process_msg+0x1ca)[0x7fecc2f6964a]
[hpcnow01:95851] [ 7] /hpc/easybuild/broadwell/software/OpenMPI/3.1.1-gcccuda-2018b/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xd89)[0x7fecc3c2ff69]
[hpcnow01:95851] [ 8] /hpc/easybuild/broadwell/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_pmix_pmix2x.so(+0x7c1fe)[0x7fecc2f441fe]
[hpcnow01:95851] [ 9] /lib64/libpthread.so.0(+0x7dd5)[0x7fecc425add5]
[hpcnow01:95851] [10] /lib64/libc.so.6(clone+0x6d)[0x7fecc3f83ead]
[hpcnow01:95851] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node hpcnow01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[hpcnow01:95831] 1 more process has sent help message help-mpi-common-cuda.txt / dlopen failed
boegel commented 5 years ago

@akesandgren Thoughts on this one?

akesandgren commented 5 years ago

The CUDA problem above is that you haven't installed the runtime libraries for CUDA; they come from the NVIDIA packages installed by the OS. On Ubuntu the package is called "libcuda1-418", or whatever matches the NVIDIA driver version currently in use.
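A quick way to check whether the driver library is actually present on the build host (a generic check, nothing EasyBuild-specific):

# should list an entry such as /usr/lib64/libcuda.so.1 if the NVIDIA driver packages are installed
ldconfig -p | grep libcuda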

6602 and https://github.com/easybuilders/easybuild-easyblocks/pull/1464 have nothing to do with the runtime environment; they are only about build time.

I.e., to be able to run the tests when building FFTW with gompic, you either need the CUDA runtime libraries installed or you have to put up with the above warning message from OpenMPI. Alternatively, explicitly set OMPI_MCA_mpi_cuda_support=0 in the environment before building.
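For example, either of the following (the eb command line and the relative mpi-bench path are only illustrative):

# disable CUDA-aware support for everything launched from this shell, then build
export OMPI_MCA_mpi_cuda_support=0
eb FFTW-3.3.8-gompic-2018b.eb --robot

# or disable it for a single run, via the MCA parameter mentioned in the warning
mpirun --mca mpi_cuda_support 0 -np 3 ./mpi-bench --verify 'obr6x9x4x6'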

akesandgren commented 5 years ago

As for the OpenMPI problem: yes, if you use the PMIx that is bundled with OpenMPI, you might suffer from that bug.

We (at HPC2N) always build against an external PMIx so we have better control over which version is in use; the same goes for UCX.
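Roughly, that means configuring OpenMPI against an external PMIx and UCX along these lines (only a sketch: the install prefixes, including $CUDA_HOME, are placeholders, and in an EasyBuild setup this ends up in the configopts and the PMIx/UCX dependencies of the OpenMPI easyconfig):

# build OpenMPI against external PMIx/UCX instead of the bundled copies
./configure --prefix=/opt/openmpi-3.1.1 \
            --with-pmix=/opt/pmix \
            --with-ucx=/opt/ucx \
            --with-cuda=$CUDA_HOME
make -j && make install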