3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Problem with Relion version 4.0-beta-2 and MPI #918

Closed olofsvensson closed 2 years ago

olofsvensson commented 2 years ago

Hi,

I have compiled and installed version 4.0-beta-2 of Relion using these commands:

mkdir relion-4.0
cd relion-4.0
git clone https://github.com/3dem/relion.git
cd relion
git checkout ver4.0
mkdir build
cd build
cmake .. -DCMAKE_C_COMPILER=gcc-8 -DCMAKE_CXX_COMPILER=g++-8 
make -j 16
cmake -DCMAKE_INSTALL_PREFIX=/opt/pxsoft/relion/v4.0-beta-2/ubuntu20.04 ..
make -j 16
make install

The RELION GUI starts OK; however, when I try to run a 3D classification job I get the following error:

libibverbs: Warning: couldn't load driver 'libmlx4-rdmav34.so': libmlx4-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libmlx4-rdmav34.so': libmlx4-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'librxe-rdmav34.so': librxe-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'librxe-rdmav34.so': librxe-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libmlx4-rdmav34.so': libmlx4-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'librxe-rdmav34.so': librxe-rdmav34.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           dgx01
  Local device:         mlx5_3
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
[dgx01:2080916] 2 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[dgx01:2080916] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[dgx01:2081012] *** Process received signal ***
[dgx01:2081012] Signal: Segmentation fault (11)
[dgx01:2081012] Signal code: Address not mapped (1)
[dgx01:2081012] Failing at address: 0xffffffffffffff90
[dgx01:2081012] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fd61d2ec3c0]
[dgx01:2081012] [ 1] /usr/lib/x86_64-linux-gnu/libibverbs.so.1(ibv_dereg_mr+0x16)[0x7fd61ac6bf96]
[dgx01:2081012] [ 2] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x51732)[0x7fd61a45b732]
[dgx01:2081012] [ 3] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x734e1)[0x7fd61a47d4e1]
[dgx01:2081012] [ 4] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x735d1)[0x7fd61a47d5d1]
[dgx01:2081012] [ 5] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x770b5)[0x7fd61a4810b5]
[dgx01:2081012] [ 6] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x72ca4)[0x7fd61a47cca4]
[dgx01:2081012] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so(+0x4c8e)[0x7fd61a5c3c8e]
[dgx01:2081012] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_cm.so(+0x2958)[0x7fd61aca5958]
[dgx01:2081012] [ 9] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3c0)[0x7fd61d623c50]
[dgx01:2081012] [10] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7fd61d624061]
[dgx01:2081012] [11] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x12e)[0x7fd61a35ddae]
[dgx01:2081012] [12] /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Bcast+0x120)[0x7fd61d5e6b10]
[dgx01:2081012] [13] /opt/pxsoft/relion/v4.0-beta-2/ubuntu20.04/bin/relion_refine_mpi(_ZN7MpiNode16relion_MPI_BcastEPvlP15ompi_datatype_tiP19ompi_communicator_t+0x176)[0x5563665fd9d6]
[dgx01:2081012] [14] /opt/pxsoft/relion/v4.0-beta-2/ubuntu20.04/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x178)[0x5563665e4928]
[dgx01:2081012] [15] /opt/pxsoft/relion/v4.0-beta-2/ubuntu20.04/bin/relion_refine_mpi(main+0x71)[0x55636659cfb1]
[dgx01:2081012] [16] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fd61cd690b3]
[dgx01:2081012] [17] /opt/pxsoft/relion/v4.0-beta-2/ubuntu20.04/bin/relion_refine_mpi(_start+0x2e)[0x5563665a01ee]
[dgx01:2081012] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node dgx01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Here is some more output:

RELION version: 4.0-beta-2-commit-3b1752 
Precision: BASE=double

 === RELION MPI setup ===
 + Number of MPI processes             = 3
 + Leader  (0) runs on host            = dgx01
 + Follower     1 runs on host            = dgx01
 + Follower     2 runs on host            = dgx01
 =================
 uniqueHost dgx01 has 2 ranks.
 Using explicit indexing on follower 1 to assign devices  6
 Thread 0 on follower 1 mapped to device 6
 Using explicit indexing on follower 2 to assign devices  7
 Thread 0 on follower 2 mapped to device 7
 Running CPU instructions in double precision. 
 Estimating initial noise spectra 
000/??? sec ~~(,_,">                                                          [oo]
0.03/2.02 min ~~(,_,">
0.08/2.52 min .~~(,_,">
...
1.65/1.67 min ...........................................................~~(,_,">
1.65/1.65 min ............................................................~~(,_,">

and

`which relion_refine_mpi` --o Class3D/job212/run --i Refine3D/job182/run_ct26_data.star --ref Refine3D/job182/run_ct26_class001.mrc --ini_high 40 --dont_combine_weights_via_disc --pool 3 --pad 2  --skip_gridding  --ctf --ctf_intact_first_peak --iter 25 --tau2_fudge 4 --particle_diameter 600 --K 3 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C14 --norm --scale  --helix --helical_inner_diameter 390 --helical_outer_diameter 520 --ignore_helical_symmetry --helical_keep_tilt_prior_fixed --sigma_tilt 5 --sigma_psi 3.33333 --sigma_rot 0 --j 1 --gpu "6:7"  --pipeline_control Class3D/job212/

Any help is appreciated!

Regards,

Olof

biochem-fan commented 2 years ago

I don't think this is RELION's problem. This is probably an MPI issue.

Is this on an HPC with Infiniband?

Why are you changing the compiler with `cmake .. -DCMAKE_C_COMPILER=gcc-8 -DCMAKE_CXX_COMPILER=g++-8`?

Which MPI runtime are you using? Is it the same as the one used during compilation?
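One way to answer that question is to compare the version reported by `mpirun --version` against the MPI library the binary actually links (`ldd "$(which relion_refine_mpi)" | grep -i mpi`). A minimal sketch for extracting the version number from the banner; the helper name is my own, and the sample string assumes Open MPI's usual first-line format:

```shell
# Hypothetical helper: pull the version number out of the first line of
# `mpirun --version` output (last whitespace-separated field of the banner).
get_ompi_version() {
  printf '%s\n' "$1" | awk '{print $NF}'
}

# On a live system you would feed it the real banner, e.g.:
#   get_ompi_version "$(mpirun --version | head -n1)"
get_ompi_version "mpirun (Open MPI) 4.0.3"   # prints 4.0.3
```

If the runtime version differs from the one the wrapper compilers used at build time, collectives such as `MPI_Bcast` (where the backtrace above fails) can misbehave.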

olofsvensson commented 2 years ago

The problem was finally solved by changing the MPI version, so it was indeed not related to RELION.

For anyone who might run into the same problem: we were trying to run RELION on a DGX-1 running Ubuntu 20.04. The default version of `mpirun` on Ubuntu 20.04 is OpenMPI 4.0.3. Switching to OpenMPI 4.1.2a1 solved the problem, and RELION now runs fine with MPI on this computer.
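If you want a job script to refuse to launch with the older runtime, a version comparison via GNU `sort -V` is one option. A sketch, assuming GNU coreutils; the function name is mine, and the 4.1.2 threshold is taken from the fix above:

```shell
# version_ge A B -> exit 0 if version A >= version B
# (relies on GNU coreutils `sort -V` for version-aware ordering)
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# e.g. guard against the Open MPI 4.0.3 that segfaulted here:
if version_ge "4.0.3" "4.1.2"; then
  echo "OpenMPI is new enough"
else
  echo "OpenMPI older than 4.1.2 - known to segfault with RELION on this box"
fi
```

On a live system you would substitute the real version string, e.g. `version_ge "$(mpirun --version | head -n1 | awk '{print $NF}')" "4.1.2"`.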