3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

3D auto-refine MPI issues #883

Closed Cygnus2015 closed 9 months ago

Cygnus2015 commented 2 years ago

Hi All,

I ran into this issue during 3D auto-refine, running the job with 3 MPI processes and 2 threads. Any suggestions would be appreciated.

Thanks. Park

Environment:

OS: [Ubuntu 20.04.2 LTS]
MPI runtime: [OpenMPI 2.0.1]
RELION version: [RELION-4.0-beta-2-commit-ce2e93]
Memory: [128 GB]
GPU: [GTX 1080Ti]

Dataset:

Box size: [256 px]
Pixel size: [0.97 Å/px]
Number of particles: [e.g. 150,000]
Description: [apoferritin]

Job options:

Type of job: [Refine3D]
Number of MPI processes: [3]
Number of threads: [2]
Full command (see note.txt in the job directory):

which relion_refine_mpi --o Refine3D/job085/run --auto_refine --split_random_halves --i Extract/job056/particles.star --ref Class3D/job049/run_it025_class001_256px.mrc --firstiter_cc --ini_high 60 --dont_combine_weights_via_disc --pool 3 --pad 2 --auto_ignore_angles --auto_resol_angles --ctf --particle_diameter 200 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym O --low_resol_join_halves 40 --norm --scale --j 2 --gpu "" --pipeline_control Refine3D/job085/

Error message:

================== ERROR: MlOptimiserMpi::initialiseWorkLoad: at least 3 MPI processes are required when splitting data into random halves

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.

in: /home/hoanglab/relion/src/ml_optimiser_mpi.cpp, line 539
ERROR: MlOptimiserMpi::initialiseWorkLoad: at least 3 MPI processes are required when splitting data into random halves
=== Backtrace ===
/usr/local/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7d) [0x55fb830b785d]
/usr/local/bin/relion_refine_mpi(+0x4b138) [0x55fb83046138]
/usr/local/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0xac) [0x55fb830ea62c]
/usr/local/bin/relion_refine_mpi(main+0x71) [0x55fb830a2c91]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fd7093a20b3]
/usr/local/bin/relion_refine_mpi(_start+0x2e) [0x55fb830a5fbe]

ERROR: MlOptimiserMpi::initialiseWorkLoad: at least 3 MPI processes are required when splitting data into random halves

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.

in: /home/hoanglab/relion/src/ml_optimiser_mpi.cpp, line 539
ERROR: MlOptimiserMpi::initialiseWorkLoad: at least 3 MPI processes are required when splitting data into random halves
=== Backtrace ===
/usr/local/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7d) [0x55fdfcbf285d]
/usr/local/bin/relion_refine_mpi(+0x4b138) [0x55fdfcb81138]
/usr/local/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0xac) [0x55fdfcc2562c]
/usr/local/bin/relion_refine_mpi(main+0x71) [0x55fdfcbddc91]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f31df8c40b3]
/usr/local/bin/relion_refine_mpi(_start+0x2e) [0x55fdfcbe0fbe]

ERROR: MlOptimiserMpi::initialiseWorkLoad: at least 3 MPI processes are required when splitting data into random halves

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.

=== RELION MPI setup ===

biochem-fan commented 2 years ago

Please use the issue template! Without information on your system, we cannot answer.

It appears that your queue system and/or MPI runtime is not set up properly. Perhaps you are using a different MPI runtime from what you used to compile RELION.
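A quick way to compare the launcher on your PATH with the library the binary was actually linked against (a sketch, assuming relion_refine_mpi is on your PATH; paths will differ per install):

which mpirun && mpirun --version                    # the MPI launcher currently picked up
ldd $(which relion_refine_mpi) | grep -i libmpi     # the libmpi the binary loads at run time
mpirun -n 3 bash -c 'echo rank $OMPI_COMM_WORLD_RANK of $OMPI_COMM_WORLD_SIZE'

With a consistent Open MPI setup the last command prints ranks 0, 1 and 2 of 3. If the variables are empty or every line reports rank 0 of 1, each process runs in its own MPI world, which is exactly why RELION complains that fewer than 3 MPI processes are available for splitting into random halves.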

Cygnus2015 commented 2 years ago

I updated it using the issue template. Thanks.

biochem-fan commented 2 years ago

What does which mpirun say?

Does it show the same MPI implementation as the one you used for compilation?

Cygnus2015 commented 2 years ago

It said /home/lab/anaconda3/bin/mpirun.

biochem-fan commented 2 years ago

This is suspicious: it is from conda. Check your CMake log file (if you don't have it, recompile RELION from scratch and check what it says). Probably you used a system-wide OpenMPI (e.g. /usr/bin/mpicc) for compilation but are using the OpenMPI runtime from conda. This mismatch is the cause.
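If the build directory is still around, the MPI that CMake picked up at configure time is recorded in its cache. A sketch of what to look for (the build path is an example, and the exact cache variable names vary a little between CMake versions):

grep -iE 'MPI_(C|CXX)_COMPILER|MPIEXEC' /path/to/relion/build/CMakeCache.txt

If those entries point at the system OpenMPI (e.g. under /usr) while which mpirun returns the anaconda3 path, compilation and runtime are using different OpenMPI installations.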

Cygnus2015 commented 2 years ago

Thanks for suggesting it. I will check this.

Cygnus2015 commented 2 years ago

I checked the CMakeOutput.log file in /home/lab/relion/build/CMakeFiles. As you mentioned, the compilation used a system-wide OpenMPI.

I would appreciate it if you could suggest anything.

Thanks.

biochem-fan commented 2 years ago

Run RELION without activating any conda environment (including base).
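For example (a sketch; the exact paths depend on your setup):

conda deactivate                               # repeat until no environment, including base, is active
conda config --set auto_activate_base false    # optional: stop base from auto-activating in new shells
which mpirun                                   # should now resolve to the system OpenMPI, e.g. /usr/bin/mpirun
mpirun --version

Then relaunch the Refine3D job with the same 3 MPI processes / 2 threads; once launcher and library come from the same OpenMPI installation, the random-halves check should pass.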

biochem-fan commented 9 months ago

No response for many months. Closing.