3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Relion-5.0 MPI issue #1004

Closed CFDavidHou closed 10 months ago

CFDavidHou commented 10 months ago

Describe your problem

Hi everyone, I just set up a new workstation and installed RELION-5.0. However, this is the first time I have encountered an MPI issue: instead of using the requested "Number of MPI processes", the job starts 3 separate "RELION MPI setup" instances within one job.


Environment:

Dataset: Tutorial dataset

Job options:

Error message:

run.out

RELION version: 5.0-beta-0-commit-739650 
Precision: BASE=double, CUDA-ACC=single 

 === RELION MPI setup ===
 + Number of MPI processes             = 1
 + Number of threads per MPI process   = 4
 + Total number of threads therefore   = 4
 + Leader  (0) runs on host            = xxx
 =================
RELION version: 5.0-beta-0-commit-739650 
Precision: BASE=double, CUDA-ACC=single 

 === RELION MPI setup ===
 + Number of MPI processes             = 1
 + Number of threads per MPI process   = 4
 + Total number of threads therefore   = 4
 + Leader  (0) runs on host            = xxx
 =================
RELION version: 5.0-beta-0-commit-739650 
Precision: BASE=double, CUDA-ACC=single 

 === RELION MPI setup ===
 + Number of MPI processes             = 1
 + Number of threads per MPI process   = 4
 + Total number of threads therefore   = 4
 + Leader  (0) runs on host            = xxx
 =================
 Running CPU instructions in double precision. 
 Running CPU instructions in double precision. 
 Running CPU instructions in double precision. 

 RELION version: 5.0-beta-0-commit-739650
 exiting with an error ...

 RELION version: 5.0-beta-0-commit-739650
 exiting with an error ...

 RELION version: 5.0-beta-0-commit-739650
 exiting with an error ...

run.err

in: /home/user/relion/src/ml_optimiser_mpi.cpp, line 788
ERROR: 
MlOptimiserMpi::initialiseWorkLoad: at least 3 MPI processes are required when splitting data into random halves
=== Backtrace  ===
/home/user/relion/build/bin//relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7d) [0x5592dd668bed]
/home/user/relion/build/bin//relion_refine_mpi(+0x4cb70) [0x5592dd5f0b70]
/home/user/relion/build/bin//relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0xc6) [0x5592dd69b876]
/home/user/relion/build/bin//relion_refine_mpi(main+0x7d) [0x5592dd656c9d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f6e5ca59083]
/home/user/relion/build/bin//relion_refine_mpi(_start+0x2e) [0x5592dd65a98e]
==================
ERROR: 
MlOptimiserMpi::initialiseWorkLoad: at least 3 MPI processes are required when splitting data into random halves
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
in: /home/user/relion/src/ml_optimiser_mpi.cpp, line 788
ERROR: 
MlOptimiserMpi::initialiseWorkLoad: at least 3 MPI processes are required when splitting data into random halves
=== Backtrace  ===
/home/user/relion/build/bin//relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7d) [0x56257770dbed]
/home/user/relion/build/bin//relion_refine_mpi(+0x4cb70) [0x562577695b70]
/home/user/relion/build/bin//relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0xc6) [0x562577740876]
/home/user/relion/build/bin//relion_refine_mpi(main+0x7d) [0x5625776fbc9d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f69b93cd083]
/home/user/relion/build/bin//relion_refine_mpi(_start+0x2e) [0x5625776ff98e]
==================
ERROR: 
MlOptimiserMpi::initialiseWorkLoad: at least 3 MPI processes are required when splitting data into random halves
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
in: /home/user/relion/src/ml_optimiser_mpi.cpp, line 788
ERROR: 
MlOptimiserMpi::initialiseWorkLoad: at least 3 MPI processes are required when splitting data into random halves
=== Backtrace  ===
/home/user/relion/build/bin//relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7d) [0x56343e98dbed]
/home/user/relion/build/bin//relion_refine_mpi(+0x4cb70) [0x56343e915b70]
/home/user/relion/build/bin//relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0xc6) [0x56343e9c0876]
/home/user/relion/build/bin//relion_refine_mpi(main+0x7d) [0x56343e97bc9d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f6f830c5083]
/home/user/relion/build/bin//relion_refine_mpi(_start+0x2e) [0x56343e97f98e]
==================
ERROR: 
MlOptimiserMpi::initialiseWorkLoad: at least 3 MPI processes are required when splitting data into random halves
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Any insights are appreciated!

Best,

David

biochem-fan commented 10 months ago

You are using a different MPI runtime from the one RELION is linked against. Run "which mpirun" and make sure it matches the MPI that CMake found at build time.

A typical cause is that you compiled RELION while CCPEM/PHENIX/conda etc. was sourced, so CMake picked up their MPI, but you are now running RELION without them sourced.
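
A minimal sanity check, outside RELION, is a tiny MPI program compiled with the MPI compiler wrapper belonging to the MPI that CMake found and launched with the mpirun you use for RELION (the wrapper name mpicc and the launcher name mpirun here are assumptions about your toolchain; substitute whatever your build actually used). If the runtime and the library match, every rank reports the full communicator size; three independent "rank 0 of 1" lines reproduce exactly the size-1 worlds seen in run.out.

```c
/* mpi_check.c -- illustrative sanity check, not part of RELION.
 *
 * Compile with the MPI wrapper that matches the library RELION was
 * linked against, e.g.:   mpicc mpi_check.c -o mpi_check
 * Run with the mpirun you use to launch RELION jobs:
 *                         mpirun -n 3 ./mpi_check
 *
 * Expected (matching runtime and library):
 *   rank 0 of 3 ..., rank 1 of 3 ..., rank 2 of 3 ...
 * Mismatch (same symptom as the RELION job):
 *   three separate "rank 0 of 1" lines.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```

If the sizes disagree, either rebuild RELION in a shell where only the intended MPI is on the PATH, or launch it with the mpirun that belongs to the MPI library it was linked against.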