3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

MPI initialization error #118

Closed: bforsbe closed this issue 6 years ago

bforsbe commented 7 years ago

Originally reported by: Abhiram Chintangal (Bitbucket: achintangal, GitHub: achintangal)


When trying to run jobs with more than one MPI process (relion_refine_mpi), I get the following error:

ompi_mpi_init: orte_init failed --> Returned "Error" (-1) instead of "Success" (0)

An error occurred in MPI_Init on a NULL communicator *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

[poissons:14318] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!

For completeness, this is the command I am trying to run (from the note.txt file):

++++ Executing new job on Tue Oct 4 14:54:49 2016
++++ with the following command(s):

`which relion_refine_mpi` --o Class3D/job046/run --i Extract/job041/particles.star --ref Import/refs.star --firstiter_cc --ini_high 60 --dont_combine_weights_via_disc --pool 3 --ctf --iter 30 --tau2_fudge 4 --particle_diameter 300 --K 2 --flatten_solvent --zero_mask --strict_highres_exp 12 --oversampling 1 --healpix_order 2 --offset_range 3 --offset_step 2 --sym C1 --norm --scale --j 1 --gpu 0,1

++++

If I run the same job with a single process (relion_refine, without MPI), it runs fine.
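
A common cause of this kind of MPI_Init failure is a mismatch between the mpirun used to launch the job and the MPI library that relion_refine_mpi was linked against. A minimal check, assuming an Open MPI installation, is:

#!bash

# Which launcher is first on the PATH, and which Open MPI version is it?
which mpirun
mpirun --version

# Which MPI shared library does the RELION binary actually link against?
ldd $(which relion_refine_mpi) | grep libmpi

If ldd reports a libmpi from a different installation than the mpirun on the PATH (or reports it as "not found"), an abort before MPI_INIT like the one above is the expected symptom.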


bforsbe commented 7 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


No worries, it's good to have documented symptoms that we know the reason for, even if it isn't a problem with relion.

@achintangal Did your issue work out as well?

bforsbe commented 7 years ago

Original comment by Matthew Belousoff (Bitbucket: mbelouso, GitHub: Unknown):


Bjoern,

So I managed to fix it. It was a problem with the MPI installation on the computer, and a fresh installation of OpenMPI fixed it. Sorry to waste your time.
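
For anyone landing here with the same symptom, a quick way to sanity-check a fresh Open MPI installation independently of RELION is a trivial MPI job; this is a generic check, not something from the original report:

#!bash

# Confirm which Open MPI installation is active; the first lines of
# ompi_info report the Open MPI version in use
which mpirun
ompi_info | head

# A trivial MPI job: should print the hostname once per rank
mpirun -n 2 hostname

If this already fails, the problem sits in the MPI installation or environment rather than in RELION; if it works, RELION typically still needs to be rebuilt in a clean build directory so that cmake picks up the new Open MPI (see the fresh-build suggestion in the next comment).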

bforsbe commented 7 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


@mbelouso The only difference I can see between the 2.0.1 version and earlier versions (with regard to MPI) is associated with --scratch_dir. What happens when you omit that?

Also, in case you haven't already done so, I would make a fresh build directory to make sure you don't have any old references or built libraries interfering with your new 2.0.1 build:

#!bash

mkdir build_201   # fresh build directory (assumed to sit in the RELION source tree)
cd build_201
cmake ..          # configure against the source root
make -j8          # build with 8 parallel jobs

As your output also explicitly states, this really indicates an error in MPI. However, if you can show that two versions of RELION (e.g. 2.0.1 and 2.0.b12), compiled in different directories on the same machine with the same session and settings, give an error from one but not the other, I will dig into what it is we are doing that causes it.
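
One way to compare what two such builds actually detected is to look at the MPI entries that CMake caches in each build directory; the second directory name below is only illustrative:

#!bash

# FindMPI caches the compiler wrappers and libraries it detected;
# these should agree between builds on the same machine
grep -E '^MPI_(C|CXX)' build_201/CMakeCache.txt
grep -E '^MPI_(C|CXX)' build_b12/CMakeCache.txt   # hypothetical directory holding the 2.0.b12 build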

bforsbe commented 7 years ago

Original comment by Matthew Belousoff (Bitbucket: mbelouso, GitHub: Unknown):


I have the same issue. The 'alpha' versions were working fine, but as soon as I built the open beta version, these are the errors I get (see the run.err output below). There is nothing odd when I run cmake, it finds OpenMPI with no dramas, and it makes no difference if I run it with mpirun directly from the command line; the same problem occurs.

Command:

#!bash

mpirun -n 7 `which relion_refine_mpi` --o Class2D/job019/run --i Select/1stCut/particles.star --dont_combine_weights_via_disc --scratch_dir /home/scratch --pool 30 --ctf  --iter 20 --tau2_fudge 2 --particle_diameter 320 --K 20 --flatten_solvent  --zero_mask  --oversampling 1 --psi_step 10 --offset_range 5 --offset_step 2 --norm --scale  --j 1 --gpu  --dont_check_norm

Error output (run.err):

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      LithgowServer
Framework: ess
Component: pmi
--------------------------------------------------------------------------
[LithgowServer:46568] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 116
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_open failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[LithgowServer:46568] *** An error occurred in MPI_Init
[LithgowServer:46568] *** on a NULL communicator
[LithgowServer:46568] *** Unknown error
[LithgowServer:46568] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: LithgowServer
  PID:        46568
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      LithgowServer
Framework: ess
Component: pmi
--------------------------------------------------------------------------
[LithgowServer:46569] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 116
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_open failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[LithgowServer:46569] *** An error occurred in MPI_Init
[LithgowServer:46569] *** on a NULL communicator
[LithgowServer:46569] *** Unknown error
[LithgowServer:46569] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: LithgowServer
  PID:        46569
--------------------------------------------------------------------------
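
The "Framework: ess / Component: pmi" message in the log above is produced by Open MPI itself rather than by RELION. Two things worth checking, offered here only as a guess, are stale MCA settings in the environment and a second MPI installation on the PATH:

#!bash

# OMPI_MCA_* variables left over from a cluster module or an old job
# script can force Open MPI to load components (such as ess/pmi) that
# the current installation was not built with
env | grep '^OMPI_MCA'

# More than one mpirun on the PATH usually means a second MPI
# installation is shadowing the one RELION was built against
type -a mpirun
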
bforsbe commented 7 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Sounds like an issue with MPI specifically. Try executing directly on the command line, e.g.

mpirun -n 3 relion_refine_mpi ..... --j 1

and check the cmake configuration step of your installation for clues about which MPI version was detected.
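
For that configuration check, the relevant lines are the standard FindMPI output from the configure step; for example, from an already-configured build directory (reusing the build_201 name from the earlier comment):

#!bash

# Re-run the configuration step and keep only the MPI-related lines
# (e.g. "Found MPI_C: ..." and the detected compiler wrappers)
cd build_201 && cmake .. 2>&1 | grep -i mpi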

HamishGBrown commented 3 years ago

Just had this issue too (relion 3.1.1), and my issue turned out to be that EMAN2 happens to ship with its own version of mpirun, which was on my PATH. Commenting out the EMAN2 lines in my .bashrc file was enough to get the relion programs working again.
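
For anyone with a similar setup, a quick way to see whether EMAN2's bundled mpirun is the one being picked up (a generic shell check, with an illustrative EMAN2 path) is:

#!bash

# List every mpirun the shell can see, in PATH order; an EMAN2 entry
# (e.g. .../eman2/bin/mpirun) ahead of the system/Open MPI one will
# launch RELION's MPI programs with the wrong runtime
type -a mpirun

An alternative to commenting out the EMAN2 lines is to keep the directory containing the system mpirun ahead of EMAN2's in PATH, or to source the EMAN2 environment only in the shells that need it.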

Only commenting since @bforsbe pointed out above:

"No worries, it's good to have documented symptoms that we know the reason for, even if it isn't a problem with relion."

I'll need to cogitate about how to get my relion and EMAN2 installs to live happily next to each other...