I2PC / scipion-em-xmipp

Plugin to use Xmipp programs within the Scipion framework
GNU General Public License v3.0

movie extract particles crashes with MPI #253

Closed azazellochg closed 2 years ago

azazellochg commented 7 years ago

Scipion 1.1. Beta-gal mrcs movies from the Relion tutorial. After a few minutes the protocol is still shown as running, but it is not actually running (there is a dump file core.***** in the project folder). It crashed both on the cluster and on a local machine. When running with threads, everything is fine. I wonder whether anyone else has encountered this, or whether it is specific to my MPI installation?

Log file:

[fmg09:06471] *** Process received signal ***
[fmg09:06471] Signal: Segmentation fault (11)
[fmg09:06471] Signal code: Address not mapped (1)
[fmg09:06471] Failing at address: 0x10
[fmg09:06471] [ 0] /lib64/libpthread.so.0() [0x3655e0f7e0]
[fmg09:06471] [ 1] /net/nfs1/public/EM/OpenMPI/openmpi-1.5.4/build/lib/openmpi/mca_btl_sm.so(+0x22b6) [0x2b716267a2b6]
[fmg09:06471] [ 2] /net/nfs1/public/EM/OpenMPI/openmpi-1.5.4/build/lib/openmpi/mca_pml_ob1.so(+0xded5) [0x2b7161e64ed5]
[fmg09:06471] [ 3] /net/nfs1/public/EM/OpenMPI/openmpi-1.5.4/build/lib/openmpi/mca_pml_ob1.so(+0x61d8) [0x2b7161e5d1d8]
[fmg09:06471] [ 4] /public/EM/OpenMPI/openmpi/lib/libmpi.so.1(MPI_Isend+0x164) [0x2b715f48a604]
[fmg09:06471] [ 5] /lmb/home/gsharov/soft/scipion/software/lib/python2.7/site-packages/mpi4py/MPI.so(+0xc3375) [0x2b715f1f0375]
[fmg09:06471] [ 6] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x60fb) [0x2b7158c0b0fb]
[fmg09:06471] [ 7] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x63ae) [0x2b7158c0b3ae]
[fmg09:06471] [ 8] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x88e) [0x2b7158c0c4ae]
[fmg09:06471] [ 9] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x565a) [0x2b7158c0a65a]
[fmg09:06471] [10] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x88e) [0x2b7158c0c4ae]
[fmg09:06471] [11] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(+0x77778) [0x2b7158b8a778]
[fmg09:06471] [12] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyObject_Call+0x53) [0x2b7158b5b1a3]
[fmg09:06471] [13] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x414a) [0x2b7158c0914a]
[fmg09:06471] [14] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x88e) [0x2b7158c0c4ae]
[fmg09:06471] [15] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x565a) [0x2b7158c0a65a]
[fmg09:06471] [16] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x63ae) [0x2b7158c0b3ae]
[fmg09:06471] [17] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x88e) [0x2b7158c0c4ae]
[fmg09:06471] [18] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(+0x77671) [0x2b7158b8a671]
[fmg09:06471] [19] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyObject_Call+0x53) [0x2b7158b5b1a3]
[fmg09:06471] [20] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x414a) [0x2b7158c0914a]
[fmg09:06471] [21] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x88e) [0x2b7158c0c4ae]
[fmg09:06471] [22] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x565a) [0x2b7158c0a65a]
[fmg09:06471] [23] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x88e) [0x2b7158c0c4ae]
[fmg09:06471] [24] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x565a) [0x2b7158c0a65a]
[fmg09:06471] [25] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x63ae) [0x2b7158c0b3ae]
[fmg09:06471] [26] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x63ae) [0x2b7158c0b3ae]
[fmg09:06471] [27] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x88e) [0x2b7158c0c4ae]
[fmg09:06471] [28] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(+0x77671) [0x2b7158b8a671]
[fmg09:06471] [29] /lmb/home/gsharov/soft/scipion/software/lib/libpython2.7.so.1.0(PyObject_Call+0x53) [0x2b7158b5b1a3]
[fmg09:06471] *** End of error message ***
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 6459 on
node fmg09.lmb.internal exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/lmb/home/gsharov/soft/scipion/pyworkflow/apps/pw_protocol_run.py", line 49, in <module>
    runProtocolMain(projPath, dbPath, protId)
  File "/lmb/home/gsharov/soft/scipion/pyworkflow/protocol/protocol.py", line 1739, in runProtocolMain
    hostConfig=hostConfig)
  File "/lmb/home/gsharov/soft/scipion/pyworkflow/utils/process.py", line 51, in runJob
    return runCommand(command, env, cwd)
  File "/lmb/home/gsharov/soft/scipion/pyworkflow/utils/process.py", line 65, in runCommand
    check_call(command, shell=True, stdout=sys.stdout, stderr=sys.stderr, env=env, cwd=cwd)
  File "/lmb/home/gsharov/soft/scipion/software/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'mpiexec -mca orte_forward_job_control 1 -n 12  `which /lmb/home/gsharov/soft/scipion/scipion` "runprotocol" "pw_protocol_mpirun.py" "/lmb/home/gsharov/beegfs2/ScipionUserData/projects/relion_tutorial" "Runs/011197_XmippProtExtractMovieParticles/logs/run.db" "11197"' returned non-zero exit status 139
delarosatrevin commented 7 years ago

Can you check if other protocols that use MPI for the parallelization of steps (STEPS_PARALLEL) also fail on these systems? (e.g., movie alignment or CTF estimation)

azazellochg commented 7 years ago

@delarosatrevin both ctffind4 and unblur seem to work with both threads and MPI, locally and on the cluster

delarosatrevin commented 7 years ago

Hmm... that's even weirder

azazellochg commented 6 years ago

Now it also crashes with many threads or MPI processes, locally or on the cluster. The weird thing is that it works with 4 threads, but as soon as I use 6-8 or more, it fails in the middle of the run. The process just dies, with no error or anything. I will try to debug it further when I find time.

azazellochg commented 6 years ago

could be related to:

From your error log it seems that there is a bug in this Xmipp protocol, at this line:

00095: File "/beebylab/software/scipion/1.2/scipion/pyworkflow/em/packages/xmipp3/protocol_extract_particles_movies.py", line 434, in _filterMovie
00096:     micrograph = micSet[movieId]

There, micSet (internally a .sqlite database) is accessed from multiple threads. I think this should be fixed by removing the query as it is done now. In the meantime, as David suggested, you could try to run with only 1 processor, though I am not sure how long that will take.

I'm sorry for this issue. Best, Jose Miguel
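For reference, here is a minimal, generic sketch of the pattern suggested above: query the sqlite-backed set once in the main thread and share a plain dict with the workers, instead of calling micSet[movieId] from inside every worker thread. The database path, table layout and filter_movie() helper below are invented for this illustration and are not the actual Scipion/Xmipp code from the traceback.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical illustration only: the demo database, table layout and
# filter_movie() helper are made up for this sketch. The point is the
# pattern: do the sqlite query once, up front, instead of letting every
# worker thread hit the shared sqlite-backed set.

DB_PATH = "micrographs_demo.sqlite"  # hypothetical demo database

def make_demo_db(path):
    """Create a tiny stand-in for the .sqlite file backing micSet."""
    conn = sqlite3.connect(path)
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS micrographs "
                     "(id INTEGER PRIMARY KEY, filename TEXT)")
        conn.executemany("INSERT OR REPLACE INTO micrographs VALUES (?, ?)",
                         [(i, "mic_%03d.mrc" % i) for i in range(1, 25)])
    conn.close()

def preload_micrographs(path):
    """Read every row once, in the main thread, before any workers start."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute("SELECT id, filename FROM micrographs").fetchall()
    finally:
        conn.close()
    # A plain dict is safe to share read-only across worker threads.
    return dict(rows)

def filter_movie(movie_id, micrographs):
    # Instead of micSet[movieId] (a query against the shared sqlite-backed
    # set) inside each worker, look the micrograph up in the pre-loaded dict.
    return micrographs[movie_id]

if __name__ == "__main__":
    make_demo_db(DB_PATH)
    micrographs = preload_micrographs(DB_PATH)
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda i: filter_movie(i, micrographs),
                                sorted(micrographs)))
    print(len(results), "micrographs processed")
```

An alternative that keeps the per-item query would be to guard every access to the shared set with a threading.Lock, but pre-loading avoids the contention entirely for read-only lookups.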

delarosatrevin commented 6 years ago

Is there any part of the code that tries to grab an item from the input set? I'm not sure the error is the same; the one here seems more related to MPI, and there is no SQLite error there... anyway, just a guess.

albertmena commented 2 years ago

Outdated. The ExtractMovieParticles protocol will be deprecated.