3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

MultiBody refinement MPI message truncated Error - relion 5.0 #1052

Open AssmannG opened 9 months ago

AssmannG commented 9 months ago

Dear All,

Describe your problem

A user of our HPC setup ran into this problem (I reproduced the error): MultiBody refinement runs up to about iteration 15 and then fails with an MPI_ERR_TRUNCATE error.

Following the suggestion in Issue #669, I tried running with "Combine iterations through disc?: Yes" in the Compute tab, but it did not help: the job failed at iteration 16 and caused the node to crash.

Environment:

Job options:

Error message:

  3: MPI_ERR_TRUNCATE: message truncated
  3: MPI_ERR_TRUNCATE: message truncated
in: /var/tmp/assman_g/relion-5.0-beta/src/src/mpi.cpp, line 495
ERROR: 
Encountered an MPI-related error, see above. Now exiting...
terminate called after throwing an instance of 'RelionError'

relion_refine_mpi:27674 terminated with signal 6 at PC=2b6ae17d1387 SP=7ffde7459eb8.  Backtrace:
/usr/lib64/libc.so.6(gsignal+0x37)[0x2b6ae17d1387]
/usr/lib64/libc.so.6(abort+0x148)[0x2b6ae17d2a78]
/opt/psi/Programming/gcc/10.4.0/lib64/libstdc++.so.6(+0x995ec)[0x2b6ae0d0e5ec]
/opt/psi/Programming/gcc/10.4.0/lib64/libstdc++.so.6(+0xa4806)[0x2b6ae0d19806]
/opt/psi/Programming/gcc/10.4.0/lib64/libstdc++.so.6(+0xa4871)[0x2b6ae0d19871]
/opt/psi/Programming/gcc/10.4.0/lib64/libstdc++.so.6(+0xa4b04)[0x2b6ae0d19b04]
/opt/psi/EM/relion/5.0-beta/bin/relion_refine_mpi[0x44db76]
srun: error: merlin-g-009: task 3: Exited with exit code 1
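
For readers unfamiliar with this error class: in MPI, MPI_ERR_TRUNCATE is reported when an incoming message is longer than the buffer the receiving rank posted for it, which means the sending and receiving ranks disagreed about the message size. The sketch below is only a toy model of that rule in plain Python. It is not RELION's code and makes no real MPI calls, and the names `MessageTruncated`, `receive`, and `try_receive` are hypothetical. It says nothing about why the sizes differ in this particular job.

```python
class MessageTruncated(Exception):
    """Stand-in for the MPI_ERR_TRUNCATE error class (toy model only)."""


def receive(posted_count, message):
    """Mimic the MPI receive rule for a buffer of posted_count elements.

    A message shorter than or equal to the buffer is accepted;
    a longer one is an error rather than being silently cut off.
    """
    if len(message) > posted_count:
        raise MessageTruncated(
            f"message of {len(message)} elements does not fit "
            f"a receive buffer of {posted_count}"
        )
    return list(message)


def try_receive(posted_count, message):
    """Return 'ok' or 'truncated' instead of raising, for easy checking."""
    try:
        receive(posted_count, message)
        return "ok"
    except MessageTruncated:
        return "truncated"
```

In this model, `try_receive(4, [1, 2, 3])` is accepted because the buffer is larger than the message, while `try_receive(2, [1, 2, 3])` is rejected because the receiver expected fewer elements than were sent.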

biochem-fan commented 9 months ago

Because I could not reproduce this issue, it is very difficult to debug. Potential workarounds:

You can combine both.

AssmannG commented 9 months ago

Thanks, I will try!