3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

3D classification and 3D autoRefine segmentation faults #647

Closed shahpnmlab closed 4 years ago

shahpnmlab commented 4 years ago

Describe your problem

I am trying to run 3D classification and refinement on sub-volumes using RELION 3.0.7, compiled for our CPU-based cluster. I have extracted the volumes in 64 px, 128 px and 256 px boxes. 3D classification runs on the 64 px sub-volumes but not on the 128 px or 256 px boxes, and 3D auto-refine fails on all three box sizes. I have modified the particles.star file for the 64 px boxes so that the 4 stray particles are included in a single group (see the error message and the suggested fix below). Despite my best efforts, I am unable to run these jobs successfully.
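For readers facing the same small-group warning, the regrouping that RELION's warning suggests (giving particles from sparsely populated micrographs the same rlnMicrographName as another group) can be sketched as below. This is only an illustration under assumptions: the function name, the threshold of 10 particles and the labels `tomo1` to `tomo3` are hypothetical, the labels are handled as a plain Python list rather than read from a STAR file, and choosing a sensible target group (for example one with similar defocus) is left to the user.

```python
from collections import Counter


def merge_small_groups(micrograph_names, min_particles, target):
    """Relabel every group smaller than min_particles as `target`.

    micrograph_names: one group label per particle (the values that would
    sit in the rlnMicrographName column).
    target: an existing, well-populated group chosen by the user.
    """
    counts = Counter(micrograph_names)
    return [
        target if counts[name] < min_particles else name
        for name in micrograph_names
    ]


# Hypothetical labels: two well-populated groups and one with 4 stray particles.
names = ["tomo1"] * 20 + ["tomo2"] * 4 + ["tomo3"] * 15
merged = merge_small_groups(names, min_particles=10, target="tomo1")
print(Counter(merged))
```

The relabelled list would then have to be written back into the rlnMicrographName column of the STAR file, which this sketch does not do.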

Environment:

Dataset:

Job options:

3D classification with the 128px/256px sub-volumes

 mpirun -np 33 `which relion_refine_mpi` --o Class3D/job011/run --i Import/job007/allParticles128px.star --ref Import/job008/initModel128.mrc --firstiter_cc --ini_high 50 --dont_combine_weights_via_disc --pool 1 --pad 2  --ctf --iter 25 --tau2_fudge 4 --particle_diameter 915 --K 4 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym I1 --norm --scale  --j 4

Error message:

Please cite the full error message as the example below.

-catch_rsh -catch_hostname /var/spool/default/compc074/active_jobs/27397367.1/pe_hostfile
compc074
compc074
compc074
compc074
compc075
compc075
compc075
compc075
compc081
compc081
compc081
compc081
compc093
compc093
compc093
compc093
compc076
compc076
compc076
compc076
compc088
compc088
compc088
compc088
compc086
compc086
compc086
compc086
compc082
compc082
compc082
RELION version: 3.0.7
Precision: BASE=double, VECTOR-ACC=single

 === RELION MPI setup ===
 + Number of MPI processes             = 31
 + Master  (0) runs on host            = compc074
 + Slave     1 runs on host            = compc074
 + Slave     2 runs on host            = compc074
 + Slave     3 runs on host            = compc074.
 + Slave     7 runs on host            = compc075.
 + Slave     4 runs on host            = compc075
 + Slave     9 runs on host            = compc081.
 + Slave     5 runs on host            = compc075.
 + Slave    10 runs on host            = compc081
 + Slave     6 runs on host            = compc075.
 + Slave    11 runs on host            = compc081.
 + Slave     8 runs on host            = compc081.
 =================
 + Slave    12 runs on host            = compc093.
 + Slave    15 runs on host            = compc093.
 + Slave    28 runs on host            = compc082.
 + Slave    13 runs on host            = compc093.
 + Slave    14 runs on host            = compc093.
 + Slave    16 runs on host            = compc076
 + Slave    29 runs on host            = compc082.
 + Slave    26 runs on host            = compc086.
 + Slave    17 runs on host            = compc076.
 + Slave    30 runs on host            = compc082.
 + Slave    27 runs on host            = compc086
 + Slave    20 runs on host            = compc088
 + Slave    18 runs on host            = compc076.
 + Slave    24 runs on host            = compc086.
 + Slave    21 runs on host            = compc088.
 + Slave    19 runs on host            = compc076
 + Slave    25 runs on host            = compc086
 + Slave    22 runs on host            = compc088.
 + Slave    23 runs on host            = compc088
 Running CPU instructions in double precision.
 Estimating initial noise spectra
  12/  12 sec ............................................................~~(,_,">
WARNING: There are only 4 particles in group 2
WARNING: You may want to consider joining some micrographs into larger groups to obtain more robust noise estimates.
         You can do so by using the same rlnMicrographName for particles from multiple different micrographs in the input STAR file.
         It is then best to join micrographs with similar defocus values and similar apparent signal-to-noise ratios.
 CurrentResolution= 50.5263 Angstroms, which requires orientationSampling of at least 6.31579 degrees for a particle of diameter 915 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 23328
 OrientationalSampling= 15 NrOrientations= 72
 TranslationalSampling= 2 NrTranslations= 81
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 1492992
 OrientationalSampling= 7.5 NrOrientations= 576
 TranslationalSampling= 1 NrTranslations= 648
=============================
 Expectation iteration 1 of 25
  22/  22 sec ............................................................~~(,_,">....~~(,_,">
 Maximization ...
  19/  19 sec ............................................................~~(,_,">
 Estimating accuracies in the orientational assignment ...
   9/   9 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 0.05 degrees; offsets= 0.1 pixels
 CurrentResolution= 50.5263 Angstroms, which requires orientationSampling of at least 6.31579 degrees for a particle of diameter 915 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 23328
 OrientationalSampling= 15 NrOrientations= 72
 TranslationalSampling= 2 NrTranslations= 81
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 1492992
 OrientationalSampling= 7.5 NrOrientations= 576
 TranslationalSampling= 1 NrTranslations= 648
=============================
 Expectation iteration 2 of 25
  38/  38 sec ............................................................~~(,_,">....~~(,_,">
 Maximization ...
  28/  28 sec ............................................................~~(,_,">
 Estimating accuracies in the orientational assignment ...
   8/   8 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 0.05 degrees; offsets= 0.1 pixels
 CurrentResolution= 45.7143 Angstroms, which requires orientationSampling of at least 5.71429 degrees for a particle of diameter 915 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 23328
 OrientationalSampling= 15 NrOrientations= 72
 TranslationalSampling= 2 NrTranslations= 81
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 1492992
 OrientationalSampling= 7.5 NrOrientations= 576
 TranslationalSampling= 1 NrTranslations= 648
=============================
 Expectation iteration 3 of 25
2.85/2.85 min ............................................................~~(,_,">....~~(,_,">
 Maximization ...
  27/  27 sec ............................................................~~(,_,">
 Estimating accuracies in the orientational assignment ...
   8/   8 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 0.05 degrees; offsets= 0.1 pixels
 CurrentResolution= 36.9231 Angstroms, which requires orientationSampling of at least 4.61538 degrees for a particle of diameter 915 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 23328
 OrientationalSampling= 15 NrOrientations= 72
 TranslationalSampling= 2 NrTranslations= 81
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 1492992
 OrientationalSampling= 7.5 NrOrientations= 576
 TranslationalSampling= 1 NrTranslations= 648
=============================
 Expectation iteration 4 of 25
1.95/1.95 min ............................................................~~(,_,">....~~(,_,">
 Maximization ...
  28/  28 sec ............................................................~~(,_,">
 Estimating accuracies in the orientational assignment ...
   9/   9 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 0.05 degrees; offsets= 0.1 pixels
 CurrentResolution= 33.1034 Angstroms, which requires orientationSampling of at least 4.13793 degrees for a particle of diameter 915 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 23328
 OrientationalSampling= 15 NrOrientations= 72
 TranslationalSampling= 2 NrTranslations= 81
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 1492992
 OrientationalSampling= 7.5 NrOrientations= 576
 TranslationalSampling= 1 NrTranslations= 648
=============================
 Expectation iteration 5 of 25
2.15/2.15 min ............................................................~~(,_,">....~~(,_,">
 Maximization ...
  33/  33 sec ............................................................~~(,_,">
 Estimating accuracies in the orientational assignment ...
  10/  10 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 0.05 degrees; offsets= 0.1 pixels
 CurrentResolution= 25.2632 Angstroms, which requires orientationSampling of at least 3.15789 degrees for a particle of diameter 915 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 23328
 OrientationalSampling= 15 NrOrientations= 72
 TranslationalSampling= 2 NrTranslations= 81
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 1492992
 OrientationalSampling= 7.5 NrOrientations= 576
 TranslationalSampling= 1 NrTranslations= 648
=============================
 Expectation iteration 6 of 25
3.80/3.80 min ............................................................~~(,_,">....~~(,_,">
 Maximization ...
  53/  53 sec ............................................................~~(,_,">
 Estimating accuracies in the orientational assignment ...
  10/  10 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 0.05 degrees; offsets= 0.1 pixels
 CurrentResolution= 24.6154 Angstroms, which requires orientationSampling of at least 3.07692 degrees for a particle of diameter 915 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 23328
 OrientationalSampling= 15 NrOrientations= 72
 TranslationalSampling= 2 NrTranslations= 81
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 1492992
 OrientationalSampling= 7.5 NrOrientations= 576
 TranslationalSampling= 1 NrTranslations= 648
=============================
 Expectation iteration 7 of 25
3.65/3.65 min ............................................................~~(,_,">....~~(,_,">
 Maximization ...
  57/  57 sec ............................................................~~(,_,">
 Estimating accuracies in the orientational assignment ...
   3/  13 sec .............~~(,_,">[compc074:18051] *** Process received signal ***
[compc074:18051] Signal: Segmentation fault (11)
[compc074:18051] Signal code: Address not mapped (1)
[compc074:18051] Failing at address: 0x2094000
[compc074:18051] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b7d4be4b5d0]
[compc074:18051] [ 1] relion/3.0.7-gcc5.4.0-skylake-libtiff4.0.10/bin/relion_refine_mpi(_ZN11MlOptimiser30calculateExpectedAngularErrorsEll+0xb09)[0x5f68b9]
[compc074:18051] [ 2] relion/3.0.7-gcc5.4.0-skylake-libtiff4.0.10/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x27c1)[0x45f591]
[compc074:18051] [ 3] relion/3.0.7-gcc5.4.0-skylake-libtiff4.0.10/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xba)[0x46dc5a]
[compc074:18051] [ 4] relion/3.0.7-gcc5.4.0-skylake-libtiff4.0.10/bin/relion_refine_mpi(main+0x78)[0x4334e8]
[compc074:18051] [ 5] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b7d4c07a3d5]
[compc074:18051] [ 6] relion/3.0.7-gcc5.4.0-skylake-libtiff4.0.10/bin/relion_refine_mpi[0x433f7f]
[compc074:18051] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 18051 on node compc074 exited on signal 11 (Segmentation fault).
------------------------------------------------------
--------------------------------------------------------------------------
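As a sanity check on the sampling figures in the log above, the reported NrHiddenVariableSamplingPoints values equal NrOrientations × NrTranslations × 4. Reading the factor of 4 as the four classes requested with --K 4 is an assumption inferred from the numbers, not something confirmed from the RELION source; the helper name below is hypothetical.

```python
def hidden_variable_samples(n_orientations, n_translations, n_classes):
    """Product of the sampled orientations, translations and classes."""
    return n_orientations * n_translations * n_classes


# Values copied from the log: coarse pass (Oversampling= 0) and fine pass (Oversampling= 1).
coarse = hidden_variable_samples(72, 81, 4)
fine = hidden_variable_samples(576, 648, 4)
print(coarse, fine)
```

The two products match the 23328 and 1492992 sampling points printed in the log, so the sampling set-up itself looks internally consistent before the crash.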

3D auto refine with 64px box

mpirun -np 33 `which relion_refine_mpi` --o Refine3D/job016/run --auto_refine --split_random_halves --i Select/job013/particles.star --ref Class3D/job006/run_it025_class001.mrc --ini_high 50 --dont_combine_weights_via_disc --pool 3 --pad 2  --ctf --ctf_corrected_ref --particle_diameter 920 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym I1 --low_resol_join_halves 40 --norm --scale  --j 4

(I have tried with 128 cores and 4 threads and I get the same error.)

-catch_rsh -catch_hostname /var/spool/default/compc104/active_jobs/27398313.1/pe_hostfile
compc104
compc104
compc104
compc104
compc078
compc078
compc078
compc078
compc074
compc074
compc074
compc074
compc087
compc087
compc087
compc087
compc090
compc090
compc090
compc090
compc093
compc093
compc093
compc093
compc084
compc084
compc084
compc084
compc076
compc076
compc076
compc076
compc077
RELION version: 3.0.7
Precision: BASE=double, VECTOR-ACC=single

 === RELION MPI setup ===
 + Number of MPI processes             = 33
 + Master  (0) runs on host            = compc104.
 + Slave     1 runs on host            = compc104.
 + Slave     2 runs on host            = compc104.
 + Slave     3 runs on host            = compc104.
 + Slave    12 runs on host            = compc087.
 + Slave     6 runs on host            = compc078.
 + Slave     7 runs on host            = compc078.
 + Slave     4 runs on host            = compc078.
 + Slave     5 runs on host            = compc078.
 + Slave    10 runs on host            = compc074.
 + Slave    11 runs on host            = compc074.
 + Slave     8 runs on host            = compc074
 + Slave     9 runs on host            = compc074.
 + Slave    14 runs on host            = compc087.
 + Slave    15 runs on host            = compc087
 + Slave    13 runs on host            = compc087.
 =================
 + Slave    32 runs on host            = compc077.
 + Slave    25 runs on host            = compc084.
 + Slave    23 runs on host            = compc093.
 + Slave    29 runs on host            = compc076.
 + Slave    26 runs on host            = compc084.
 + Slave    18 runs on host            = compc090.
 + Slave    20 runs on host            = compc093.
 + Slave    21 runs on host            = compc093
 + Slave    30 runs on host            = compc076
 + Slave    27 runs on host            = compc084
 + Slave    19 runs on host            = compc090.
 + Slave    31 runs on host            = compc076.
 + Slave    24 runs on host            = compc084.
 + Slave    16 runs on host            = compc090.
 + Slave    22 runs on host            = compc093.
 + Slave    28 runs on host            = compc076
 + Slave    17 runs on host            = compc090
 Running CPU instructions in double precision.
 Estimating initial noise spectra
   0/   0 sec ............................................................~~(,_,">
WARNING: There are only 3 particles in group 1 of half-set 1
WARNING: You may want to consider joining some micrographs into larger groups to obtain more robust noise estimates.
         You can do so by using the same rlnMicrographName for particles from multiple different micrographs in the input STAR file.
         It is then best to join micrographs with similar defocus values and similar apparent signal-to-noise ratios.
 Auto-refine: Iteration= 1
 Auto-refine: Resolution= 50.5263 (no gain for 0 iter)
 Auto-refine: Changes in angles= 999 degrees; and in offsets= 999 pixels (no gain for 0 iter)
 Estimating accuracies in the orientational assignment ...
   0/   0 sec ......~~(,_,">[compc104:27849] *** Process received signal ***  [oo]
[compc104:27849] Signal: Segmentation fault (11)
[compc104:27849] Signal code: Address not mapped (1)
[compc104:27849] Failing at address: 0x2c8f000
[compc104:27849] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b24733b25d0]
[compc104:27849] [ 1] relion/3.0.7-gcc5.4.0-skylake-libtiff4.0.10/bin/relion_refine_mpi(_ZN11MlOptimiser30calculateExpectedAngularErrorsEll+0xb09)[0x5f68b9]
[compc104:27849] [ 2] relion/3.0.7-gcc5.4.0-skylake-libtiff4.0.10/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x27c1)[0x45f591]
[compc104:27849] [ 3] relion/3.0.7-gcc5.4.0-skylake-libtiff4.0.10/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xba)[0x46dc5a]
[compc104:27849] [ 4] relion/3.0.7-gcc5.4.0-skylake-libtiff4.0.10/bin/relion_refine_mpi(main+0x78)[0x4334e8]
[compc104:27849] [ 5] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b24735e13d5]
[compc104:27849] [ 6] relion/3.0.7-gcc5.4.0-skylake-libtiff4.0.10/bin/relion_refine_mpi[0x433f7f]
[compc104:27849] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 27849 on node compc104 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
biochem-fan commented 4 years ago

What happens if you reduce the number of MPI per node to exclude the possibility of running out of memory?

P.S. We no longer support RELION 3.0.x. Please update to 3.1.0.

shahpnmlab commented 4 years ago

Hi Takanori, I have tried running it on fewer MPIs like 9,7,5 with 1, 2 and 4 threads. The problem persists no matter which version of RELION I use, including the latest stable release (3.1.0-commit-1e738e).

shahpnmlab commented 4 years ago

Setting --maxsig 2000 didn't help either...

biochem-fan commented 4 years ago

Using --maxsig to solve memory problems is valid only for GPU.

fewer MPIs like 9,7,5

Please try 1 MPI per node.
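To get a feel for why the number of MPI processes per node matters for sub-volumes, here is a rough back-of-the-envelope estimate of the size of a single padded 3D array at each box size. It assumes a full cubic array of (pad × box)^3 complex double-precision values (16 bytes each), with --pad 2 as in the commands above. This is not RELION's actual memory accounting, which keeps several such arrays per process and may store them differently; the function name is hypothetical.

```python
def padded_volume_gib(box_px, pad=2, bytes_per_value=16):
    """Size in GiB of one cubic array with side box_px * pad."""
    side = box_px * pad
    return side ** 3 * bytes_per_value / 2 ** 30


# Box sizes mentioned in this thread.
for box in (64, 128, 256):
    print(f"{box} px box: {padded_volume_gib(box):.5f} GiB per padded array")
```

Under these assumptions each doubling of the box size multiplies the footprint by eight, so four processes sharing a node need far more headroom at 256 px than at 64 px.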

biochem-fan commented 4 years ago

Anyway, I cannot help because I am not involved with subtomo functionality at all. @joton might be able to help.

shahpnmlab commented 4 years ago

No worries! Thanks for looping @joton in...

joton commented 4 years ago

Thanks @biochem-fan, we're now in contact by email as well.

I don't remember having such a problem during angular error estimation, either when I was using 3.0 or now in 3.1.0. Anyway, let's first try to repeat the process using RELION 3.1.0. If the problem persists, it has to be somehow related to @shahpnmlab's pipeline.