3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Relion 4.0.1 | After Iteration 01, Reference is updated with NaNs #1065

Open jacob-r-anderson opened 10 months ago

jacob-r-anderson commented 10 months ago

Describe your problem

I have been using the Relion 4.0.1 Pseudosubtomogram pipeline on a large dataset for several months. Although jobs with identical inputs succeeded in the past, I now invariably hit the error below. The error occurs reproducibly, immediately after the first iteration. A similar post was made for 2DC in 2018, but the troubleshooting there did not solve this problem.

Bizarrely, I see this problem even when I use the EXACT same parameters as jobs run two months ago that did not throw the error.

What would cause an error to be thrown now, and not then, with identical parameters and inputs?

ERROR: 
No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 

             github.com/3dem/relion/issues  

Environment:

Dataset:

Example of a command that tripped an error: `which relion_refine_mpi` --o Refine3D/job1155/run --auto_refine --split_random_halves --ios ReconstructParticleTomo/job915/optimisation_set.star --solvent_correct_fsc --ref ReconstructParticleTomo/job915/merged.mrc --firstiter_cc --ini_high 40 --dont_combine_weights_via_disc --pool 3 --pad 1 --auto_ignore_angles --auto_resol_angles --ctf --particle_diameter 2000 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 5 --auto_local_healpix_order 5 --offset_range 4 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 4 --gpu "" --pipeline_control Refine3D/job1155/

Example of the same command that did not trip an error: `which relion_refine_mpi` --o Refine3D/job916/run --auto_refine --split_random_halves --ios ReconstructParticleTomo/job915/optimisation_set.star --solvent_correct_fsc --ref ReconstructParticleTomo/job915/merged.mrc --firstiter_cc --ini_high 40 --dont_combine_weights_via_disc --pool 3 --pad 1 --auto_ignore_angles --auto_resol_angles --ctf --particle_diameter 2000 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 5 --auto_local_healpix_order 5 --offset_range 4 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 4 --gpu "" --pipeline_control Refine3D/job916/
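
Because the report rests on the two commands being identical apart from the output job directory, one way to check this mechanically is to tokenize both and list the positions that differ. The sketch below is illustrative only: `tokenize` and `diff_commands` are hypothetical helpers, not part of RELION, and the shortened commands stand in for the full ones above.

```python
import shlex
from itertools import zip_longest

def tokenize(command):
    """Split a shell command into tokens, dropping stray backticks."""
    return shlex.split(command.replace("`", ""))

def diff_commands(cmd_a, cmd_b):
    """Return (position, token_a, token_b) for every token that differs."""
    pairs = zip_longest(tokenize(cmd_a), tokenize(cmd_b))
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]

# Shortened stand-ins for the failing and the succeeding command.
failing = ('relion_refine_mpi --o Refine3D/job1155/run --ini_high 40 '
           '--gpu "" --pipeline_control Refine3D/job1155/')
working = ('relion_refine_mpi --o Refine3D/job916/run --ini_high 40 '
           '--gpu "" --pipeline_control Refine3D/job916/')

for position, left, right in diff_commands(failing, working):
    print(position, left, right)
# 2 Refine3D/job1155/run Refine3D/job916/run
# 8 Refine3D/job1155/ Refine3D/job916/
```

If only the `--o` and `--pipeline_control` values show up, the arguments match. This says nothing about whether the files those arguments point to have changed on disk since the earlier run.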

Error message:


fn_img= PseudoSubtomo/job920/Subtomograms/TS052/542_data.mrc
 img_id= 0 adaptive_fraction= 0.999
 min_diff2= 3.40282e+38

 fn_img= PseudoSubtomo/job920/Subtomograms/TS113/715_data.mrc
 img_id= 0 adaptive_fraction= 0.999
 min_diff2= 3.40282e+38

 fn_img= PseudoSubtomo/job920/Subtomograms/TS099/906_data.mrc
 img_id= 0 adaptive_fraction= 0.999
 min_diff2= 3.40282e+38
Dumped data: error_dump_pdf_orientation, error_dump_pdf_orientation and error_dump_unsorted.
in: /tmp/sbgrid/spack-stage/spack-stage-relion-4.0.1-2orbjvmut3nz35g76te35djfkczeh3fg/spack-src/src/acc/acc_ml_optimiser_impl.h, line 1858
ERROR: 
No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 

             github.com/3dem/relion/issues  

in: /tmp/sbgrid/spack-stage/spack-stage-relion-4.0.1-2orbjvmut3nz35g76te35djfkczeh3fg/spack-src/src/acc/acc_ml_optimiser_impl.h, line 1858
ERROR: 
ERROR: 
No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 

             github.com/3dem/relion/issues  

 fn_img= PseudoSubtomo/job920/Subtomograms/TS146/173_data.mrc
 img_id= 0 adaptive_fraction= 0.999
 min_diff2= 3.40282e+38
Dumped data: error_dump_pdf_orientation, error_dump_pdf_orientation and error_dump_unsorted.
in: /tmp/sbgrid/spack-stage/spack-stage-relion-4.0.1-2orbjvmut3nz35g76te35djfkczeh3fg/spack-src/src/acc/acc_ml_optimiser_impl.h, line 1858
ERROR: 
No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 

             github.com/3dem/relion/issues  

in: /tmp/sbgrid/spack-stage/spack-stage-relion-4.0.1-2orbjvmut3nz35g76te35djfkczeh3fg/spack-src/src/acc/acc_ml_optimiser_impl.h, line 1858
ERROR: 
ERROR: 
No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 

             github.com/3dem/relion/issues  

Dumped data: error_dump_pdf_orientation, error_dump_pdf_orientation and error_dump_unsorted.
in: /tmp/sbgrid/spack-stage/spack-stage-relion-4.0.1-2orbjvmut3nz35g76te35djfkczeh3fg/spack-src/src/acc/acc_ml_optimiser_impl.h, line 1858
ERROR: 
No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 

             github.com/3dem/relion/issues  

in: /tmp/sbgrid/spack-stage/spack-stage-relion-4.0.1-2orbjvmut3nz35g76te35djfkczeh3fg/spack-src/src/acc/acc_ml_optimiser_impl.h, line 1858
ERROR: 
ERROR: 
No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 

             github.com/3dem/relion/issues  

Dumped data: error_dump_pdf_orientation, error_dump_pdf_orientation and error_dump_unsorted.
in: /tmp/sbgrid/spack-stage/spack-stage-relion-4.0.1-2orbjvmut3nz35g76te35djfkczeh3fg/spack-src/src/acc/acc_ml_optimiser_impl.h, line 1858
ERROR: 
No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 

             github.com/3dem/relion/issues  

in: /tmp/sbgrid/spack-stage/spack-stage-relion-4.0.1-2orbjvmut3nz35g76te35djfkczeh3fg/spack-src/src/acc/acc_ml_optimiser_impl.h, line 1858
ERROR: 
ERROR: 
No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 

             github.com/3dem/relion/issues  

follower 1 encountered error: === Backtrace  ===
/programs/x86_64-linux/relion/4.0.1_cu11.6/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x5f) [0x48944f]
/programs/x86_64-linux/relion/4.0.1_cu11.6/bin/relion_refine_mpi() [0x47167a]
/programs/x86_64-linux/relion/4.0.1_cu11.6/bin/relion_refine_mpi() [0x6336b4]
/programs/x86_64-linux/gcc/8.4.0/gcc_extlib/gcc-8.4.0-nycd/lib64/libgomp.so.1(+0x1662e) [0x7fd3e6d2362e]
/lib64/libpthread.so.0(+0x81ca) [0x7fd3e78911ca]
/lib64/libc.so.6(clone+0x43) [0x7fd3e676ae73]
==================
ERROR: 
No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 

             github.com/3dem/relion/issues  

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

jacob-r-anderson commented 10 months ago

I have looked more closely at the output files. The error is likely caused by the references generated after the first iteration, which contain NaNs. It is unclear why this is occurring now and not before.

Again, this is a case where a job that previously ran without error now throws one. The data inputs are identical, as shown above.
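
As a side note on the log, `min_diff2= 3.40282e+38` is the largest finite single-precision value, which would be consistent with a minimum that was never updated. The sketch below only illustrates IEEE 754 comparison behavior; `running_min` is a hypothetical stand-in, and it is an assumption (not confirmed from the RELION source) that the real minimum is tracked in a comparable way.

```python
import math
import struct

# Largest finite IEEE 754 single-precision value (bit pattern 0x7F7FFFFF).
FLT_MAX = struct.unpack("<f", bytes.fromhex("ffff7f7f"))[0]

def running_min(diff2_values, start=FLT_MAX):
    """Track a minimum the way a plain `if d < best` loop would."""
    best = start
    for d in diff2_values:
        if d < best:  # always False when d is NaN
            best = d
    return best

print(f"{FLT_MAX:.5e}")                        # 3.40282e+38
print(running_min([12.5, 3.0, 7.25]))          # 3.0
print(running_min([math.nan] * 3) == FLT_MAX)  # True
```

Every ordered comparison against NaN is false, so if all squared differences were NaN the minimum would stay at its starting value, and no orientation would score better than any other.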

After the first iteration, run_it001_half1_class001.mrc has NaNs.

header run_it001_half1_class001.mrc

 RO image file on unit   1 : run_it001_half1_class001.mrc     Size=     153532 K

 Number of columns, rows, sections .....     340     340     340
 Map mode ..............................    2   (32-bit real)              
 Start cols, rows, sects, grid x,y,z ...    0     0     0     340    340    340
 Pixel spacing (Angstroms)..............   3.360      3.360      3.360    
 Cell angles ...........................   90.000   90.000   90.000
 Fast, medium, slow axes ...............    X    Y    Z
 Origin on x,y,z .......................    0.000       0.000       0.000    
 Minimum density .......................          NaN
 Maximum density .......................          NaN
 Mean density ..........................          NaN
 tilt angles (original,current) ........   0.0   0.0   0.0   0.0   0.0   0.0
 Space group,# extra bytes,idtype,lens .        0        0        0        0

     1 Titles :
Relion    09-Jan-24  13:40:28      

Compared to:

RO image file on unit   1 : run_it000_half1_class001.mrc     Size=     153532 K

 Number of columns, rows, sections .....     340     340     340
 Map mode ..............................    2   (32-bit real)              
 Start cols, rows, sects, grid x,y,z ...    0     0     0     340    340    340
 Pixel spacing (Angstroms)..............   3.360      3.360      3.360    
 Cell angles ...........................   90.000   90.000   90.000
 Fast, medium, slow axes ...............    X    Y    Z
 Origin on x,y,z .......................    0.000       0.000       0.000    
 Minimum density ....................... -0.20950    
 Maximum density .......................  0.20464    
 Mean density ..........................  0.53888E-12
 RMS deviation from mean................  0.28042E-01
 tilt angles (original,current) ........   0.0   0.0   0.0   0.0   0.0   0.0
 Space group,# extra bytes,idtype,lens .        0        0        0        0
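
The header statistics only show that the minimum, maximum and mean are NaN, not how many voxels are affected. A minimal sketch for counting them follows, using only the Python standard library and the MRC2014 header layout (dimensions and mode in the first four 32-bit words, extended-header size at byte offset 92). It assumes a little-endian, mode-2 file; `count_nonfinite_mrc` is a hypothetical helper, and the synthetic 2x2x1 map stands in for a real file.

```python
import io
import math
import struct

def count_nonfinite_mrc(stream, chunk_values=1 << 20):
    """Return (non_finite, total) voxel counts for a mode-2 MRC stream.

    Assumes a little-endian file.
    """
    header = stream.read(1024)
    if len(header) < 1024:
        raise ValueError("file is shorter than the 1024-byte MRC header")
    nx, ny, nz, mode = struct.unpack_from("<4i", header, 0)
    (nsymbt,) = struct.unpack_from("<i", header, 92)  # extended-header bytes
    if mode != 2:
        raise ValueError(f"expected mode 2 (32-bit real), got mode {mode}")
    stream.seek(1024 + nsymbt)
    total = nx * ny * nz
    remaining, bad = total, 0
    while remaining > 0:
        raw = stream.read(4 * min(chunk_values, remaining))
        if not raw or len(raw) % 4:
            raise ValueError("voxel data is truncated")
        # Count NaN and +/-Inf values in this chunk.
        bad += sum(1 for (v,) in struct.iter_unpack("<f", raw)
                   if not math.isfinite(v))
        remaining -= len(raw) // 4
    return bad, total

# Synthetic 2x2x1 map with one NaN and one Inf, standing in for a real file.
header = bytearray(1024)
struct.pack_into("<4i", header, 0, 2, 2, 1, 2)  # nx, ny, nz, mode
voxels = struct.pack("<4f", 1.0, math.nan, 0.5, math.inf)
print(count_nonfinite_mrc(io.BytesIO(bytes(header) + voxels)))  # (2, 4)
```

For a real map, the same function can be called on a file opened in binary mode, for example `open("run_it001_half1_class001.mrc", "rb")`. A pure-Python loop over a 340x340x340 volume is slow, so an array library would be a better fit for routine use.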