3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

"No orientation found" error during refinement of Bayesian polished particles #717

Closed martinpacesa closed 3 years ago

martinpacesa commented 3 years ago

Hello!

I am at the last step of 3D refinement of my map (170k particle dataset, going to 3.7 Å). At this point, I have performed Bayesian polishing training with 40k particles (overkill) and got parameter values of 1.014 8130 1.92. When I train with 5k particles, I get values of 1.128 7875 2.13. I proceeded with polishing using the values from the 40k training and then used the shiny.star output for 3D refinement, with a mask and model obtained from previous rounds of refinement.

I have tried the refinement three times on two different processing machines, with either 2 GPUs, 3 MPI processes and 4 threads or 4 GPUs, 5 MPI processes and 10 threads, and each time it fails with the following error during the 10th iteration. I have used this model and mask in previous refinements; the only differences this time are that I am using FSC solvent flattening and the shiny particles. Would you suggest rerunning the polishing with the values from the 5k training?

WARNING: FSC curve between unmasked maps never drops below 0.8. Using unmasked FSC as FSC_true...
WARNING: This message should go away during the later stages of refinement!
fn_img= img_id= 0 adaptive_fraction= 0.999 min_diff2= 1129.05
Dumped data: error_dump_pdf_orientation, error_dump_pdf_orientation and error_dump_unsorted.
in: /tmp/sbgrid/spack-stage/spack-stage-relion-3.1.0-5p3br6ikgbyq7aag6oilj2jhmtxzp5kq/spack-src/src/acc/acc_ml_optimiser_impl.h, line 1870
ERROR: No orientation was found as better than any other.
A particle image was compared to the reference and resulted in all-zero weights (for all orientations). This should not happen, unless your data has very special characteristics. This has historically happened for some lower-precision calculations, but multiple fallbacks have since been implemented. Please report this error to the relion developers at

Environment:
OS: CentOS 7, SBGrid environment
MPI runtime: OpenMPI 3.1.6
RELION version: 3.1.0-commit-GITDIR, cu10.2
Memory: 512 GB
GPU: 4x GeForce RTX 2080 Ti

Dataset:
Box size: 384 px
Pixel size: 0.68 Å/px
Number of particles: 175k
Description: Globular protein of 150 kDa

Job options:
Type of job: Refine3D
Number of MPI processes: 5
Number of threads: 10
Full command:
`which relion_refine_mpi` --o Refine3D/job100/run --auto_refine --split_random_halves --i Polish/job091/shiny.star --ref Refine3D/job085/run_class001.mrc --ini_high 50 --dont_combine_weights_via_disc --no_parallel_disc_io --preread_images --pool 30 --pad 2 --skip_gridding --ctf --ctf_corrected_ref --particle_diameter 185 --flatten_solvent --zero_mask --solvent_mask MaskCreate/job088/mask.mrc --solvent_correct_fsc --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 4 --gpu "1:2" --pipeline_control Refine3D/job100/

martinpacesa commented 3 years ago

Here is the output of error_dump_pdf_offset:

0 8.50363e-05 6.07781e-05 4.24327e-05 7.99523e-05 6.39454e-05 0.000185883 0.000140132 0.000138112 0.000232692 0.000219286 0.000261309 0.000203861 0.000212719 0.000240111 0.000138939 0.000170298 0.000211036 0.000221938 0.000254487 0.000191029 0.000278778 0.000313391 0.000248341 0.000203394 0.000288252 0.000290926 0.00033905 0.000304973 0.000313837 0.000370047 0.000349922 0.000342475 0.000363883 0.000444821 0.000379072 0.000430447
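A dump like the one above is easier to inspect with a few summary numbers (how many values, how many exact zeros, whether anything is NaN or infinite). The following is only a minimal sketch: it assumes the dump is plain whitespace-separated text, as pasted here, and `summarize_dump` is a hypothetical helper, not part of RELION.

```python
import math


def summarize_dump(text):
    """Summarise a whitespace-separated dump of numeric values.

    Assumption: the dump is plain text with one number per token,
    as in the values pasted in this thread.
    """
    values = [float(token) for token in text.split()]
    finite = [v for v in values if math.isfinite(v)]
    return {
        "count": len(values),
        "zeros": sum(1 for v in values if v == 0.0),
        "non_finite": len(values) - len(finite),
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
    }


# First few values of the dump above, used only as an illustration.
dump = "0 8.50363e-05 6.07781e-05 4.24327e-05"
summary = summarize_dump(dump)
print(summary)
```

To use it on a real dump, read the file contents into a string first and pass that string to `summarize_dump`.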

biochem-fan commented 3 years ago

See https://www3.mrc-lmb.cam.ac.uk/relion/index.php/FAQs#Refinement_or_Classification_crash_saying_.22No_orientation_was_found.22.

Try Class2D on shiny particles; sometimes detector artifacts concealed during initial motion correction (e.g. hot pixels and dead lines) come back after Polish.

Box size: 384 px Pixel size: 0.68 Å/px

Although this is not the cause of your problem, working with such a small pixel size is a waste of storage and processing time unless you are expecting 1.4 Å. Down-sample to a reasonable pixel size during Extraction and Polish.
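The arithmetic behind this advice can be sketched as follows. The Nyquist limit is twice the pixel size, so 0.68 Å/px allows at most about 1.36 Å. The target of 1.0 Å/px below is a hypothetical value chosen only for illustration, not a recommendation made in this thread, and both function names are made up for this sketch.

```python
import math


def nyquist_resolution(pixel_size):
    """Best resolution (in Å) representable at a given pixel size (Å/px)."""
    return 2.0 * pixel_size


def rescaled_box(box, pixel_size, target_pixel_size):
    """Box size (px) covering the same field of view at a coarser pixel size.

    The result is rounded up to the next even number, since even box
    sizes are the usual choice.
    """
    exact = box * pixel_size / target_pixel_size
    return 2 * math.ceil(exact / 2)


box, pixel_size = 384, 0.68      # values from the original post
target_pixel_size = 1.0          # hypothetical target, for illustration only

new_box = rescaled_box(box, pixel_size, target_pixel_size)
actual_pixel_size = box * pixel_size / new_box

print(f"Nyquist at {pixel_size} Å/px: {nyquist_resolution(pixel_size):.2f} Å")
print(f"Rescaled box: {new_box} px at {actual_pixel_size:.4f} Å/px")
print(f"Nyquist after rescaling: {nyquist_resolution(actual_pixel_size):.2f} Å")
```

Because the box is rounded to an even integer, the pixel size actually obtained differs slightly from the requested one; it is the ratio of the original field of view to the new box size.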

martinpacesa commented 3 years ago

Thank you for your answer. I apologise; I googled for a while to find a similar issue but didn't come across the FAQ. I will try both rerunning the Bayesian polishing with different values and 2D classification, and report back.

martinpacesa commented 3 years ago

Reclassification solved the issue and improved the map to 3.3 Å! Thank you!

martinpacesa commented 3 years ago

After this, I did another round of aberration/beam tilt refinement, anisotropic magnification refinement and per-particle CTF refinement. I then tried auto-refine and got the same problem. The issue renders the server GPUs unusable, and the machine has to be physically unplugged before it recognises them again. 2D reclassification fixes this. Is there a way to add a check for such corrupted data?
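For illustration, a check of this kind could be sketched outside RELION along the following lines. This is not an existing RELION feature: the function names, the 6-sigma threshold and the in-memory list-of-rows image format are all assumptions made for this sketch, and real particle images would first have to be loaded from the stack files with a suitable reader. It looks for the two artifact types mentioned above, hot pixels and dead lines.

```python
import statistics


def find_hot_pixels(image, n_sigma=6.0):
    """Return (row, col) of pixels further than n_sigma SDs from the mean.

    `image` is a list of rows of numbers (hypothetical in-memory format).
    """
    flat = [value for row in image for value in row]
    mean = statistics.fmean(flat)
    sd = statistics.pstdev(flat)
    if sd == 0.0:
        return []
    return [
        (y, x)
        for y, row in enumerate(image)
        for x, value in enumerate(row)
        if abs(value - mean) > n_sigma * sd
    ]


def find_constant_rows(image):
    """Return indices of rows whose pixels all share one value (dead lines)."""
    return [y for y, row in enumerate(image) if len(set(row)) == 1]


# Synthetic 8x8 test image with small varying values in the range -2..2.
image = [[((x * 7 + y * 13) % 5) - 2 for x in range(8)] for y in range(8)]
image[3][4] = 1000      # simulated hot pixel
image[6] = [0] * 8      # simulated dead line

hot = find_hot_pixels(image)
dead = find_constant_rows(image)
print("hot pixels:", hot)
print("constant rows:", dead)
```

A mean and standard deviation threshold is the simplest possible choice; since a strong outlier inflates the standard deviation itself, a median-based statistic may be more robust on real data.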

biochem-fan commented 3 years ago

The GPU issue has nothing to do with RELION. It is more of a kernel and driver issue.

martinpacesa commented 3 years ago

This issue persists with one particular dataset of ours, this time even after 2D reclassification. Same error as before. The 3D auto-refinement keeps hanging without any error, but when I look at the processes on the command line I see that a UVM_GPU3_BH process appears, and the GPUs are no longer recognised by the system when I check nvidia-smi. I have tried this on 3 separate machines and it happens on all of them. Our current NVIDIA driver version is 450.80.02, with CUDA 11.0. I just noticed that the RELION version we are running through SBGrid is 3.1.1_cu9.2 rather than the default 3.1.1. Could this cause the problems?

martinpacesa commented 3 years ago

I forgot to mention: I have already had this error on 3 separate datasets.