3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Relion 3 refine3D error #457

Closed. YehudaHalfon closed this issue 4 years ago

YehudaHalfon commented 5 years ago

Hi all,

We are running RELION 3.0 and get this error:

ERROR: No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero weights (for all orientations). This should not happen, unless your data has very special characteristics. This has historically happened for some lower-precision calculations, but multiple fallbacks have since been implemented. Please report this error to the relion developers at

         github.com/3dem/relion/issues  

[rigatoni.weizmann.ac.il:116795] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[rigatoni.weizmann.ac.il:116795] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

I saw that most people get it in 2D classification, and I couldn't find an answer for 3D refinement.

Is there a fix for it?

Thanks,

Yehuda Halfon

mcianfrocco commented 5 years ago

I'd like to re-up this post from @YehudaHalfon.

I've been trying to install RELION-3.0 in a Singularity container to run on a supercomputer in the US that has K80 / P100 GPU nodes. I was given a recipe that, to the best of my knowledge, builds this container with the appropriate CUDA libraries, locations, etc., and I'm able to successfully compile RELION-3.0.

However, after it is compiled, I am unable to run 3D classification or 3D refinement, as I get the same error as @YehudaHalfon. BUT, I am able to run 2D classification.

I've tried 3.0.1 and 3.0.5 and have gotten the same issue. I'd also like to point out that if I try to run 3D refinement / 3D classification without GPU acceleration (but using the same code), I also get the same error as above.

It seems like there must be a mismatch involving a CUDA library somewhere?

I'll point out one more thing I noticed (I'm not sure whether @YehudaHalfon noticed it too): with this compiled version of RELION-3.0 that fails in 3D classification and refinement, I see this type of message in the standard output:

WARNING: There are only 0 particles in group 13 of half-set 1
WARNING: There are only 0 particles in group 14 of half-set 1
WARNING: There are only 0 particles in group 15 of half-set 1
WARNING: There are only 0 particles in group 16 of half-set 1
WARNING: There are only 0 particles in group 17 of half-set 1
WARNING: There are only 0 particles in group 18 of half-set 1
...

But if I run this on a version of RELION-3.0 that works locally, I do not see this message on the same dataset.
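For comparing the standard output of the working and failing builds, a small script can tally these warnings instead of reading them by eye. The following is only a minimal sketch: it assumes the warning lines look exactly like the ones quoted above, and the function names and `SAMPLE_LOG` are made up for illustration (they are not part of RELION).

```python
import re
from collections import defaultdict

# Matches the warning format quoted in this thread (assumed to be stable).
WARNING_RE = re.compile(
    r"WARNING: There are only (\d+) particles in group (\d+) of half-set (\d+)"
)


def summarize_group_warnings(log_text):
    """Map each half-set number to a sorted list of (group, particle_count)."""
    by_half = defaultdict(list)
    for match in WARNING_RE.finditer(log_text):
        count, group, half = (int(field) for field in match.groups())
        by_half[half].append((group, count))
    return {half: sorted(entries) for half, entries in by_half.items()}


def empty_groups(log_text):
    """Map each half-set number to the groups reported with 0 particles."""
    summary = summarize_group_warnings(log_text)
    return {
        half: [group for group, count in entries if count == 0]
        for half, entries in summary.items()
    }


# Hypothetical sample built from the warning lines quoted above.
SAMPLE_LOG = """\
WARNING: There are only 0 particles in group 13 of half-set 1
WARNING: There are only 0 particles in group 14 of half-set 1
WARNING: There are only 0 particles in group 15 of half-set 1
"""

print(empty_groups(SAMPLE_LOG))
```

Running the same function over the stdout of both builds would show whether the empty groups are confined to one half-set, which is what the quoted warnings suggest here.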

I realize that this is far outside the scope of 'normal' issues being reported, but any advice would be appreciated. I'm mostly wondering how to go about troubleshooting a library mismatch that could result in this error (if at all).

Thank you, Mike

biochem-fan commented 5 years ago

Is the compiler (GCC or ICC, and its version) the same between your local build and the version built within the container?

mcianfrocco commented 5 years ago

Thank you for your response - within the singularity container, the gcc version is 5.4.0. However, on the cluster where I'm trying to run this, the gcc version is 4.9.2.

In principle, since I compiled RELION within the container and I'm running it within the container, I would expect the gcc version to be the same. Unless, of course, some software running on the cluster is conflicting with it. I'm still learning about containers and I'll look into this.
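One way to look for such a conflict is to compare which shared libraries the binary resolves to in each environment. The sketch below parses the text printed by `ldd` and reports libraries that resolve differently between two runs. It is an illustration under assumptions, not a RELION tool: the helper names are invented, the sample outputs and paths are made up, and it assumes the usual `name => path (address)` layout of `ldd` output on Linux.

```python
import re
import subprocess

# One "name => target" line of ldd output; target is a path or "not found".
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(not found|\S+)")
ABSENT = "<absent>"  # marker for a library that one environment does not list


def run_ldd(binary_path):
    """Return the raw ldd output for a binary (requires ldd on the PATH)."""
    result = subprocess.run(
        ["ldd", binary_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_ldd(output):
    """Map each library name to the path it resolved to, or 'not found'."""
    resolved = {}
    for line in output.splitlines():
        match = LDD_LINE.match(line)
        if not match:
            continue  # e.g. the vdso or dynamic loader lines without "=>"
        name, target = match.groups()
        if target.startswith("("):
            continue  # some ldd versions print "name =>  (address)" for the vdso
        resolved[name] = target
    return resolved


def diff_ldd(output_a, output_b):
    """Return {library: (target_in_a, target_in_b)} for every difference."""
    a, b = parse_ldd(output_a), parse_ldd(output_b)
    return {
        name: (a.get(name, ABSENT), b.get(name, ABSENT))
        for name in sorted(set(a) | set(b))
        if a.get(name, ABSENT) != b.get(name, ABSENT)
    }


# Made-up outputs for illustration only; real paths will differ.
LOCAL_LDD = (
    "\tlinux-vdso.so.1 (0x00007ffc)\n"
    "\tlibfftw3.so.3 => /usr/lib/x86_64-linux-gnu/libfftw3.so.3 (0x00007f02)\n"
)
CONTAINER_LDD = (
    "\tlibfftw3.so.3 => /opt/host/lib/libfftw3.so.3 (0x00007f12)\n"
    "\tlibmpi.so.12 => not found\n"
)

for name, (local, container) in diff_ldd(LOCAL_LDD, CONTAINER_LDD).items():
    print(f"{name}: local={local} container={container}")
```

In practice the two outputs would come from running `ldd` on the same executable (for example `relion_refine_mpi`) once on the working machine and once inside the container on the cluster, saving each to a file, and passing both texts to `diff_ldd`. A library that resolves to a path outside the container image would be a hint that something from the host is leaking in.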

Thank you, Mike

biochem-fan commented 5 years ago

It should work, but it might be a bug in RELION's code that surfaces only with a newer compiler (like #453).

fredward commented 5 years ago

We are seeing a similar error on a new workstation with Ubuntu 18.04, gcc 7.4.0, and CUDA 10.1. The dataset 3d-refines fine on an older machine with Ubuntu 16.04, gcc 5.4.0, and CUDA 8.0.

Does anyone have any more insight into these issues? I am going to try another rebuild and make sure everything was configured correctly.

Thanks much, Fred

EDIT: I should add that the benchmark ribosome dataset works fine for class3D.

fredward commented 5 years ago

We traced our error to using an odd number of MPI slaves, as in #289 (we have 3 GPUs). Everything seems to be working great now if we stick to using 2 cards, or 2 ranks per card on all 3.
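The arithmetic behind this can be checked before launching a job. The sketch below assumes that one MPI rank acts as the master and every remaining rank is a slave, so the slave count is the total rank count minus one. `describe_mpi_layout` is a hypothetical helper written for this thread, not a RELION function.

```python
def describe_mpi_layout(n_mpi_procs, n_gpus):
    """Summarize the slave count for a given number of MPI ranks and GPUs.

    Assumption: exactly one rank is the master and the rest are slaves.
    """
    if n_mpi_procs < 2:
        raise ValueError("need at least one slave rank besides the master")
    if n_gpus < 1:
        raise ValueError("need at least one GPU")
    n_slaves = n_mpi_procs - 1
    return {
        "slaves": n_slaves,
        "slaves_even": n_slaves % 2 == 0,
        "slaves_per_gpu": n_slaves / n_gpus,
    }


# 4 ranks on 3 GPUs gives 3 slaves (odd), the failing case described above.
# 3 ranks on 2 GPUs and 7 ranks on 3 GPUs give 2 and 6 slaves (even).
for procs, gpus in [(4, 3), (3, 2), (7, 3)]:
    print(procs, gpus, describe_mpi_layout(procs, gpus))
```

Under that assumption, the two working setups described above correspond to `mpirun -n 3` with 2 cards and `mpirun -n 7` with 2 ranks on each of 3 cards.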