3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Relion (v4 or v5) crashing after resuming Refine3D from optimiser.star #1046

Open krogala opened 11 months ago

krogala commented 11 months ago

Dear Developers,

Any insights into how to fix this type of Refine3D crash would be most helpful!

Many thanks,

Kacper

ERROR DESCRIPTION: Any attempt to resume Refine3D from an optimiser.star file ends with the following error for me. This is new, and I noticed that it only started happening after a new Relion compilation.

HealpixSampling::writeBildFileOrientationalDistribution XSIZE(pdf_direction) != rot_angles.size

EXTRA INFO: Here are the modules that I'm using to compile (and run) Relion v4.0.1:

cmake/3.20.3
fftw/3.3.10 (also tried with -DFORCE_OWN_FFTW=ON -- no difference)
fltk/1.3.4
libtiff/4.5.0
libpng/1.6.29
x11/7.7
ghostscript/9.53.2
openmpi/4.1.2
cuda/11.5.0 (also tried with 12.2.0 -- no difference)

From CMakeCache.txt, the gcc compilers were as follows:

gcc/10.1.0
mpicc/4.1.2

I even went ahead and compiled xpdf/4.04 with qt/5.9.1 -- just in case this was a PDF reader error of sorts -- no difference.

Below are the full run.out and run.err files.

run.err

The following warnings were encountered upon command-line parsing: 
WARNING: Option --relax_sym is not a valid RELION argument
 XSIZE(pdf_direction)= 12288 rot_angles.size()= 6176
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
in: /home/groups/rogala/SOFTWARE/relion/v4.0.1/relion/src/healpix_sampling.cpp, line 2003
ERROR: 
HealpixSampling::writeBildFileOrientationalDistribution XSIZE(pdf_direction) != rot_angles.size()!
=== Backtrace  ===
/home/groups/rogala/SOFTWARE/relion/v4.0.1/relion/build/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x63) [0x4b4fd3]
/home/groups/rogala/SOFTWARE/relion/v4.0.1/relion/build/bin/relion_refine_mpi(_ZN15HealpixSampling38writeBildFileOrientationalDistributionER13MultidimArrayIdER8FileNameddPK8Matrix2DIdEPK8Matrix1DIdEdd+0xf94) [0x5a6864]
/home/groups/rogala/SOFTWARE/relion/v4.0.1/relion/build/bin/relion_refine_mpi(_ZN7MlModel5writeE8FileNameR15HealpixSamplingbb+0x179e) [0x64d31e]
/home/groups/rogala/SOFTWARE/relion/v4.0.1/relion/build/bin/relion_refine_mpi(_ZN11MlOptimiser5writeEbbbbi+0x335) [0x6732e5]
/home/groups/rogala/SOFTWARE/relion/v4.0.1/relion/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x2b2) [0x4dd9a2]
/home/groups/rogala/SOFTWARE/relion/v4.0.1/relion/build/bin/relion_refine_mpi(main+0x4e) [0x4a3fde]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f3ed7b16555]
/home/groups/rogala/SOFTWARE/relion/v4.0.1/relion/build/bin/relion_refine_mpi() [0x4a719e]
==================
ERROR: 
HealpixSampling::writeBildFileOrientationalDistribution XSIZE(pdf_direction) != rot_angles.size()!
srun: error: sh03-17n09: tasks 0-1: Killed
srun: Terminating StepId=37626753.0
srun: error: sh03-17n09: task 2: Exited with exit code 1

run.out

NVIDIA GRAPHICS CARD INFO:
Sun Dec 10 23:20:42 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     On  | 00000000:46:00.0 Off |                    0 |
|  0%   34C    P8              31W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     On  | 00000000:C6:00.0 Off |                    0 |
|  0%   31C    P8              27W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Finished loading requested LMOD environments: relion/4.0.1.
Starting job execution at Sun Dec 10 23:20:44 PST 2023
Running on hosts:sh03-17n09
Running on 1 nodes.
SLURM job ID is 37626753.
RELION version: 4.0.1-commit-db9717 
Precision: BASE=double

 Reading in optimiser.star ...
 === RELION MPI setup ===
 + Number of MPI processes             = 3
 + Leader  (0) runs on host            = sh03-17n09.int
 + Follower     1 runs on host            = sh03-17n09.int
 + Follower     2 runs on host            = sh03-17n09.int
 =================
 uniqueHost sh03-17n09.int has 2 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 2 mapped to device 1
 Running CPU instructions in double precision. 

 RELION version: 4.0.1-commit-db9717
 exiting with an error ...

biochem-fan commented 11 months ago

> This is new, and I noticed that it only started happening after a new Relion compilation.

You wrote "4.0.1-commit-db9717". Was this commit working fine before you recompiled your binary? Or were you using an earlier commit of RELION 4.0.1?

Also: does this happen on any Refine3D jobs, or on a particular job? In the latter case, does it happen if you continue from earlier iterations?

krogala commented 10 months ago

Thank you for your quick response, @biochem-fan!

Indeed, you pointed out the exact issue that's been happening -- regarding only specific jobs throwing this error. I have now spent some time doing extensive testing of this phenomenon, and it looks like all "regular" Relion Refine3D jobs run well (and continue properly from optimiser.star) on either: v4.0.0, v4.0.1, or v5.0 -- with and without Blush regularization.

The only instance where I'm currently seeing this HealpixSampling error is when working with particles imported from cryoSPARC (using pyem's csparc2star.py). Originally, these particles come from Relion (after polishing), and were then temporarily moved to cryoSPARC for some 3DVA work. I want to bring them back to Relion, and technically, Refine3D jobs with these particles run to completion, but only if the whole run completes without interruption. The moment a job crashes (e.g. from running out of VRAM) and is resumed from optimiser.star, I get the Healpix error. It makes no difference whether I choose the latest optimiser.star or an earlier one.

Any suggestions about what to look for would be great! I will try examining individual columns to check whether some of these values are causing the problem. I tried replacing the entire data_optics table, but that didn't help.
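For the column-by-column check mentioned above, a minimal Python sketch along these lines could summarise each column of a .star file so that a working file and a problematic one can be compared side by side. This is only an illustration: the parser is deliberately simplistic (loop_ tables with whitespace-separated rows only, no quoted strings), the function names `read_star_loops` and `summarize` are made up for this example, and the sample values below are invented, not taken from the actual dataset.

```python
from collections import defaultdict


def read_star_loops(text):
    """Parse loop_ tables from STAR text into {block: {label: [values]}}.

    Deliberately minimal: handles data_ blocks, loop_ headers and
    whitespace-separated rows only (no quoted strings, no key-value pairs).
    """
    tables = {}
    block, labels, in_header = None, [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("data_"):
            # a new data block starts; forget the previous loop header
            block, labels, in_header = line, [], False
            tables[block] = defaultdict(list)
        elif line == "loop_":
            labels, in_header = [], True
        elif line.startswith("_") and in_header:
            # header line such as "_rlnAngleRot #2": keep only the label
            labels.append(line.split()[0])
        elif labels and block is not None:
            # data row: assign each field to its column
            in_header = False
            for label, value in zip(labels, line.split()):
                tables[block][label].append(value)
    return tables


def summarize(columns):
    """Return {label: (min, max)} for numeric columns, else a distinct-value count."""
    out = {}
    for label, values in columns.items():
        try:
            nums = [float(v) for v in values]
            out[label] = (min(nums), max(nums))
        except ValueError:
            out[label] = f"{len(set(values))} distinct values"
    return out


# Invented example content, only to show the expected layout
example = """
data_optics

loop_
_rlnOpticsGroup #1
_rlnImagePixelSize #2
1 0.83

data_particles

loop_
_rlnImageName #1
_rlnAngleRot #2
_rlnOpticsGroup #3
000001@Extract/job001/a.mrcs 12.5 1
000002@Extract/job001/a.mrcs -170.0 1
"""

tables = read_star_loops(example)
for block, columns in tables.items():
    print(block, summarize(columns))
```

Running the same summary on the original Relion particles and on the csparc2star.py output would show at a glance whether any column has an out-of-range or non-numeric value. For real work, a dedicated STAR parser is probably a safer choice than this sketch.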

biochem-fan commented 10 months ago

Thank you very much for the detailed investigation. Unfortunately, I have no idea, as I don't use CS at all. I suggest you report this to the CCPEM mailing list. Others might be facing the same issue and have workarounds.

krogala commented 10 months ago

Alright! I found what the issue is. Rather unexpected, because it seems to have nothing to do with where the .star file came from. In this case, the cryoSPARC-imported .star file checks out.

The problem is the --relax_sym parameter. Essentially, when resuming from optimiser.star any Refine3D job that was originally started with --relax_sym C2, I get the following warning and crash.

WARNING: Option --relax_sym is not a valid RELION argument

XSIZE(pdf_direction)= 192 rot_angles.size()= 96
in: /home/groups/rogala/SOFTWARE/relion/v5.0/relion/src/healpix_sampling.cpp, line 2003
ERROR: 
HealpixSampling::writeBildFileOrientationalDistribution XSIZE(pdf_direction) != rot_angles.size()!

No other parameter seems to trigger it. Also, when resuming the job with the --relax_sym parameter left empty, the job crashes as well.

This is true as of version 5.0-beta-0-commit-90d239.

The optimiser.star files are practically identical between the two "treatments" with/without the relax_sym parameter specified.
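A comparison like this can be reproduced with Python's standard difflib module. The sketch below is only an illustration: the helper name `star_diff` and the file labels are made up, and the two short texts are placeholders (including the hypothetical `_exampleLabel` entry), not the contents of the real files.

```python
import difflib


def star_diff(text_a, text_b, name_a="with_relax_sym", name_b="without_relax_sym"):
    """Return unified-diff lines between two STAR file contents.

    Trailing whitespace is stripped first so that padding differences
    do not show up as changes.
    """
    a = [line.rstrip() for line in text_a.splitlines()]
    b = [line.rstrip() for line in text_b.splitlines()]
    return list(difflib.unified_diff(a, b, fromfile=name_a, tofile=name_b, lineterm=""))


# Placeholder contents; in practice these would come from e.g.
# pathlib.Path("<job>/run_it010_sampling.star").read_text() (path is hypothetical)
text_with = "data_sampling_general\n_rlnHealpixOrder 2\n_exampleLabel A\n"
text_without = "data_sampling_general\n_rlnHealpixOrder 2\n_exampleLabel B\n"

for line in star_diff(text_with, text_without):
    print(line)
```

Lines prefixed with `-` appear only in the first file and lines prefixed with `+` only in the second, which makes it easy to confirm that two files are "practically identical" or to isolate the few entries that differ.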

However, for those sampling.star files that have _rlnHealpixOrder=2, I can see the following difference:

Is this the difference that the error is pointing to?
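As a sanity check on the numbers in the two error messages: a HEALPix grid of order k has N_side = 2^k and 12 * N_side^2 pixels on the full sphere. The short Python sketch below (function name `healpix_npix` is made up for this example) compares that count with the reported sizes. The interpretation, that pdf_direction covers the full sphere while rot_angles covers roughly half of it, as a twofold symmetry would give, is a guess from the arithmetic and has not been confirmed against the RELION source.

```python
def healpix_npix(order: int) -> int:
    """Number of HEALPix pixels on the full sphere for a given order (N_side = 2**order)."""
    nside = 2 ** order
    return 12 * nside * nside


# (XSIZE(pdf_direction), rot_angles.size()) pairs copied from the two error messages
reported = [(192, 96), (12288, 6176)]

for xsize, n_rot in reported:
    # find the order whose full-sphere pixel count matches XSIZE(pdf_direction)
    order = next(o for o in range(10) if healpix_npix(o) == xsize)
    print(f"order={order} full_sphere={healpix_npix(order)} "
          f"pdf_direction={xsize} rot_angles={n_rot} ratio={xsize / n_rot:.2f}")

# prints:
# order=2 full_sphere=192 pdf_direction=192 rot_angles=96 ratio=2.00
# order=5 full_sphere=12288 pdf_direction=12288 rot_angles=6176 ratio=1.99
```

In both cases XSIZE(pdf_direction) equals the full-sphere pixel count for its order (192 matches _rlnHealpixOrder=2), while rot_angles.size() is about half of it. That would be consistent with the stored direction PDF and the regenerated sampling disagreeing about whether symmetry is applied once --relax_sym is no longer recognised on resume.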