Open daniel-s-d-larsson opened 3 years ago
I thought I fixed this issue at some point in 3.0.x. I am surprised to see this happening in 3.1.0. Now this is harder to debug, because this does not happen locally.
Just to make sure, can you try the latest commit in the ver3.1 branch?
This example is running at a facility where I cannot easily recompile myself, so unfortunately I cannot try the latest commit right now, and I don't have things set up for testing on my local GPU workstation either. I know this has happened to me before on my local machine, but perhaps that was in version 3.0.x.
> at a facility, where I cannot easily recompile myself
You don't need root permission to compile RELION.
Just to follow up, version 3.1.1 seems to solve the issue for me.
I ran into the problem again. The change from before was that I set "Use parallel disc I/O" to "No". With the option set to "Yes" it works as intended.
Is this running over multiple nodes?
No, I run on a single node. These are the specs of the node:
Intel Xeon Gold 6132 (28 threads)
2 x NVidia V100
192 GB RAM
Infiniband Network attached storage
166.72 GB SSD scratch (way too wimpy for cryo-EM needs...!)
I can fit 160k particles on the scratch and have to access the rest over the network.
We do plan to refactor the scratch system as it has many problems (e.g. #494).
I am afraid I might not be able to fix your problem until the refactoring, as I cannot reproduce this locally and it affects only a small number of use cases (huge dataset with wimpy SSD).
This is understandable. I wanted to report back my finding in case it would help others with the same problem.
My particle stack is too large for the scratch disk, so only part of it can be transferred. But during the run, relion_refine_mpi does not find the particles correctly during the initial estimation of the noise spectra. See error messages below.
When omitting the --scratch_dir flag, things work as expected. Similar runs with smaller particle stacks copied to /scratch also work as expected.
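For anyone who wants to automate the workaround of dropping --scratch_dir when the stack does not fit, here is a minimal sketch. It assumes the particle stacks live under a single directory whose size can be measured with du. The helper names (fits_on_scratch, scratch_flag, dir_kib, free_kib) and the 10% safety margin are my own assumptions and are not part of RELION.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of RELION): only pass --scratch_dir
# when the whole particle stack fits into the free scratch space.

# fits_on_scratch STACK_KIB FREE_KIB [MARGIN_PERCENT]
# Succeeds if the stack fits while leaving MARGIN_PERCENT (default 10)
# of the free space unused.
fits_on_scratch() {
    local stack_kib=$1 free_kib=$2 margin=${3:-10}
    local usable_kib=$(( free_kib * (100 - margin) / 100 ))
    [ "$stack_kib" -le "$usable_kib" ]
}

# scratch_flag STACK_KIB FREE_KIB SCRATCH_DIR
# Prints "--scratch_dir SCRATCH_DIR" if the stack fits, nothing otherwise.
scratch_flag() {
    if fits_on_scratch "$1" "$2"; then
        printf '%s\n' "--scratch_dir $3"
    fi
}

# dir_kib DIR: total size of DIR in KiB.
dir_kib() {
    du -sk "$1" | awk '{ print $1 }'
}

# free_kib DIR: free space on the filesystem holding DIR, in KiB
# (the "Available" column of POSIX-format df output).
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}
```

With these defined, the flag could be computed as `scratch_flag "$(dir_kib Subtract/job124)" "$(free_kib /scratch)"` and appended to the relion_refine_mpi command in place of the hard-coded --scratch_dir /scratch. Whether Subtract/job124 really holds the whole stack is an assumption you should check for your own project.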
Environment:
Dataset:
Job options:
Full command (see note.txt in the job directory):
srun relion_refine_mpi --o Class3D/job127/run --i Subtract/job124/particles_subtracted.star --ref Refine3D/job112/run_class001.mrc --firstiter_cc --ini_high 30 --dont_combine_weights_via_disc --pool 100 --pad 1 --skip_gridding --ctf --ctf_corrected_ref --iter 100 --tau2_fudge 40 --particle_diameter 320 --K 6 --flatten_solvent --solvent_mask MaskCreate/job035/mask_0.82Apix_512px.mrc --skip_align --sym C1 --norm --scale --j 14 --pipeline_control Class3D/job127/ --scratch_dir /scratch
Error message:
run.out:
run.err: