3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Error with parts of the particle stack on the scratch disk #721

Open · daniel-s-d-larsson opened this issue 3 years ago

daniel-s-d-larsson commented 3 years ago

My particle stack is too large for the scratch disk, so only part of it can be transferred. During the run, however, relion_refine_mpi does not find the particles correctly during the initial estimation of the noise spectra. See the error messages below.

When I omit the --scratch_dir flag, things work as expected. Similar runs with smaller particle stacks copied to /scratch also work as expected.
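For what it is worth, this is how I would expect the partial-scratch case to behave. The sketch below is purely illustrative (not RELION's actual code), and the function and argument names are made up:

# Illustrative sketch, not RELION's implementation: how a per-particle read
# could fall back to the original stack when only the first n_on_scratch
# particles were copied to the scratch disk. All names are hypothetical.
def resolve_particle_image(index, n_on_scratch, scratch_stack, original_stack):
    """Return (stack_path, slice_index) for a 1-based particle index."""
    if index <= n_on_scratch:
        # This particle was copied, so read it from the local scratch stack.
        return scratch_stack, index
    # This particle did not fit on scratch, so it should be read from its
    # original stack over the network, at its original slice position.
    return original_stack, index

In the failing run it looks as if the lookup keeps pointing at the scratch stack even for particles beyond the copied range, which would explain the error below.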

Environment:

Dataset:

Job options:

srun relion_refine_mpi --o Class3D/job127/run --i Subtract/job124/particles_subtracted.star --ref Refine3D/job112/run_class001.mrc --firstiter_cc --ini_high 30 --dont_combine_weights_via_disc --pool 100 --pad 1 --skip_gridding --ctf --ctf_corrected_ref --iter 100 --tau2_fudge 40 --particle_diameter 320 --K 6 --flatten_solvent --solvent_mask MaskCreate/job035/mask_0.82Apix_512px.mrc --skip_align --sym C1 --norm --scale --j 14 --pipeline_control Class3D/job127/ --scratch_dir /scratch

Error message:

run.out:

RELION version: 3.1.0-commit-GITDIR 
Precision: BASE=double, CUDA-ACC=single 

 === RELION MPI setup ===
 + Number of MPI processes             = 3
 + Number of threads per MPI process   = 14
 + Total number of threads therefore   = 42
 + Master  (0) runs on host            = b-cn0303.hpc2n.umu.se
 + Slave     1 runs on host            = b-cn0303.hpc2n.umu.se
 =================
 + Slave     2 runs on host            = b-cn0303.hpc2n.umu.se
 Running CPU instructions in double precision. 
 + On host b-cn0303.hpc2n.umu.se: free scratch space = 166.72 Gb.
 Copying particles to scratch directory: /scratch/relion_volatile/
24.58/24.58 min ............................................................~~(,_,">
 For optics_group 1, there are 160481 particles on the scratch disk.
 Estimating initial noise spectra 
000/??? sec ~~(,_,">                                                          [oo]

run.err:

 Warning: scratch space full on b-cn0303.hpc2n.umu.se. Remaining 218312 particles will be read from where they were.
in: /scratch/eb-buildpath/RELION/3.1.0/fosscuda-2019b/relion-3.1.0/src/rwMRC.h, line 192
ERROR: 
readMRC: Image number 189398 exceeds stack size 160481 of image 189398@/scratch/relion_volatile/opticsgroup1_particles.mrcs
=== Backtrace  ===
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x5f) [0x48452f]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN5ImageIdE7readMRCElbRK8FileName+0x74c) [0x4b888c]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN5ImageIdE5_readERK8FileNameR13fImageHandlerblbb+0x1ec) [0x4bc33c]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN11MlOptimiser41calculateSumOfPowerSpectraAndAverageImageER13MultidimArrayIdEb+0x3c9) [0x600589]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi41calculateSumOfPowerSpectraAndAverageImageER13MultidimArrayIdE+0x2c) [0x49e7ac]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x971) [0x4a5d31]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(main+0x4a) [0x4723da]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x149a14f84840]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_start+0x29) [0x474d29]
==================
ERROR: 
readMRC: Image number 189398 exceeds stack size 160481 of image 189398@/scratch/relion_volatile/opticsgroup1_particles.mrcs
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 11040933.0 ON b-cn0303 CANCELLED AT 2021-01-05T09:29:37 ***
srun: error: b-cn0303: task 0: Killed
srun: error: b-cn0303: task 1: Killed
srun: error: b-cn0303: task 2: Exited with exit code 1
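To spell out the inconsistency in the numbers above, here is a quick sanity check using only the counts printed in run.out and run.err:

# Counts taken from the log output above.
n_on_scratch = 160481      # particles copied to /scratch for optics group 1
n_left_behind = 218312     # particles RELION said it would read from their original location
n_total = n_on_scratch + n_left_behind   # 378793 particles in total

failing_index = 189398
# 160481 < 189398 <= 378793: this particle was never copied to scratch and
# should therefore have been read over the network, yet the error shows it
# being requested from the 160481-image scratch stack.
assert n_on_scratch < failing_index <= n_total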
biochem-fan commented 3 years ago

I thought I fixed this issue at some point in 3.0.x, so I am surprised to see it happening in 3.1.0. This makes it harder to debug, because I cannot reproduce it locally.

Just to make sure, can you try the latest commit in the ver3.1 branch?

daniel-s-d-larsson commented 3 years ago

This job is running at a facility where I cannot easily recompile RELION myself, so unfortunately I cannot try the latest commit at the moment, and I don't currently have things set up for testing on my local GPU workstation either. I know this has happened to me before on my local machine, but perhaps that was in version 3.0.x.

biochem-fan commented 3 years ago

at a facility, where I cannot easily recompile myself

You don't need root permission to compile RELION.

daniel-s-d-larsson commented 3 years ago

Just to follow up, version 3.1.1 seems to solve the issue for me.

daniel-s-d-larsson commented 3 years ago

I ran into the problem again. The difference from before is that I set "Use parallel disc I/O" to "No". With that option set to "Yes", it works as intended.
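For reference, I submit these jobs with srun; if I understand the GUI correctly, setting "Use parallel disc I/O" to "No" corresponds to adding the --no_parallel_disc_io flag to the relion_refine command (please correct me if that mapping is wrong), i.e. the failing case looks like the command above plus that flag:

srun relion_refine_mpi ... --scratch_dir /scratch --no_parallel_disc_io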

biochem-fan commented 3 years ago

Is this running over multiple nodes?

daniel-s-d-larsson commented 3 years ago

No, I run on a single node. These are the specs of the node:

Intel Xeon Gold 6132 (28 threads)
2 x NVidia V100
192 GB RAM
Infiniband network-attached storage
166.72 GB SSD scratch (way too wimpy for cryo-EM needs...!)

I can fit 160k particles on the scratch and have to access the rest over the network.
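A rough back-of-envelope for why it is about 160k (assuming 512-pixel boxes, as the mask filename in the job suggests, stored as 32-bit floats, and ignoring header overhead):

# Rough estimate of the per-particle footprint on scratch. The 512-pixel box
# size is an assumption based on the 512px mask used in this job.
box = 512
bytes_per_particle = box * box * 4          # = 1 MiB per particle image
n_copied = 160481
total_gib = n_copied * bytes_per_particle / 2**30
print(round(total_gib, 2), "GiB")           # about 156.7 GiB

Together with the ~10 Gb that RELION keeps free on scratch by default (if I remember the default correctly), that accounts almost exactly for the reported 166.72 Gb, so the cutoff at 160481 particles makes sense.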

biochem-fan commented 3 years ago

We do plan to refactor the scratch system as it has many problems (e.g. #494).

I am afraid I might not be able to fix your problem until that refactoring, as I cannot reproduce this locally and it only affects a narrow set of use cases (a huge dataset with a wimpy SSD).

daniel-s-d-larsson commented 3 years ago

That is understandable. I wanted to report back my findings in case they help others with the same problem.