3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

particle subtraction crashes #867

Closed · diffracteD closed this issue 2 years ago

diffracteD commented 2 years ago

Hi,

I'm running a signal-subtraction job in RELION 3.1, using the optimiser.star file from a multi-body refinement and the mask from the same job. The command RELION generates is: `which relion_particle_subtract_mpi --i MultiBody/job139/run_it011_optimiser.star --mask 3body/a1cMask.mrc --o Subtract/job149/ --recenter_on_mask --pipeline_control Subtract/job149/`

The job runs until it has read the optimiser file and then crashes. error.log contains:

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 44 with PID 118818 on node rome0114 exited on signal 9 (Killed).
--------------------------------------------------------------------------
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=2401268.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

I'm running with 50 MPI processes on the cluster and 5800 MB of memory per CPU. Please advise if I'm doing something wrong. Thanks.
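
For reference, a submission matching these settings might look roughly like the sketch below, assuming SLURM with one CPU per MPI rank (the job name is illustrative, not from the original report):

```bash
#!/bin/bash
# Hypothetical SLURM script reproducing the reported setup:
# 50 MPI ranks, 5800 MB of memory per CPU.
#SBATCH --job-name=subtract      # illustrative name
#SBATCH --ntasks=50              # 50 MPI processes
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=5800M      # the limit the cgroup OOM killer enforces

mpirun -n 50 $(which relion_particle_subtract_mpi) \
    --i MultiBody/job139/run_it011_optimiser.star \
    --mask 3body/a1cMask.mrc \
    --o Subtract/job149/ --recenter_on_mask \
    --pipeline_control Subtract/job149/
```

Under directives like these, a rank that outgrows its memory share is killed with signal 9, which matches the oom-kill event in the log above.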

biochem-fan commented 2 years ago

Depending on the number of bodies and the box size, 5.8 GB/process might not be enough. What if you run fewer processes with more memory/process? What if you run the non-MPI version?
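
For the non-MPI suggestion, a minimal sketch assuming the same inputs as above: drop the `_mpi` suffix and run without `mpirun`.

```bash
# Single-process run: one large address space instead of many small ones.
relion_particle_subtract \
    --i MultiBody/job139/run_it011_optimiser.star \
    --mask 3body/a1cMask.mrc \
    --o Subtract/job149/ --recenter_on_mask \
    --pipeline_control Subtract/job149/
```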

diffracteD commented 2 years ago

I tried 1 MPI process and 7800 MB of memory. It still crashes with the same error.

biochem-fan commented 2 years ago

Why do you request so little memory? Can you try assigning all the RAM on the node?
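
One way to do that under SLURM, as a sketch (assuming the scheduler permits exclusive allocations; `--mem=0` requests all memory on the node):

```bash
#SBATCH --exclusive   # take the whole node
#SBATCH --mem=0       # request all of the node's memory
```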

diffracteD commented 2 years ago

Finally, requesting more RAM (192 GB) and using 2 MPI processes worked. I was working on a cluster that I don't administer, so it took a while to figure out the upper limit of RAM available. Thank you so much for the helpful advice. Much appreciated.
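
For completeness, a sketch of SLURM directives matching the configuration that worked, assuming both ranks land on one node (directive spellings vary by site):

```bash
#SBATCH --ntasks=2   # 2 MPI processes
#SBATCH --mem=192G   # total memory on the node for this job

mpirun -n 2 $(which relion_particle_subtract_mpi) \
    --i MultiBody/job139/run_it011_optimiser.star \
    --mask 3body/a1cMask.mrc \
    --o Subtract/job149/ --recenter_on_mask \
    --pipeline_control Subtract/job149/
```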