3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

2D classification job not starting on multi-GPU node #282

Closed chancie closed 5 years ago

chancie commented 7 years ago

I am aware that similar issues have been reported in the past, but without any resolution, so I am submitting this as a new issue. I am currently struggling to get my 2D classification jobs running. The input has ~2 million particles.

I am running my jobs with Relion-2.1-beta-1. The node has 32 CPUs with 4 1080s. The command that I have submitted is as follows:

`which relion_refine_mpi` --o Class2D/job051/run --i ./Extract/job037/particles.star --dont_combine_weights_via_disc --no_parallel_disc_io --pool 50 --ctf --ctf_intact_first_peak --iter 25 --write_subsets 1 --subset_size 25000 --max_subsets 5 --tau2_fudge 2 --particle_diameter 150 --K 100 --flatten_solvent --zero_mask --strict_highres_exp 12 --oversampling 1 --psi_step 12 --offset_range 5 --offset_step 2 --norm --scale --j 1 --dont_check_norm --maxsig 50

It has now been stuck for 5 hours at the "estimating initial noise spectra" step (see below):

=== RELION MPI setup ===

I have submitted a similar job to a workstation with 16 CPUs and a single 1080, and the job starts immediately with the following command:

`which relion_refine` --o Class2D/job052/run --i ./Extract/job037/particles.star --dont_combine_weights_via_disc --no_parallel_disc_io --pool 50 --ctf --ctf_intact_first_peak --iter 25 --tau2_fudge 2 --particle_diameter 150 --K 100 --flatten_solvent --zero_mask --strict_highres_exp 12 --oversampling 1 --psi_step 12 --offset_range 5 --offset_step 2 --norm --scale --j 2 --gpu "" --dont_check_norm --maxsig 20

The gpu-ids not specified, threads will automatically be mapped to devices (incrementally).
Thread 0 mapped to device 0
Thread 1 mapped to device 0
Running CPU instructions in double precision.

Any idea what the issue could be? Thank you in advance for helping out.

bforsbe commented 7 years ago

Using 5 MPI ranks might cause high RAM use; that, combined with syncing between the MPI ranks, might cause some serious delay at some point. Try using 4 GPUs without MPI. You can just run relion_refine with --j 8; all 4 GPUs will still be used, in your case each one using 2 threads.
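
To make the "each one using 2 threads" arithmetic concrete, here is a rough sketch. The helper `map_threads` is hypothetical and not part of RELION. It assumes threads are spread over devices in even blocks (device = devices * thread / threads), which agrees with the "Thread N mapped to device M" lines in the log above but is not confirmed by anything in this thread. The commented-out command is the workstation command from the original post with only --j changed, and job053 is a made-up output directory.

```shell
#!/bin/sh
# Hypothetical helper (not a RELION tool): print an even, block-wise
# thread-to-device assignment. ASSUMPTION: device = devices * thread / threads.
map_threads() {
    threads=$1
    devices=$2
    i=0
    while [ "$i" -lt "$threads" ]; do
        echo "Thread $i mapped to device $(( devices * i / threads ))"
        i=$(( i + 1 ))
    done
}

# 8 threads over 4 GPUs: two threads per device under the assumption above.
map_threads 8 4

# Non-MPI run along the lines suggested (not executed here; job053 is a
# placeholder output directory):
# `which relion_refine` --o Class2D/job053/run --i ./Extract/job037/particles.star \
#     --dont_combine_weights_via_disc --no_parallel_disc_io --pool 50 --ctf \
#     --ctf_intact_first_peak --iter 25 --tau2_fudge 2 --particle_diameter 150 \
#     --K 100 --flatten_solvent --zero_mask --strict_highres_exp 12 \
#     --oversampling 1 --psi_step 12 --offset_range 5 --offset_step 2 --norm \
#     --scale --j 8 --gpu "" --dont_check_norm --maxsig 20
```

With `map_threads 2 1` the same helper reproduces the two lines from the workstation log (both threads on device 0), which is the only check against real output available in this thread.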

biochem-fan commented 5 years ago

No response for a long time. Closing.