Closed: kellogg-cryoem closed this issue 4 years ago
This is all working as expected. What did you expect in each case?
Maybe I didn't explain it clearly enough. In all cases the MPI ranks should be evenly distributed across the available GPU devices. In the FIRST case, all slaves get assigned to the first device (0), even though I launched more than one MPI process (-np 5) and specified the list of GPU devices to use (0-3). The first case is the one that violates expectations; cases 2 and 3 behave normally.
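Concretely, what I expected in the first case (assuming rank 0 is the master and only the four slave ranks take a GPU) was one slave per device, roughly:

Thread 0 on slave 1 mapped to device 0
Thread 0 on slave 2 mapped to device 1
Thread 0 on slave 3 mapped to device 2
Thread 0 on slave 4 mapped to device 3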
Are you aware of the difference between comma and colon?
Provide a list of which GPUs (0,1,2,3, etc) to use. MPI-processes are separated by ':'. For example, to place one rank on device 0 and one rank on device 1, provide '0:1'.
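For this job (-np 5, i.e. one master plus four slave ranks), the even split would therefore be requested with colons rather than commas, something like the following (the refinement options are unchanged, only the --gpu argument differs):

/usr/local/bin/mpirun -np 5 `which relion_refine_mpi` [same refinement options] --gpu 0:1:2:3

which should map slave 1 to device 0, slave 2 to device 1, slave 3 to device 2, and slave 4 to device 3.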
-_- Thank you, Takanori. It works; my mistake.
I've encountered a very weird bug in RELION; my version is 3.0.5.
/usr/local/bin/mpirun -np 5 `which relion_refine_mpi` --o Refine3D/job023/run --auto_refine --split_random_halves --i smalltest.star --ref cryosparc_P23_J87_005_volume_map.mrc --ini_high 5 --dont_combine_weights_via_disc --no_parallel_disc_io --pool 3 --pad 2 --ctf --ctf_corrected_ref --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --pipeline_control Refine3D/job023/ --dont_check_norm --gpu 0,1,2,3

Slave 1 will distribute threads over devices 0 1 2 3
Thread 0 on slave 1 mapped to device 0
Slave 2 will distribute threads over devices 0 1 2 3
Thread 0 on slave 2 mapped to device 0
Slave 3 will distribute threads over devices 0 1 2 3
Thread 0 on slave 3 mapped to device 0
Slave 4 will distribute threads over devices 0 1 2 3
Thread 0 on slave 4 mapped to device 0
Device 0 on cbsukellogg.biohpc.cornell.edu is split between 4 slaves
/usr/local/bin/mpirun -np 5 `which relion_refine_mpi` --o Refine3D/job023/run --auto_refine --split_random_halves --i smalltest.star --ref cryosparc_P23_J87_005_volume_map.mrc --ini_high 5 --dont_combine_weights_via_disc --no_parallel_disc_io --pool 3 --pad 2 --ctf --ctf_corrected_ref --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --pipeline_control Refine3D/job023/ --dont_check_norm --gpu

GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on slave 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on slave 2 mapped to device 2
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on slave 3 mapped to device 4
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on slave 4 mapped to device 6
^C
Running CPU instructions in double precision.
/usr/local/bin/mpirun -np 5 `which relion_refine_mpi` --o Refine3D/job023/run --auto_refine --split_random_halves --i smalltest.star --ref cryosparc_P23_J87_005_volume_map.mrc --ini_high 5 --dont_combine_weights_via_disc --no_parallel_disc_io --pool 3 --pad 2 --ctf --ctf_corrected_ref --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --pipeline_control Refine3D/job023/ --dont_check_norm --gpu 0,1,2,3 --j 5

Slave 1 will distribute threads over devices 0 1 2 3
Thread 0 on slave 1 mapped to device 0
Thread 1 on slave 1 mapped to device 1
Thread 2 on slave 1 mapped to device 2
Thread 3 on slave 1 mapped to device 3
Thread 4 on slave 1 mapped to device 0
Slave 2 will distribute threads over devices 0 1 2 3
Thread 0 on slave 2 mapped to device 0
Thread 1 on slave 2 mapped to device 1
Thread 2 on slave 2 mapped to device 2
Thread 3 on slave 2 mapped to device 3
Thread 4 on slave 2 mapped to device 0
Slave 3 will distribute threads over devices 0 1 2 3
Thread 0 on slave 3 mapped to device 0
Thread 1 on slave 3 mapped to device 1
Thread 2 on slave 3 mapped to device 2
Thread 3 on slave 3 mapped to device 3
Thread 4 on slave 3 mapped to device 0
Slave 4 will distribute threads over devices 0 1 2 3
Thread 0 on slave 4 mapped to device 0
Thread 1 on slave 4 mapped to device 1
Thread 2 on slave 4 mapped to device 2
Thread 3 on slave 4 mapped to device 3
Thread 4 on slave 4 mapped to device 0
I can't figure out whether this bug has already been reported or addressed.