3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0
444 stars 197 forks source link

Blush may fallback to CPU-only when running a refinement on a single GPU #1033

Open DanGonite57 opened 10 months ago

DanGonite57 commented 10 months ago

Describe your problem

When running a refinement with only a single GPU available, Blush will reconstruct half2 with the GPU locked, while simultaneously waiting for a GPU lock in order to reconstruct half1. If reconstruction of half2 takes too long (> ~100 seconds), locking the GPU for half1 will timeout and cause reconstruction of half1 to be run on the CPU instead, thus causing the maximization step to run on the order of hours rather than minutes.

This may not be so much a bug report as more of a point of interest in case anyone else comes across this behaviour. I have circumvented the issue for my own use-case (https://github.com/DanGonite57/relion-blush/commit/fea1d386b1839a2897c5294a47bc198824d638d1), but this "solution" may not be suitable for everyone, as I am unclear whether the device lockfile can be influenced by anything other than Blush.

Dataset (merged):

Example external_reconstruct.logs

> cat Refine3D/job4458/run_it001_half2_class001_external_reconstruct.log
2023-11-15 22:04:17.724 | INFO     | relion_blush.command_line:main:271 - ARGUMENTS: Namespace(star_file='Refine3D/job4458/run_it001_half2_class001_external_reconstruct.star', model_name='v1.0', strides=20, batch_size=1, gpu='2,', debug=False, skip_spectral_trailing=False)
2023-11-15 22:04:17.867 | INFO     | relion_blush.command_line:main:283 - Loading model time 0.14 s
2023-11-15 22:04:18.491 | INFO     | relion_blush.command_line:main:315 - Selected device: cuda:2
2023-11-15 22:04:39.555 | INFO     | relion_blush.command_line:refine3d:44 - Resample time 19.86 s
2023-11-15 22:05:00.104 | INFO     | relion_blush.command_line:refine3d:58 - Volume preprocess time 20.55 s
2023-11-15 22:05:31.337 | INFO     | relion_blush.command_line:refine3d:75 - Radial masks time 31.23 s
2023-11-15 22:07:34.026 | INFO     | relion_blush.command_line:refine3d:88 - Running model time 122.69 s
2023-11-15 22:07:48.774 | INFO     | relion_blush.command_line:refine3d:98 - Running model time 14.75 s
2023-11-15 22:07:48.774 | INFO     | relion_blush.command_line:refine3d:135 - Applying spectral trailing
2023-11-15 22:08:01.872 | INFO     | relion_blush.command_line:refine3d:139 - Max denoised spectral index: 69
2023-11-15 22:08:01.873 | INFO     | relion_blush.command_line:refine3d:142 - Max denoised resolution: 13.03
2023-11-15 22:09:05.381 | INFO     | relion_blush.command_line:refine3d:151 - Ouput to file Refine3D/job4458/run_it001_half2_class001_external_reconstruct.mrc
> cat Refine3D/job4458/run_it001_half1_class001_external_reconstruct.log
2023-11-15 22:04:32.133 | INFO     | relion_blush.command_line:main:271 - ARGUMENTS: Namespace(star_file='Refine3D/job4458/run_it001_half1_class001_external_reconstruct.star', model_name='v1.0', strides=20, batch_size=1, gpu='2,', debug=False, skip_spectral_trailing=False)
2023-11-15 22:04:32.275 | INFO     | relion_blush.command_line:main:283 - Loading model time 0.14 s
2023-11-15 22:06:16.287 | INFO     | relion_blush.command_line:main:315 - Selected device: cpu
2023-11-15 22:06:36.450 | INFO     | relion_blush.command_line:refine3d:44 - Resample time 20.16 s
2023-11-15 22:06:56.566 | INFO     | relion_blush.command_line:refine3d:58 - Volume preprocess time 20.11 s
2023-11-15 22:07:27.913 | INFO     | relion_blush.command_line:refine3d:75 - Radial masks time 31.35 s
2023-11-16 00:27:22.697 | INFO     | relion_blush.command_line:refine3d:88 - Running model time 8394.78 s
2023-11-16 00:27:37.611 | INFO     | relion_blush.command_line:refine3d:98 - Running model time 14.91 s
2023-11-16 00:27:37.618 | INFO     | relion_blush.command_line:refine3d:135 - Applying spectral trailing
2023-11-16 00:27:51.616 | INFO     | relion_blush.command_line:refine3d:139 - Max denoised spectral index: 69
2023-11-16 00:27:51.616 | INFO     | relion_blush.command_line:refine3d:142 - Max denoised resolution: 13.03
2023-11-16 00:29:00.230 | INFO     | relion_blush.command_line:refine3d:151 - Ouput to file Refine3D/job4458/run_it001_half1_class001_external_reconstruct.mrc
DanGonite57 commented 3 months ago

I would just like to note that this also occurs, and is somewhat worse, when running 3D classification. Blush is run on each class, which means that CPU-fallback has a chance of occurring if the number of classes exceeds the number of GPUs assigned to the job, and can cause it to occur even for smaller box sizes and multiple GPUs when assigning multiple classes per GPU, as the last class(es) will fallback if blush is not completed on ALL the preceding classes before the timeout.

biochem-fan commented 3 months ago

@dkimanius Can we increase the number of maximum trials and/or the time out? Does it have any side effects?

dkimanius commented 3 months ago

@DanGonite57 Thanks for reporting this. I've increased the default timeout from 100 sec to 20 mins. You can also now set it manually through the environmental variable RELION_BLUSH_ARGS. In bash, for this particular setting you'd run:

export RELION_BLUSH_ARGS="--device_timeout <new-value>"

Just make sure to update to the latest relion-blush commit.

@biochem-fan There are no major practical issues that I can think of at this stage. During development/alpha testing I wanted it to crash faster, if it was going to, rather than hold on to resrouces longer.