Open DanGonite57 opened 10 months ago
I would just like to note that this also occurs, and is somewhat worse, when running 3D classification. Blush is run on each class, so CPU fallback can occur whenever the number of classes exceeds the number of GPUs assigned to the job. It can also occur with smaller box sizes and multiple GPUs when several classes are assigned per GPU, as the last class(es) will fall back if Blush has not completed on ALL the preceding classes before the timeout.
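To make the arithmetic concrete, here is a rough Python sketch of which classes would hit the timeout. It is not taken from relion-blush: the function name, the round-robin assignment of classes to GPUs, the equal per-class run time, and the assumption that all classes start waiting at the same moment are illustrative simplifications.

```python
def classes_falling_back(n_classes, n_gpus, seconds_per_class, timeout):
    """Return the indices of classes expected to fall back to the CPU.

    Hypothetical model: class k shares a GPU with the classes k - n_gpus,
    k - 2 * n_gpus, ... and must wait for all of them to finish first.
    """
    fallback = []
    for k in range(n_classes):
        preceding = k // n_gpus              # classes queued ahead on the same GPU
        wait = preceding * seconds_per_class # time spent waiting for the lock
        if wait > timeout:
            fallback.append(k)
    return fallback


# 6 classes on 2 GPUs, 60 s of Blush per class, old 100 s timeout
print(classes_falling_back(6, 2, 60, 100))
# Same job with a 20 minute timeout
print(classes_falling_back(6, 2, 60, 1200))
```

Under these assumptions only the classes at the back of each GPU's queue are affected, which matches the observation that the last class(es) are the ones that fall back.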
@dkimanius Can we increase the maximum number of trials and/or the timeout? Would that have any side effects?
@DanGonite57 Thanks for reporting this. I've increased the default timeout from 100 seconds to 20 minutes. You can also now set it manually through the environment variable RELION_BLUSH_ARGS. In bash, for this particular setting you'd run:
export RELION_BLUSH_ARGS="--device_timeout <new-value>"
Just make sure to update to the latest relion-blush commit.
@biochem-fan There are no major practical issues that I can think of at this stage. During development/alpha testing I wanted it to crash faster, if it was going to, rather than hold on to resources for longer.
Describe your problem
When running a refinement with only a single GPU available, Blush will reconstruct half2 with the GPU locked, while simultaneously waiting for a GPU lock in order to reconstruct half1. If reconstruction of half2 takes too long (> ~100 seconds), the attempt to lock the GPU for half1 will time out and reconstruction of half1 will be run on the CPU instead, so the maximization step takes on the order of hours rather than minutes.
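The behaviour described above can be reproduced in miniature with a plain Python lock. This is only a sketch of the pattern, not the relion-blush code: the function names are hypothetical, `threading.Lock` stands in for the device lockfile, and the durations are scaled down from minutes to fractions of a second.

```python
import threading
import time


def reconstruct(name, gpu_lock, work_seconds, lock_timeout, results):
    # Wait for the (stand-in) device lock, giving up after lock_timeout seconds.
    if gpu_lock.acquire(timeout=lock_timeout):
        try:
            time.sleep(work_seconds)  # pretend to run on the GPU
            results[name] = "gpu"
        finally:
            gpu_lock.release()
    else:
        results[name] = "cpu"         # lock wait timed out: CPU fallback


def run_two_halves(work_seconds, lock_timeout):
    gpu_lock = threading.Lock()       # a single GPU shared by both halves
    results = {}
    args = (gpu_lock, work_seconds, lock_timeout, results)
    half2 = threading.Thread(target=reconstruct, args=("half2",) + args)
    half1 = threading.Thread(target=reconstruct, args=("half1",) + args)
    half2.start()
    time.sleep(0.05)                  # let half2 take the lock first
    half1.start()
    half2.join()
    half1.join()
    return results


# half2 holds the GPU longer than half1 is willing to wait
print(run_two_halves(work_seconds=0.3, lock_timeout=0.1))
# a longer timeout lets half1 wait for the GPU instead
print(run_two_halves(work_seconds=0.1, lock_timeout=1.0))
```

In the first call half1 gives up waiting and is marked as a CPU run; in the second the timeout is long enough for half1 to get the GPU once half2 releases it.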
This may not be so much a bug report as more of a point of interest in case anyone else comes across this behaviour. I have circumvented the issue for my own use-case (https://github.com/DanGonite57/relion-blush/commit/fea1d386b1839a2897c5294a47bc198824d638d1), but this "solution" may not be suitable for everyone, as I am unclear whether the device lockfile can be influenced by anything other than Blush.
Dataset (merged):
Example external_reconstruct.logs