h-ehara opened 11 months ago
I don't expect much benefit for no-align Class3D, but if you want, you can hard-code your GPU ID in the following code. Change this to `blush_args = blush_args + " --gpu 0";`, for example.
@scheres @dkimanius Do you think this use case is worth supporting? We would need to split the role of the `do_gpu` flag into two: one for alignment and the other for external reconstruction.
I don't think it's good to default to GPU 0 as it might be used by some other process.
Here's an alternative suggestion:
@dkimanius Yes, `--gpu 0` is a hack for @h-ehara to run the job. If we decide to support Blush in no-align Class3D, we should do it more properly, as you suggested. The real question is whether it is worth doing. No alignment means a smaller chance of overfitting, so I guess the benefit is probably smaller.
No alignment means smaller chance of overfitting, but we often use higher T-values with skip-align classifications in order to resolve subtle differences - in this case maybe there is some benefit?
@h-ehara @olibclarke, please give us your feedback on whether Blush was useful for no-align Class3D.
Will do - running jobs right now with/without blush to test.
@biochem-fan @dkimanius I can't say yet whether it is going to be useful downstream, still running tests, but what I will say is that the resulting classification volumes are a lot cleaner, less noisy and therefore easier to interpret with blush regularization enabled (at matched values of T).
Dari, how can one run a single execution of Blush on a map? One would need to compare that with the blushed Class3D.
@olibclarke
the resulting classification volumes are a lot cleaner, less noisy and therefore easier to interpret with blush regularization enabled (at matched values of T)
Note that Class3D outputs Blush-applied maps, while Refine3D's half maps are always without Blush.
It looks like the results are at least very different between +blush/-blush, but maybe it is too early to say whether that is good or bad (`--skip_align --tau2_fudge 6 --strict_highres_exp 8`).
In conventional Class3D, particles are sometimes classified by quality (good class, intermediate class, bad class), but I feel that with Blush, particles are more likely to be classified by shape, even without alignment. But we can already tweak T, the resolution cutoff, the number of cycles, etc., so it could be hard to tell how useful it would be (too many variables to test them all).
Yes, the results are very different in terms of class distribution in matched runs, so it is clearly making a difference - for some cases that could be good, in other cases perhaps not, but I think it would be a useful option to have. Certainly the volumes are easier to interpret, and intuitively I would think blush regularization might allow for better results with smaller masks?
We came across a bug yesterday related to the same section of code in ml_optimiser.cpp (lines 2533-2541). When a user submits a Class3D job through Slurm with GPU acceleration enabled (but no GPUs specified) we get a traceback from attempting to run the following command:
Command: relion_python_blush Class3D/job146/run_it001_class001_external_reconstruct.star --gpu ,
Apparently, `gpu_ids` is an empty string, so `blush_args += gpu_ids + ",";` on line 2537 adds a single comma after `--gpu` instead of the expected comma-separated list of GPUs.
@stevew-Purdue
Slurm with GPU acceleration enabled
Do you mean your SLURM hides non-requested GPUs by GRES and cgroups?
(but no GPUs specified)
This means the user did not request a GPU from SLURM. In that case, the user has to tell RELION not to use a GPU by setting "Use GPU acceleration" to "No" in the "Compute" tab.
It's been a while since I looked at this closely but I believe we have Slurm set up to use the CUDA_VISIBLE_DEVICES environment variable to control what is available to a Slurm job. The user specifies in the RELION interface that they want to use GPU acceleration and it's not required to enter a value for the "Which GPUs to use" field. The help text for that field states, "This argument is not necessary. If left empty, the job itself will try to allocate available GPU resources."
We use an "extra" parameter (RELION_QSUB_EXTRA3, in our case) in the "Running" tab that allows the user to choose how many GPUs to use (defaults to 1). That value is supplied to Slurm which then allocates GPUs to the job and controls access to them through CUDA_VISIBLE_DEVICES.
@stevew-Purdue What you described makes sense when a user wants to use a GPU. When a user does NOT want to use a GPU, "Use GPU acceleration" must be set to "No".
I must not be communicating the situation very well... The user does want to use GPU acceleration, but they do not select specific GPUs in the "Which GPUs to use" field; they intentionally leave that field blank. That's what I meant by "(but no GPUs specified)". I should have been clearer about that; sorry!
@stevew-Purdue Ah, I see what you mean.
I guess this should be:

```cpp
for (auto &d : gpuDevices)
    blush_args += d + ",";
```
That being said, Blush runs fine with `--gpu=,` on my computer. Is your problem really caused by this argument? Can you show me the full traceback?
Here's the traceback:
```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/apps/miniconda38/envs/relion-5.0/lib/python3.10/site-packages/relion_blush/command_line.py", line 319, in main
    class3d(
  File "/apps/miniconda38/envs/relion-5.0/lib/python3.10/site-packages/relion_blush/command_line.py", line 189, in class3d
    denoised_nv, _ = apply_model(
  File "/apps/miniconda38/envs/relion-5.0/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/apps/miniconda38/envs/relion-5.0/lib/python3.10/site-packages/relion_blush/util.py", line 185, in apply_model
    in_std = get_std_layer(volume)
  File "/apps/miniconda38/envs/relion-5.0/lib/python3.10/site-packages/relion_blush/util.py", line 88, in get_std_layer
    grid = get_local_std_torch(grid.unsqueeze(0), size=10)[0]
  File "/apps/miniconda38/envs/relion-5.0/lib/python3.10/site-packages/relion_blush/util.py", line 79, in get_local_std_torch
    grid = torch.nn.functional.conv3d(grid, kernel, padding='same')
RuntimeError: GET was unable to find an engine to execute this computation
```
Something went wrong in the external Python call...
Command: relion_python_blush Class3D/job146/run_it001_class001_external_reconstruct.star --gpu ,
And the contents of note.txt:
```
++++ Executing new job on Thu Nov 9 12:54:12 2023
++++ with the following command(s):
`which relion_refine_mpi` --o Class3D/job146/run --i Select/job140/particles.star --ref Class3D/job117/run_it025_class002.mrc --firstiter_cc --ini_high 10 --dont_combine_weights_via_disc --pool 30 --pad 2 --ctf --iter 25 --tau2_fudge 50 --particle_diameter 320 --blush --K 5 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 3 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --helix --helical_outer_diameter 150 --helical_nr_asu 3 --helical_twist_initial 4.61 --helical_rise_initial 4.75 --helical_z_percentage 0.3 --helical_keep_tilt_prior_fixed --sigma_tilt 1.66667 --sigma_psi 3.33333 --sigma_rot 0 --j 6 --gpu "" --pipeline_control Class3D/job146/
++++
```
It turns out that the primary problem was a CUDA / Torch version mismatch. I changed the CUDA version from 11.1 to 11.8 to match the RELION Torch version (2.0.1).
Just noting here that there is now at least one example where BLUSH plus no-align classification has proven useful:
https://x.com/BJ_Greber/status/1758778828292247617?s=20
https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkae107/7609228?login=false
Hi all,
Has anybody tested the following work-around for running a Class3D job without alignment but with Blush?
As described here by @dkimanius, one can directly pass arguments to Blush using the environment variable `RELION_BLUSH_ARGS="--gpu <gpu-id>"`.
I can't test it at the moment (at least with the currently installed beta3, commit ad0c1f2, it doesn't work), but I'll test it once our IT updates the installation. If somebody is running the most recent commit, it would be good to know whether this environment variable works. Since it wouldn't require patching the source code and re-compiling, that would be quite good.
As @olibclarke mentioned, we also use classification without alignment quite often; in our experience it works very well to clean noisy datasets.
Best, Benedikt
RELION will now pass the GPU settings on to the Blush external call even during skip-align, as of commit c818a0313d4895f82363485d83fb94791cba3ee2. Please let me know how it works.
Describe your problem
When Blush is combined with no-align Class3D, it seems that we cannot use the GPU for Blush, so Blush is very slow. (Or could it be that Blush with no-align Class3D is not recommended or useful?)
Regards,