3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Blush with no-align Class3D #1020

Open h-ehara opened 11 months ago

h-ehara commented 11 months ago

Describe your problem

When Blush is combined with no-align Class3D, it seems that the GPU cannot be used for Blush, so Blush is very slow. (Or could it be that Blush with no-align Class3D is not recommended or useful?)

Regards,

biochem-fan commented 11 months ago

I don't expect much benefit for no-align Class3D but if you want, you can hard-code your GPU ID in the following code.

https://github.com/3dem/relion/blob/635b59196f7b7cd37d7a9f95b65a880663a51bb5/src/ml_optimiser.cpp#L2533-L2541

Change this to blush_args = blush_args + " --gpu 0"; for example.

@scheres @dkimanius Do you think this use-case is worth supporting? We need to split the role of the do_gpu flag into two: one for alignment and the other for external reconstruction.
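
To make the proposed split concrete, here is a minimal sketch of separating the single GPU decision into two roles. The `GpuRoles` struct and `split_gpu_roles` function are hypothetical names for illustration only; they are not part of RELION, and the real change may look quite different.

```cpp
#include <cassert>

// Hypothetical illustration only: these names do not exist in RELION.
struct GpuRoles
{
    bool for_alignment;            // GPU used for the alignment/expectation step
    bool for_external_reconstruct; // GPU passed on to the external call (e.g. Blush)
};

// Decide each role separately instead of relying on one shared flag.
GpuRoles split_gpu_roles(bool gpu_requested, bool skip_align)
{
    GpuRoles roles;
    // Without alignment there is no alignment work to accelerate.
    roles.for_alignment = gpu_requested && !skip_align;
    // The external reconstruction can still use a GPU if one was requested.
    roles.for_external_reconstruct = gpu_requested;
    return roles;
}
```

With this split, a skip-align job that requested a GPU would still hand GPU settings to the external reconstruction step, while a job that requested no GPU would use none for either role.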

dkimanius commented 11 months ago

I don't think it's good to default to GPU 0 as it might be used by some other process.

Here's an alternative suggestion:

biochem-fan commented 11 months ago

@dkimanius Yes, gpu 0 is a hack for @h-ehara to run the job. If we decide to support Blush in no-align Class3D, we should do it more properly as you suggested. The real question is whether it is worth doing. No-alignment means smaller chance of overfitting, so I guess the benefit is probably smaller.

olibclarke commented 11 months ago

No alignment means smaller chance of overfitting, but we often use higher T-values with skip-align classifications in order to resolve subtle differences - in this case maybe there is some benefit?

biochem-fan commented 11 months ago

in this case maybe there is some benefit?

@h-ehara @olibclarke, please give us your feedback whether Blush was useful for no-align Class3D.

olibclarke commented 11 months ago

Will do - running jobs right now with/without blush to test.

olibclarke commented 11 months ago

@biochem-fan @dkimanius I can't say yet whether it is going to be useful downstream, still running tests, but what I will say is that the resulting classification volumes are a lot cleaner, less noisy and therefore easier to interpret with blush regularization enabled (at matched values of T).

scheres commented 11 months ago

Dari, how can one run a single execution of blush on a map? One would need to compare that with the blushed Class3D.

biochem-fan commented 11 months ago

@olibclarke

the resulting classification volumes are a lot cleaner, less noisy and therefore easier to interpret with blush regularization enabled (at matched values of T)

Note that Class3D outputs Blush-applied maps, while Refine3D's half maps are always without Blush.

h-ehara commented 11 months ago

It looks like results are at least very different between +blush/-blush, but maybe it is too early to say if it is good or bad. (skip_align tau2_fudge 6 strict_highres_exp 8)

In conventional Class3D, particles are sometimes classified by quality (good class, intermediate class, bad class), but I feel that, with Blush, particles are more likely to be classified by shape, even without alignment. However, we can already tweak T, the resolution cutoff, the number of cycles, etc., so it could be hard to tell how useful it would be (there are too many variables to test them all).

olibclarke commented 11 months ago

Yes, the results are very different in terms of class distribution in matched runs, so it is clearly making a difference - for some cases that could be good, in other cases perhaps not, but I think it would be a useful option to have. Certainly the volumes are easier to interpret, and intuitively I would think blush regularization might allow for better results with smaller masks?

stevew-Purdue commented 11 months ago

We came across a bug yesterday related to the same section of code in ml_optimiser.cpp (lines 2533-2541). When a user submits a Class3D job through Slurm with GPU acceleration enabled (but no GPUs specified) we get a traceback from attempting to run the following command: Command: relion_python_blush Class3D/job146/run_it001_class001_external_reconstruct.star --gpu ,

Apparently, gpu_ids is an empty string and so blush_args += gpu_ids + ","; in line 2537 adds a single comma after the --gpu instead of the expected comma-separated list of GPUs.
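
The effect described above can be reproduced with a few lines of string handling. This is a sketch reconstructed from the description in this comment, not the actual RELION source; both function names are hypothetical, and whether dropping the flag is the right response to an empty ID string is an assumption made only for illustration.

```cpp
#include <cassert>
#include <string>

// Reconstruction of the described behaviour (not the real RELION code):
// the ID string is appended followed by a comma, whatever it contains.
std::string gpu_flag_unguarded(const std::string &gpu_ids)
{
    std::string blush_args = " --gpu ";
    blush_args += gpu_ids + ",";
    return blush_args;
}

// Hypothetical guarded variant: emit no flag at all when no IDs were given.
std::string gpu_flag_guarded(const std::string &gpu_ids)
{
    if (gpu_ids.empty())
        return "";
    return " --gpu " + gpu_ids + ",";
}
```

With an empty `gpu_ids`, the unguarded version yields ` --gpu ,`, which matches the command shown in the error message.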

biochem-fan commented 11 months ago

@stevew-Purdue

Slurm with GPU acceleration enabled

Do you mean your SLURM hides non-requested GPUs by GRES and cgroups?

(but no GPUs specified)

This means the user did not request a GPU from SLURM. In that case, the user has to tell RELION not to use a GPU by setting Use GPU acceleration to No in the Compute tab.

stevew-Purdue commented 11 months ago

It's been a while since I looked at this closely but I believe we have Slurm set up to use the CUDA_VISIBLE_DEVICES environment variable to control what is available to a Slurm job. The user specifies in the RELION interface that they want to use GPU acceleration and it's not required to enter a value for the "Which GPUs to use" field. The help text for that field states, "This argument is not necessary. If left empty, the job itself will try to allocate available GPU resources."

We use an "extra" parameter (RELION_QSUB_EXTRA3, in our case) in the "Running" tab that allows the user to choose how many GPUs to use (defaults to 1). That value is supplied to Slurm which then allocates GPUs to the job and controls access to them through CUDA_VISIBLE_DEVICES.

biochem-fan commented 11 months ago

@stevew-Purdue What you described makes sense when a user wants to use a GPU. When a user does NOT want to use a GPU, Use GPU acceleration must be No.

stevew-Purdue commented 11 months ago

I must not be communicating the situation very well... The user indeed wants to use GPU acceleration but they do not want to select which specific GPUs to use in the "Which GPUs to use" field. They intentionally leave that field blank. That's what I meant when I said "(but no GPUs specified)". I should have been more clear on that; sorry!

biochem-fan commented 11 months ago

@stevew-Purdue Ah, I see what you mean.

https://github.com/3dem/relion/blob/635b59196f7b7cd37d7a9f95b65a880663a51bb5/src/ml_optimiser.cpp#L2536-L2537

I guess this should be:

        for (auto &d: gpuDevices)
            blush_args += d + ",";

That being said, Blush runs fine with --gpu=, on my computer.

Is your problem really caused by this argument? Can you show me the full traceback?
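
A standalone sketch of building the list from the detected devices follows. It assumes the device IDs are stored as integers, which is not shown in this thread; under that assumption, adding an `int` to a string literal would be pointer arithmetic rather than concatenation, so each ID is converted with `std::to_string` first. The function name `join_gpu_ids` is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Assumption: device IDs are integers. Each ID is followed by a comma,
// mirroring the trailing-comma format seen in the command above.
std::string join_gpu_ids(const std::vector<int> &devices)
{
    std::string joined;
    for (int d : devices)
    {
        joined += std::to_string(d); // convert before concatenating
        joined += ",";
    }
    return joined;
}
```

An empty device list gives an empty string, so the caller can decide whether to pass the flag at all.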

stevew-Purdue commented 11 months ago

Here's the traceback:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/apps/miniconda38/envs/relion-5.0/lib/python3.10/site-packages/relion_blush/command_line.py", line 319, in main
    class3d(
  File "/apps/miniconda38/envs/relion-5.0/lib/python3.10/site-packages/relion_blush/command_line.py", line 189, in class3d
    denoised_nv, _ = apply_model(
  File "/apps/miniconda38/envs/relion-5.0/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/apps/miniconda38/envs/relion-5.0/lib/python3.10/site-packages/relion_blush/util.py", line 185, in apply_model
    in_std = get_std_layer(volume)
  File "/apps/miniconda38/envs/relion-5.0/lib/python3.10/site-packages/relion_blush/util.py", line 88, in get_std_layer
    grid = get_local_std_torch(grid.unsqueeze(0), size=10)[0]
  File "/apps/miniconda38/envs/relion-5.0/lib/python3.10/site-packages/relion_blush/util.py", line 79, in get_local_std_torch
    grid = torch.nn.functional.conv3d(grid, kernel, padding='same')
RuntimeError: GET was unable to find an engine to execute this computation

Something went wrong in the external Python call...
Command: relion_python_blush Class3D/job146/run_it001_class001_external_reconstruct.star  --gpu ,

And the contents of note.txt:

 ++++ Executing new job on Thu Nov  9 12:54:12 2023
 ++++ with the following command(s):
`which relion_refine_mpi` --o Class3D/job146/run --i Select/job140/particles.star --ref Class3D/job117/run_it025_class002.mrc --firstiter_cc --ini_high 10 --dont_combine_weights_via_disc --pool 30 --pad 2  --ctf --iter 25 --tau2_fudge 50 --particle_diameter 320 --blush  --K 5 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 3 --offset_range 5 --offset_step 2 --sym C1 --norm --scale  --helix --helical_outer_diameter 150 --helical_nr_asu 3 --helical_twist_initial 4.61 --helical_rise_initial 4.75 --helical_z_percentage 0.3 --helical_keep_tilt_prior_fixed --sigma_tilt 1.66667 --sigma_psi 3.33333 --sigma_rot 0 --j 6 --gpu ""  --pipeline_control Class3D/job146/
 ++++

stevew-Purdue commented 11 months ago

It turns out that the primary problem was a CUDA / Torch version mismatch. I changed the CUDA version from 11.1 to 11.8 to match the RELION Torch version (2.0.1).

olibclarke commented 8 months ago

Just noting here that there is now at least one example where BLUSH plus no-align classification has proven useful:

https://x.com/BJ_Greber/status/1758778828292247617?s=20

https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkae107/7609228?login=false

bwmr commented 4 months ago

Hi all,

Has anybody tested the following work-around for running a Class3D job without alignment but with Blush?

As described here by @dkimanius, one can directly pass arguments to Blush using the ENVAR RELION_BLUSH_ARGS="--gpu <gpu-id>".

I can't test it properly at the moment (it doesn't work with our currently installed beta3, commit ad0c1f2), but I'll test it once our IT updates. If somebody is running the most recent commit, it would be good to know whether this environment variable works. Since it wouldn't require patching the source code and recompiling, that would be quite convenient.
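
For readers unfamiliar with the mechanism, this is roughly how a program can forward extra arguments from an environment variable to an external call. It is a generic sketch using `std::getenv`; how RELION actually reads `RELION_BLUSH_ARGS` is not shown in this thread, and the helper name `append_env_args` is hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: append the value of an environment variable
// to an argument string, if the variable is set and non-empty.
std::string append_env_args(std::string args, const char *env_name)
{
    const char *extra = std::getenv(env_name);
    if (extra != nullptr && extra[0] != '\0')
    {
        args += " ";
        args += extra;
    }
    return args;
}
```

If the variable is unset, the argument string is returned unchanged, so a job behaves as before unless the user opts in.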

As @olibclarke mentioned, we also use classification without alignment quite often; in our experience it works very well to clean noisy datasets.

Best, Benedikt

dkimanius commented 4 months ago

RELION will pass the GPU settings on to the Blush external call even during skip-align, as of commit c818a0313d4895f82363485d83fb94791cba3ee2. Please let me know how it works.