3dem / DynaMight

Tool for reconstruction and analysis of continuous heterogeneity of a cryo-EM dataset

Volume too large #13

Open · yuehuang2023 opened this issue 3 weeks ago

yuehuang2023 commented 3 weeks ago

Hi, I tried to reproduce the results for EMPIAR-10073 on an A6000 GPU, with the parameters set according to the supplementary information. However, RELION reports errors. Any suggestions for solving this? Thank you.

The run log is

Initializing the particle dataset
Assigning a diameter of 512 angstrom
Number of particles: 138899
Initialized data loaders for half sets of size 62505  and  62505
consensus updates are done every  0  epochs.
box size: 380 pixel_size: 1.400011 virtual pixel_size: 0.0026246719160104987  dimension of latent space:  10
Number of used gaussians: 30000
Optimizing scale only
volume too large: change size of output volumes. (If you want the original box size for the output volumes use a bigger gpu. The size of tensor a (380) must match the size of tensor b (190) at non-singleton dimension 2
Optimizing scale only
Initializing gaussian positions from reference
100%|##########| 50/50 [00:07<00:00,  6.29it/s]
Final error: 5.322801257534593e-07
Optimizing scale only
Initializing gaussian positions from reference
100%|##########| 50/50 [00:08<00:00,  6.12it/s]
Final error: 5.322801257534593e-07
consensus gaussian models initialized
consensus model  initialization finished
mean distance in graph for half 1: 2.4982950687408447 Angstrom ;This distance is also used to construct the initial graph 
mean distance in graph for half 2: 2.4982950687408447 Angstrom ;This distance is also used to construct the initial graph 
Computing half-set indices
100%|##########| 218/218 [00:14<00:00, 15.24it/s]
setting epoch type
generating graphs
100%|#########9| 217/218 [00:32<00:00,  6.77it/s]
Index tensor must have the same number of dimensions as self tensor

The run error is

/.conda/envs/relion-5.0/lib/python3.10/site-packages/dynamight/models/decoder.py:235: UserWarning: Using a target size (torch.Size([190, 190, 190])) that is different to the input size (torch.Size([380, 380, 380])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  loss = torch.nn.functional.mse_loss(
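Both messages quoted above (the UserWarning and the "size of tensor a must match the size of tensor b" text in the run log) are PyTorch's standard behaviour when torch.nn.functional.mse_loss receives an input and a target of different, non-broadcastable shapes. A minimal sketch, not DynaMight code, with small shapes chosen purely for illustration:

import torch
import torch.nn.functional as F

# Shapes deliberately mismatched in the same 2:1 ratio as 380^3 vs 190^3 above.
pred = torch.rand(8, 8, 8)     # stands in for the full-size volume
target = torch.rand(4, 4, 4)   # stands in for the 2x-downscaled reference

try:
    # Emits the UserWarning about differing input/target sizes, then fails
    # because the shapes cannot be broadcast together.
    loss = F.mse_loss(pred, target)
except RuntimeError as e:
    print(e)  # The size of tensor a (8) must match the size of tensor b (4) at non-singleton dimension 2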
huwjenkins commented 3 days ago

Yes, a box size of 360 px is hardcoded in multiple places in the code:

https://github.com/3dem/DynaMight/blob/eef4aa673af6cc908042b38646ae489ee8f2fde9/dynamight/models/decoder.py#L227

https://github.com/3dem/DynaMight/blob/eef4aa673af6cc908042b38646ae489ee8f2fde9/dynamight/models/decoder.py#L683

https://github.com/3dem/DynaMight/blob/eef4aa673af6cc908042b38646ae489ee8f2fde9/dynamight/models/decoder.py#L861

https://github.com/3dem/DynaMight/blob/eef4aa673af6cc908042b38646ae489ee8f2fde9/dynamight/models/decoder.py#L904

I couldn't find this mentioned in the Nature Methods paper, and as @yuehuang2023 points out, one of the example datasets used a box size of 380 px. @schwabjohannes, @scheres - why is 360 px hardcoded as a limit? The message:

If you want the original box size for the output volumes use a bigger gpu

seems a bit disingenuous when 360 px appears to be a hard-coded limit?

I also encountered the same message when running on one of my datasets with a 384 px box.

scheres commented 3 days ago

Please don't call something disingenuous so carelessly. A simple look at the code shows that 360 px is not hardcoded as a limit. What is coded is an automated down-scaling in case the box size goes above 360 px. This will be triggered with the example dataset.

The error below should only be raised when an exception is encountered, presumably when you run out of GPU memory. Perhaps you can try as suggested and run on a bigger GPU?
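For reference, the automated down-scaling is a 2x average pooling of the reference volume (the same avg_pool3d call quoted in the diffs further down this thread), which is what turns a 380 px box into the 190 px volumes seen in the warnings. A minimal sketch with random data, purely for illustration (the full-size tensor needs roughly 220 MB of memory):

import torch

# Stand-in for a 380 px reference volume, as in the EMPIAR-10073 log above.
reference_volume = torch.rand(380, 380, 380)

# 2x average pooling halves each dimension: 380^3 -> 190^3.
downscaled = torch.nn.functional.avg_pool3d(
    reference_volume.unsqueeze(0).unsqueeze(0), 2).squeeze()

print(downscaled.shape)  # torch.Size([190, 190, 190])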

huwjenkins commented 3 days ago

Yes, you are correct: the message is triggered by running out of GPU memory. Sorry, I should have looked more carefully. I was running on an A40 with 48 GB, which I thought was quite a big GPU!

However, the volume will still be downscaled by 2 with a 384 px box. Should I crop the particles to 360 px?

yuehuang2023 commented 3 days ago

I used an A6000 GPU with the same configuration mentioned in the supplementary information, but this error was still raised.

huwjenkins commented 2 days ago

I got DynaMight running on an H100, and with my dataset (384 px box) I got the same errors:

box size: 384 pixel_size: 0.825 virtual pixel_size: 0.0025974025974025974  dimension of latent space:  6
Number of used gaussians: 10000
Optimizing scale only
volume too large: change size of output volumes. (If you want the original box size for the output volumes use a bigger gpu. The size of tensor a (384) must match the size of tensor b (192) at non-singleton dimension 2

and

/xxx/miniforge/envs/relion-5.0/lib/python3.10/site-packages/dynamight/models/decoder.py:235: UserWarning: Using a target size (torch.Size([192, 192, 192])) that is different to the input size (torch.Size([384, 384, 384])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  loss = torch.nn.functional.mse_loss(

As I don't have access to a bigger GPU, I made the following change:

--- decoder.py.orig 2024-11-06 09:02:03.000000000 +0000
+++ decoder.py  2024-11-06 09:02:26.000000000 +0000
@@ -224,7 +224,7 @@
         print('Optimizing scale only')
         optimizer = torch.optim.Adam(
             [self.image_smoother.A], lr=100*lr)
-        if reference_volume.shape[-1] > 360:
+        if reference_volume.shape[-1] > 384:
             reference_volume = torch.nn.functional.avg_pool3d(
                 reference_volume.unsqueeze(0).unsqueeze(0), 2)
             reference_volume = reference_volume.squeeze()

and the errors went away. I think my earlier apology was premature.

huwjenkins commented 2 days ago

The job with the modified dynamight/models/decoder.py is still running and is currently using ~21 GB of the 80 GB on the H100 GPU.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               Off | 00000000:21:00.0 Off |                    0 |
| N/A   74C    P0             221W / 310W |  21301MiB / 81559MiB |     79%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
huwjenkins commented 2 days ago

So I believe the underlying bug is the failure to update self.vol_box around here:

--- dynamight/models/decoder.py.orig    2024-11-06 09:02:03.000000000 +0000
+++ dynamight/models/decoder.py 2024-11-06 16:35:26.000000000 +0000
@@ -228,6 +228,7 @@
             reference_volume = torch.nn.functional.avg_pool3d(
                 reference_volume.unsqueeze(0).unsqueeze(0), 2)
             reference_volume = reference_volume.squeeze()
+            self.vol_box//=2

         for i in range(n_epochs):
             optimizer.zero_grad()

which is then used in generate_consensus_volume() here:

https://github.com/3dem/DynaMight/blob/eef4aa673af6cc908042b38646ae489ee8f2fde9/dynamight/models/decoder.py#L368-L375
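For illustration, a self-contained sketch of the patched logic; the helper name and signature are hypothetical, but the avg_pool3d call and the vol_box update mirror the diff above:

import torch

def downscale_reference_if_large(reference_volume, vol_box, max_box=360):
    # Mirror the patched block in decoder.py: downscale the reference by 2x
    # when the box exceeds max_box, and keep vol_box consistent so that
    # generate_consensus_volume() later sees the correct output size.
    if reference_volume.shape[-1] > max_box:
        reference_volume = torch.nn.functional.avg_pool3d(
            reference_volume.unsqueeze(0).unsqueeze(0), 2)
        reference_volume = reference_volume.squeeze()
        vol_box //= 2  # the missing update proposed in the diff above
    return reference_volume, vol_box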

However, I don't think that this is the most optimal way to deal with large boxes. If DynaMight has a cliff edge limit of 360 px then this should be documented and users advised to crop/downscale their particles appropriately. I could easily trim 12px from the edges of my particle boxes and other users with > 360 px boxes might also prefer to downsample to this size over automatic 2x downsampling?