alncat / opusDSD

deep structural disentanglement

RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [104, 2] but got: [104, 3]. #10

mgrageraz opened this issue 8 months ago

mgrageraz commented 8 months ago

Describe the bug
Hi there! I'm following the tutorial with the spliceosome dataset (EMPIAR-10180). I have completed the "prepare data" stage, running dsdsh prepare to get the .pkl files for CTF parameters and Euler angles. I also created single .star and .mrcs files with relion_stack_create, and everything was fine up to that point. However, in the training step I get the following error immediately after the first epoch starts.

```
current_ind: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  0%|          | 0/1538 [00:08<?, ?it/s]
Traceback (most recent call last):
  File "/home/spa/anaconda3/envs/dsd/bin/dsd", line 8, in <module>
    sys.exit(main())
  File "/home/spa/scipion/software/em/opusDSD/cryodrgn/main.py", line 59, in main
    args.func(args)
  File "/home/spa/scipion/software/em/opusDSD/cryodrgn/commands/train_cv.py", line 955, in main
    train_batch(model, lattice, y, yt, rot, tran, optim, beta,
  File "/home/spa/scipion/software/em/opusDSD/cryodrgn/commands/train_cv.py", line 134, in train_batch
    z_mu, z_logstd, z, y_recon, y_recon_tilt, losses, y, y_ffts, mus, euler_samples, y_recon_ori, neg_mus, mask_sum = run_batch(
  File "/home/spa/scipion/software/em/opusDSD/cryodrgn/commands/train_cv.py", line 319, in run_batch
    z, encout = model.vanilla_encode(diff, rot, trans, eulers=euler, num_gpus=args.num_gpus, snr2=snr2)
  File "/home/spa/scipion/software/em/opusDSD/cryodrgn/models.py", line 225, in vanilla_encode
    encout = self.encoder(img, rots, trans, losslist=["kldiv"], eulers=eulers, snr=snr)
  File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/spa/scipion/software/em/opusDSD/cryodrgn/models.py", line 682, in forward
    x_fft = self.translate_ft2d(x_fft, -trans[i:i+1]*self.render_size/self.vol_size)
  File "/home/spa/scipion/software/em/opusDSD/cryodrgn/models.py", line 612, in translate_ft2d
    tfilt = coords @ t * 2 * np.pi  # BxCxHxWx1
RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [104, 2] but got: [104, 3].
```

To Reproduce

```
dsd train_cv /home/spa/data/EMPIAR_10180/10180/data/Spliceosome_edit_mrcs/Runs/000210_ProtRelionExportParticles/Export/all.mrcs --ctf /home/spa/data/EMPIAR_10180/10180/data/Spliceosome_edit_mrcs/Runs/000210_ProtRelionExportParticles/Export/all_ctf.pkl --poses /home/spa/data/EMPIAR_10180/10180/data/Spliceosome_edit_mrcs/Runs/000210_ProtRelionExportParticles/Export/all_pose_euler.pkl --lazy-single --pe-type vanilla --encode-mode grad --template-type conv -n 20 -b 12 --zdim 12 --lr 1.e-4 --num-gpus 4 --multigpu --beta-control 2. --beta cos -o ./sp -r ./global_mask.mrc --downfrac 0.33 --valfrac 0.25 --lamb 1. --split sp-split.pkl --bfactor 4. --templateres 224
```

Expected behavior
Since I followed all the steps required to reach the "training" stage, I was expecting the first epoch to run through. Before reaching this point, the terminal reports that the particles were successfully loaded into memory, and the CTF and Euler-angle entries of the first particle are also printed.

Additional context
There is one slight modification of the input particle set compared with the tutorial. A couple of images in the original .mrcs file deposited in EMPIAR appear to be corrupted, with pixels containing NaN values. I therefore created a subset of 100k randomly selected particles that excludes the corrupted ones, and it was this 100k subset that went through the "prepare data" stage described above.
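For anyone hitting the same corrupted-image problem, a minimal sketch of how NaN-containing images could be located before subsetting; it uses the mrcfile and numpy packages with an illustrative file name and is not part of the original workflow:

```python
# Illustrative only: scan an .mrcs stack and report which particle images
# contain NaN pixels, so they can be excluded before preparing the data.
import mrcfile
import numpy as np

def find_nan_particles(stack_path):
    """Return indices of images in the stack that contain NaN pixels."""
    bad = []
    with mrcfile.mmap(stack_path, mode="r") as mrc:
        for i, img in enumerate(mrc.data):
            if np.isnan(np.asarray(img)).any():
                bad.append(i)
    return bad

print(find_nan_particles("all.mrcs"))  # path is illustrative
```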

Many thanks for your time, and looking forward to trying this amazing software! Best, Marcos.

alncat commented 8 months ago

@mgrageraz Hi Marcos, thank you very much for reporting this! It looks like the translation in the pose parameters is 3-dimensional, whereas it should be 2-dimensional. Could you please send your star file to me? I will check whether the prepare script produces the correct translations for each image. You can contact me via my email address (tluozhenwei@gmail.com).
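A quick way to check this locally is to print the shapes of the arrays stored in the pose .pkl; the snippet below is only a sketch (the exact layout of the pickle is assumed, and the file name is the one from the command above):

```python
# Print the shapes of the arrays in the pose pickle; a translation array of
# shape (N, 3) instead of (N, 2) would explain the error above.
import pickle
import numpy as np

with open("all_pose_euler.pkl", "rb") as f:  # file name from the command above
    poses = pickle.load(f)

arrays = poses if isinstance(poses, (tuple, list)) else [poses]
for i, arr in enumerate(arrays):
    print(i, np.asarray(arr).shape)
```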

alncat commented 8 months ago

Hi Marcos, it is caused by an extra rlnOriginZAngst column in the star file. OPUS-DSD treats the translations in such a star file as 3D and creates an Nx3 array to store them. You may drop that column using pyem.
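As an alternative to pyem, here is a minimal sketch of dropping the column with the starfile Python package (file names are illustrative, and the block layout of your star file may differ):

```python
# Drop the rlnOriginZAngst column so translations are read as 2D.
import starfile

star = starfile.read("all.star")  # illustrative input file name

# RELION 3.1-style files come back as a dict of blocks, older ones as a single table.
particles = star["particles"] if isinstance(star, dict) else star
particles.drop(columns=["rlnOriginZAngst"], inplace=True, errors="ignore")

starfile.write(star, "all_drop.star")  # illustrative output file name
```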

mgrageraz commented 8 months ago

Hi @alncat! Many thanks for your quick response. I no longer get that error, so it worked! However, a new error appeared. This was the command:

```
dsd train_cv /home/spa/data/EMPIAR_10180/10180/data/Spliceosome_edit_mrcs/Runs/000210_ProtRelionExportParticles/Export/all_drop.mrcs --ctf /home/spa/data/EMPIAR_10180/10180/data/Spliceosome_edit_mrcs/Runs/000210_ProtRelionExportParticles/Export/all_ctf.pkl --poses /home/spa/data/EMPIAR_10180/10180/data/Spliceosome_edit_mrcs/Runs/000210_ProtRelionExportParticles/Export/all_drop_pose_euler.pkl --lazy-single --pe-type vanilla --encode-mode grad --template-type conv -n 20 -b 12 --zdim 12 --lr 1.e-4 --num-gpus 4 --multigpu --beta-control 2. --beta cos -o ./sp -r ./globalmask.mrc --downfrac 0.33 --valfrac 0.25 --lamb 1. --split sp-split.pkl --bfactor 4. --templateres 224
```

And this is the error:

```
2024-02-26 13:27:46     image will be downsampled to 0.33 of original size 320
2024-02-26 13:27:46     reconstruction will be blurred by bfactor 4.0
2024-02-26 13:27:46     learning rate [0.0001], bfactor: 4.333333333333333, beta_max: 1.0, beta_control: 2.0 for epoch 0
  0%|          | 0/1538 [00:00<?, ?it/s]
ns: [720, 480, 576, 384, 1008, 480, 432, 384, 480, 1200, 624, 1296, 3744, 4848, 1008, 3120, 576, 432, 4560, 768, 2448, 672, 1824, 672, 336, 384, 1008, 480, 384, 1008, 384, 384, 6864, 7440, 3360, 1872, 1248, 624, 1632, 528, 336, 240, 240, 288, 1968, 4176, 720, 5184]
current_ind: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
2024-02-26 13:27:53     intializing multi_mu of 100000, 4
/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
  0%|          | 0/1538 [00:08<?, ?it/s]
Traceback (most recent call last):
  File "/home/spa/anaconda3/envs/dsd/bin/dsd", line 8, in <module>
    sys.exit(main())
  File "/home/spa/scipion/software/em/opusDSD/cryodrgn/main.py", line 59, in main
    args.func(args)
  File "/home/spa/scipion/software/em/opusDSD/cryodrgn/commands/train_cv.py", line 955, in main
    train_batch(model, lattice, y, yt, rot, tran, optim, beta,
  File "/home/spa/scipion/software/em/opusDSD/cryodrgn/commands/train_cv.py", line 147, in train_batch
    loss, gen_loss, snr, mu2, std2, mmd, c_mmd, top_euler, mse = loss_function(z_mu, z_logstd, y, yt, y_recon,
  File "/home/spa/scipion/software/em/opusDSD/cryodrgn/commands/train_cv.py", line 464, in loss_function
    assert torch.isnan(gen_loss).item() is False
AssertionError
```

This error can be overcome by setting --num-gpus to 1 or by removing the --multigpu option; in that case, OPUS-DSD can start the first epoch. However, it would be great if I could run on more than one GPU at a time to speed up the process. My workstation has 4 x Tesla T4 16 GB (that's why I have to downsample the particles so much, otherwise I get out-of-memory errors).

Many thanks for your time. Marcos.

alncat commented 8 months ago

Hi @mgrageraz , memory consumption can also be controlled with --templateres, which determines the size of the output volume. You may consider using a smaller templateres such as 160 (so the output volume has size 160) together with a downfrac of about 0.5 (so the training image has size 160), and then set -b to a smaller value so that the model fits into your GPU. You may also need a smaller learning rate for a smaller batch size and image size. I am not yet sure why the loss becomes NaN (it might indicate that the learning rate is too large).

alncat commented 8 months ago

Hi @mgrageraz , I tested the setting templateres 160, downfrac 0.5, -b 12 and num-gpus 4. It only consumes about 10 GB of memory per GPU.
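For reference, the full command with those settings would look roughly like the sketch below; it simply combines the flags already used in this thread with the values mentioned above (paths shortened, adjust to your own data):

```
dsd train_cv all_drop.mrcs \
    --ctf all_ctf.pkl --poses all_drop_pose_euler.pkl \
    --lazy-single --pe-type vanilla --encode-mode grad --template-type conv \
    -n 20 -b 12 --zdim 12 --lr 1.e-4 --num-gpus 4 --multigpu \
    --beta-control 2. --beta cos -o ./sp -r ./globalmask.mrc \
    --downfrac 0.5 --valfrac 0.25 --lamb 1. --split sp-split.pkl \
    --bfactor 4. --templateres 160
```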

mgrageraz commented 8 months ago

Hi @alncat , thanks for the recommendations; they helped save GPU memory and avoid downsampling too much. However, it only works for single-GPU runs: whenever I enable the --multigpu option (trying different combinations of -b from 4 to 12 and/or --lr from 1.e-4 to 1.e-6), it throws the NaN error I pasted before.

Another observation is that, when the epoch starts on a single GPU, parameters like loss, mu, snr and std show NaN values (see the attached screenshot, which corresponds to the first epoch).

Is it expected that they always show these NaN values?

Best regards, Marcos.

alncat commented 8 months ago

Hi @mgrageraz , NaN in the loss does not seem to be the expected behavior. Can you test the program using the test_down.mrcs and test*.pkl files in the folder https://drive.google.com/drive/folders/1tEVu9PjCR-4pvkUK17fAHHpyw6y3rZcK?usp=sharing ? I prepared these test files by selecting the first 5000 images from consensus_data.star. The images in the mrcs are already downsampled to 160x160, so you can set downfrac to 1.0 when using this dataset. Besides, you should probably pull the latest code from this repository: I added a function to dataset.py that filters NaN values in images before loading them (this avoids NaNs in the image stacks being passed into the model).
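(The guard referred to above amounts to something like the following minimal sketch; this only illustrates the idea and is not the actual code in dataset.py.)

```python
# Replace NaN (and infinite) pixel values with zeros before an image is
# handed to the model, so a few corrupted pixels cannot poison the loss.
import numpy as np

def filter_nan(img: np.ndarray) -> np.ndarray:
    return np.nan_to_num(img, nan=0.0, posinf=0.0, neginf=0.0)
```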

alncat commented 8 months ago
[screenshot: expected training log]

This is the expected log.

mgrageraz commented 8 months ago

Hi @alncat! Problem solved. I followed your instructions with the 5,000-particle dataset and still had NaN values. We had initially created an environment from environment.yml, running CUDA 10.2 and PyTorch 1.11.0, and that setup threw all these multi-GPU errors and NaN values. We then switched to environmentcu11torch11.yml, with CUDA 11.3, and not only were the values OK, but I was also able to run on multiple GPUs. The test run finished completely and the results could be analyzed successfully. Now I'll try with my own data. Many thanks for your assistance.
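For anyone reproducing this fix, recreating the environment from the CUDA 11 spec would typically look like the commands below (the environment name dsd is assumed from the paths in the tracebacks above; use whatever name your yml file defines):

```
conda env create -f environmentcu11torch11.yml
conda activate dsd
```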

alncat commented 8 months ago

@mgrageraz That sounds great! Thank you very much for reporting!