mgrageraz opened this issue 8 months ago.
@mgrageraz Hi Marcos, thank you very much for reporting this! It looks like the translations in the pose parameter are 3-dimensional, while they should be 2-dimensional. Could you please send your star file to me? I will check whether the prepare script produces the correct translations for each image. You can contact me via my email address (tluozhenwei@gmail.com).
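A generic way to inspect what is stored in the pose .pkl is to print the shape of each array it contains. This is only a rough sketch; the exact layout of the pickle depends on OPUS-DSD's prepare script, and the file name below is a placeholder:

```python
import pickle
import numpy as np

# Placeholder path; use the .pkl passed to --poses.
with open("all_pose_euler.pkl", "rb") as f:
    pose = pickle.load(f)

# Print the shape of every array in the pickle. The translations should be
# of shape (N, 2); an (N, 3) array indicates an extra Z origin was kept.
items = pose if isinstance(pose, (tuple, list)) else [pose]
for i, item in enumerate(items):
    arr = np.asarray(item)
    print(f"item {i}: shape {arr.shape}, dtype {arr.dtype}")
```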
Hi Marcos, it is caused by an extra rlnOriginZAngst column in the star file. OPUS-DSD treats the translations in such a star file as 3D and creates an Nx3 array to store them. You may drop that column using pyem.
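For reference, a minimal sketch of dropping that column, here using the starfile Python package as an alternative to pyem (file names are placeholders):

```python
import starfile

# Placeholder file names; adapt to your own star file.
blocks = starfile.read("all.star", always_dict=True)  # e.g. optics + particles blocks

# Drop the extra Z-translation column wherever it appears, so OPUS-DSD
# only sees 2D (x, y) translations.
for name, table in blocks.items():
    if hasattr(table, "columns") and "rlnOriginZAngst" in table.columns:
        blocks[name] = table.drop(columns=["rlnOriginZAngst"])

starfile.write(blocks, "all_drop.star")
```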
Hi @alncat! Many thanks for your quick response. I no longer get that error, so it worked! However, a new error appeared. This was the command:
dsd train_cv /home/spa/data/EMPIAR_10180/10180/data/Spliceosome_edit_mrcs/Runs/000210_ProtRelionExportParticles/Export/all_drop.mrcs --ctf /home/spa/data/EMPIAR_10180/10180/data/Spliceosome_edit_mrcs/Runs/000210_ProtRelionExportParticles/Export/all_ctf.pkl --poses /home/spa/data/EMPIAR_10180/10180/data/Spliceosome_edit_mrcs/Runs/000210_ProtRelionExportParticles/Export/all_drop_pose_euler.pkl --lazy-single --pe-type vanilla --encode-mode grad --template-type conv -n 20 -b 12 --zdim 12 --lr 1.e-4 --num-gpus 4 --multigpu --beta-control 2. --beta cos -o ./sp -r ./globalmask.mrc --downfrac 0.33 --valfrac 0.25 --lamb 1. --split sp-split.pkl --bfactor 4. --templateres 224
And this is the error:
2024-02-26 13:27:46 image will be downsampled to 0.33 of original size 320
2024-02-26 13:27:46 reconstruction will be blurred by bfactor 4.0
2024-02-26 13:27:46 learning rate [0.0001], bfactor: 4.333333333333333, beta_max: 1.0, beta_control: 2.0 for epoch 0
0%| | 0/1538 [00:00<?, ?it/s]ns: [720, 480, 576, 384, 1008, 480, 432, 384, 480, 1200, 624, 1296, 3744, 4848, 1008, 3120, 576, 432, 4560, 768, 2448, 672, 1824, 672, 336, 384, 1008, 480, 384, 1008, 384, 384, 6864, 7440, 3360, 1872, 1248, 624, 1632, 528, 336, 240, 240, 288, 1968, 4176, 720, 5184]
current_ind: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
2024-02-26 13:27:53 intializing multi_mu of 100000, 4
/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
0%| | 0/1538 [00:08<?, ?it/s]
Traceback (most recent call last):
File "/home/spa/anaconda3/envs/dsd/bin/dsd", line 8, in
This error can be overcome by setting --num-gpus to 1 or by removing the --multigpu option; in that case, OPUS-DSD can start the first epoch. However, it would be great if I could make it run on more than one GPU at the same time to speed up the process. My workstation has 4 x Tesla T4 16 GB (that's why I have to downsample the particles so much; otherwise I get "out of memory" errors).
Many thanks for your time. Marcos.
Hi @mgrageraz , the memory consumption can also be controlled by --templateres, which determines the size of the output volume. You may consider using a smaller templateres such as 160 (the output volume is then of size 160) together with a downfrac of 0.5 (the training image is then of size 160), and then set -b to a smaller value so that the model fits into your GPU. You may also need a smaller learning rate for a smaller batch size and image size. I am not yet sure why the loss becomes NaN (it might indicate that the learning rate is too large).
Hi @mgrageraz , I tested the settings --templateres 160, --downfrac 0.5, -b 12 and --num-gpus 4. It only consumes about 10 GB of memory per GPU.
Hi @alncat, thanks for the recommendations; they helped to save GPU memory and avoid downsampling too much. However, it only works for single-GPU runs: whenever I enable the --multigpu option (trying different combinations of -b from 4 to 12 and/or --lr from 1.e-4 to 1.e-6), it throws the NaN error I pasted before.
Another observation: when the epoch starts on a single GPU, parameters like loss, mu, snr and std show NaN values (see the attached image, which corresponds to the first epoch).
Is it OK that they always show these NaN values?
Best regards, Marcos.
Hi @mgrageraz , NaN in the loss is not the expected behavior. Can you test the program using the test_down.mrcs and test*.pkls in the folder https://drive.google.com/drive/folders/1tEVu9PjCR-4pvkUK17fAHHpyw6y3rZcK?usp=sharing ? I prepared these test files by selecting the first 5000 images from consensus_data.star. The images in the mrcs are already downsampled to 160x160, so you can set --downfrac to 1.0 when using this dataset. Besides, you may want to pull the latest code from this repository: I added a function that filters NaN values in images before loading them in dataset.py (this avoids NaN values in the image stacks being passed into the model).
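As an illustration of the kind of filtering described (not the actual dataset.py code), a minimal sketch:

```python
import numpy as np

def filter_nan(img: np.ndarray) -> np.ndarray:
    """Replace non-finite pixels (NaN/Inf) with zero before the image enters the model."""
    if not np.isfinite(img).all():
        img = np.nan_to_num(img, nan=0.0, posinf=0.0, neginf=0.0)
    return img
```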
This is the expected log.
Hi, @alncat!
Problem solved. I followed your instructions with this 5,000-particle dataset and still had NaN values. We had initially created an environment using environment.yml, running CUDA 10.2 and PyTorch 1.11.0, which threw all these multi-GPU errors and NaN values. We then switched to environmentcu11torch11.yml, with CUDA 11.3, and not only were the values OK, but I was also able to run on multiple GPUs. The test run finished completely and the results could be analyzed successfully. Now I'll try with my own data. Many thanks for your assistance.
@mgrageraz That sounds great! Thank you very much for reporting!
Describe the bug
Hi there! I'm following the tutorial with the spliceosome dataset (EMPIAR-10180). I have completed the "prepare data" stage, running dsdsh prepare to get the .pkl files for the CTF and Euler angles. I also created single .star and .mrcs files with relion_stack_create, and everything was OK so far. However, in the training step, I get the following error immediately after starting the first epoch.
current_ind: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
0%| | 0/1538 [00:08<?, ?it/s]
Traceback (most recent call last):
File "/home/spa/anaconda3/envs/dsd/bin/dsd", line 8, in <module>
sys.exit(main())
File "/home/spa/scipion/software/em/opusDSD/cryodrgn/main.py", line 59, in main
args.func(args)
File "/home/spa/scipion/software/em/opusDSD/cryodrgn/commands/train_cv.py", line 955, in main
train_batch(model, lattice, y, yt, rot, tran, optim, beta,
File "/home/spa/scipion/software/em/opusDSD/cryodrgn/commands/train_cv.py", line 134, in train_batch
z_mu, z_logstd, z, y_recon, y_recon_tilt, losses, y, y_ffts, mus, euler_samples, y_recon_ori, neg_mus, mask_sum = run_batch(
File "/home/spa/scipion/software/em/opusDSD/cryodrgn/commands/train_cv.py", line 319, in run_batch
z, encout = model.vanilla_encode(diff, rot, trans, eulers=euler, num_gpus=args.num_gpus, snr2=snr2)
File "/home/spa/scipion/software/em/opusDSD/cryodrgn/models.py", line 225, in vanilla_encode
encout = self.encoder(img, rots, trans, losslist=["kldiv"], eulers=eulers, snr=snr)
File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/_utils.py", line 457, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/spa/anaconda3/envs/dsd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/spa/scipion/software/em/opusDSD/cryodrgn/models.py", line 682, in forward
x_fft = self.translate_ft2d(x_fft, -trans[i:i+1]*self.render_size/self.vol_size)
File "/home/spa/scipion/software/em/opusDSD/cryodrgn/models.py", line 612, in translate_ft2d
tfilt = coords @ t * 2 * np.pi # BxCxHxWx1
RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [104, 2] but got: [104, 3].
To Reproduce
dsd train_cv /home/spa/data/EMPIAR_10180/10180/data/Spliceosome_edit_mrcs/Runs/000210_ProtRelionExportParticles/Export/all.mrcs --ctf /home/spa/data/EMPIAR_10180/10180/data/Spliceosome_edit_mrcs/Runs/000210_ProtRelionExportParticles/Export/all_ctf.pkl --poses /home/spa/data/EMPIAR_10180/10180/data/Spliceosome_edit_mrcs/Runs/000210_ProtRelionExportParticles/Export/all_pose_euler.pkl --lazy-single --pe-type vanilla --encode-mode grad --template-type conv -n 20 -b 12 --zdim 12 --lr 1.e-4 --num-gpus 4 --multigpu --beta-control 2. --beta cos -o ./sp -r ./global_mask.mrc --downfrac 0.33 --valfrac 0.25 --lamb 1. --split sp-split.pkl --bfactor 4. --templateres 224
Expected behavior
Since I followed all the steps required to reach the "training" stage, I was expecting to get through the first epoch. Before reaching this point, the terminal reports that the particles were successfully loaded into memory, and the CTF and Euler values of the first particle from the .pkl files are also printed.
Additional context
There is a slight modification to the input particle set compared with the tutorial. Apparently, a couple of images in the original .mrcs file deposited in EMPIAR seem to be corrupted, with pixels showing NaN values. Therefore, I created a subset of 100k randomly selected particles that excludes the corrupted ones, and it was this 100k subset that went through the "prepare data" stage described above.
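For reference, a minimal sketch of locating such corrupted images with the mrcfile package (the stack name is a placeholder):

```python
import mrcfile
import numpy as np

# Placeholder stack name; point this at the deposited .mrcs file.
with mrcfile.mmap("particles.mrcs", mode="r") as mrc:
    bad = [i for i, img in enumerate(mrc.data) if not np.isfinite(img).all()]

print(f"{len(bad)} images contain NaN/Inf pixels: {bad}")
```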
Many thanks for your time, and looking forward to trying this amazing software! Best, Marcos.