Closed anudeepk17 closed 1 year ago
I saw that the fix for that was setting resume_state to null; now I am facing another issue:
```
/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
  File "train_diff_model.py", line 76, in <module>
    diffusion.optimize_parameters()
  File "/srv/home/kumar256/DDM2/DDM2/model/model.py", line 92, in optimize_parameters
    total_loss.backward()
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
    self, gradient, retain_graph, create_graph, inputs=inputs
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/autograd/__init__.py", line 190, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/autograd/__init__.py", line 85, in _make_grads
    raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs
```
Issue resolved: the problem was that I was using 2 GPU IDs.
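For context, a minimal sketch of what likely happened (the loss values below are made up): with 2 GPU IDs, `nn.DataParallel` gathers the per-GPU scalar losses into a 1-D vector (hence the UserWarning above), and calling `.backward()` on a non-scalar tensor raises exactly this RuntimeError. Besides dropping to one GPU, reducing the gathered losses to a scalar first also avoids the error:

```python
import torch

# Two per-GPU scalar losses gathered into a vector (hypothetical values).
gathered_losses = torch.tensor([0.7, 0.9], requires_grad=True)

try:
    gathered_losses.backward()       # fails: the tensor is not a scalar
except RuntimeError as e:
    msg = str(e)
    print(msg)                       # grad can be implicitly created only for scalar outputs

total_loss = gathered_losses.mean()  # reduce to a scalar first
total_loss.backward()                # now succeeds
print(gathered_losses.grad)          # tensor([0.5000, 0.5000])
```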
I was able to train all stages. Next I wanted to denoise the whole dataset [320x320x128x28], so I set

dataset_opt['val_volume_idx'] = 'all'

But while training, my validation mask in the json was [10, 28], which I believe is why I was getting denoised data of size [320x320x128x18].
Since I wanted to denoise the whole volume, I changed the valid mask to [0, 28], but then I got this error:
```
2303 done 3584 to go!!
Traceback (most recent call last):
  File "denoise.py", line 76, in <module>
    for step, val_data in enumerate(val_loader):
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/srv/home/kumar256/DDM2/DDM2/data/mri_dataset.py", line 133, in __getitem__
    ret['matched_state'] = torch.zeros(1,) + self.matched_state[volume_idx][slice_idx]
KeyError: 18
```
Am I missing a step for denoising, or are there changes I need to make in the .json? I have set resume_state in the path section to the path of the stage 3 model, resume_state in noise_model to the stage 1 model, and stage2_file to the path of the stage 2 file.
Hi! Thanks for your interest in our work! The error simply indicates that the matched state for index 18 (the 19th slice) cannot be found in the stage 2 processed file. This is expected, since you trained on only 18 slices in all three stages (including stage 2). A quick fix could be to rerun stage 2 with the correct validation mask [0, 28] instead of [10, 28]; however, the denoising quality for the first 10 slices is not guaranteed (since they were not trained on in stages 1 and 3). Another solution is to train everything from scratch again, with the correct validation mask of course.
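To illustrate the failure mode, here is a hypothetical sketch (the dict layout is an assumption; only the indexing pattern mirrors `mri_dataset.py`): the stage 2 file stores one matched state per processed slice, so a mask of [10, 28] yields only 18 entries (keys 0..17), and asking for slice index 18 fails exactly as in the traceback:

```python
import torch

# Stage 2 was run with valid mask [10, 28] -> only 18 matched states.
n_matched = 28 - 10
matched_state = {0: {i: torch.tensor([0.5]) for i in range(n_matched)}}

volume_idx, slice_idx = 0, 18            # first slice outside the mask
try:
    ret = torch.zeros(1,) + matched_state[volume_idx][slice_idx]
except KeyError as exc:
    print(f"KeyError: {exc}")            # KeyError: 18

# A cheap pre-flight check before denoising with a wider mask:
# list the slices the stage 2 file never processed.
missing = [s for s in range(28) if s not in matched_state[volume_idx]]
print(missing)                           # slices 18..27 would need a stage 2 rerun
```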
So I would just like to confirm.
Thank you again for your time and help.
Hello! Firstly, thanks for this great paper and the detailed repo. I am training the network on DCE-MRI after explicitly adding noise to the 4D data. My dataset is 320x320. I was able to train stage 1 and stage 2 successfully. In stage 3 I am facing an error in model.py, line 223.
The error is: "loaded state dict contains a parameter group that doesn't match the size of optimizer's group". I delved deeper and found that "initial_lr" was missing from self.optG.state_dict()['param_groups'], while the loaded dict opt['optimizer']['param_groups'] had it. I thought the issue was that a new optimizer was being initialized while a trained optimizer was being loaded, so I added a line after line 65 in model.py.
After this addition I saw that self.optG and opt['optimizer'] had the same sizes and parameter groups, yet the error persists. Am I missing something, or was my approach wrong? The changes I made for my purposes: I changed image_size to 320 in the .json files and uncommented the resize line in the transform in mri_dataset.py, because I did not want to downsize my data, and I had to reduce the batch size to 2 for my training setup.
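For what it's worth, a small diagnostic sketch of this error (a toy model standing in for the real one): PyTorch's `Optimizer.load_state_dict()` raises this ValueError when a saved param group holds a different *number of parameters* than the freshly built optimizer's group; extra keys like `initial_lr` do not trigger it. Comparing the per-group counts usually pinpoints the mismatch:

```python
import torch

model = torch.nn.Linear(4, 4)                      # toy stand-in: 2 params (weight, bias)
optG = torch.optim.Adam(model.parameters(), lr=1e-4)

saved = optG.state_dict()
# Simulate a checkpoint whose group covers fewer parameters.
saved['param_groups'][0]['params'] = saved['param_groups'][0]['params'][:1]

current_sizes = [len(g['params']) for g in optG.state_dict()['param_groups']]
saved_sizes = [len(g['params']) for g in saved['param_groups']]
print(current_sizes, saved_sizes)                  # [2] [1] -> the groups disagree

try:
    optG.load_state_dict(saved)
except ValueError as e:
    err = str(e)
    print(err)  # loaded state dict contains a parameter group that doesn't match ...
```

So the thing to compare is the length of each group's `params` list on both sides, not the presence of keys like `initial_lr`.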
I thank you in advance for your time.