StanfordMIMI / DDM2

[ICLR2023] Official repository of DDM2: Self-Supervised Diffusion MRI Denoising with Generative Diffusion Models

Inference Error when denoising all volumes #14

Closed anudeepk17 closed 1 year ago

anudeepk17 commented 1 year ago

Hello, firstly thanks for this great paper and the detailed repo. I am training the network on DCE-MRI after adding noise explicitly to the 4D data. My dataset is 320x320. I was able to train stage 1 and stage 2 successfully. In stage 3 I am facing an error at model.py, line 223:

self.optG.load_state_dict(opt['optimizer'])

The error is: "loaded state dict contains a parameter group that doesn't match the size of optimizer's group". I delved deeper and found "initial_lr" missing from self.optG.state_dict()['param_groups'], while the loaded dict opt['optimizer']['param_groups'] had it. I thought the issue was that a new optimizer is being initialized while a trained optimizer's state is being loaded, so I added a line after line 65 in model.py:

65 | self.optG = torch.optim.Adam(optim_params, lr=opt['train']["optimizer"]["lr"])

Line added:

66 | self.scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(self.optG, opt['train']['n_iter'], eta_min=opt['train']["optimizer"]["lr"])

After this addition I saw that self.optG and opt['optimizer'] have the same size and parameter groups, yet the error persists. Am I missing something, or was my approach wrong? The changes I made for my purposes: I changed image_size to 320 in the .json files, uncommented the resize line in the transform in mri_dataset.py because I did not want to downsize my data, and reduced the batch size to 2.

I thank you in advance for your time.
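For context, a brief sketch of the initial_lr behaviour described above. This is not the repository's code; the model and numbers are placeholders. PyTorch LR schedulers write an initial_lr entry into each optimizer param group when they are constructed, so a fresh optimizer without a scheduler will not have that key even though a checkpointed, scheduler-equipped one does.

```python
# Minimal sketch, not the repository's code: a placeholder model and optimizer
# showing that constructing an LR scheduler adds 'initial_lr' to each param
# group, which is the key anudeepk17 found missing from the fresh optimizer.
import torch

model = torch.nn.Linear(4, 4)  # placeholder network
optG = torch.optim.Adam(model.parameters(), lr=1e-4)
print('initial_lr' in optG.param_groups[0])  # False: fresh optimizer

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optG, T_max=1000, eta_min=1e-6)
print('initial_lr' in optG.param_groups[0])  # True: the scheduler set it
```

That said, the "parameter group that doesn't match the size" message is typically raised when the number of parameters per group differs between the checkpoint and the freshly built optimizer, so the missing initial_lr key alone may not be the cause; setting resume_state to null, as noted below, avoids the mismatched load entirely.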

anudeepk17 commented 1 year ago

I saw that the fix for that is setting resume_state to null; now I am facing another issue:

/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
  File "train_diff_model.py", line 76, in <module>
    diffusion.optimize_parameters()
  File "/srv/home/kumar256/DDM2/DDM2/model/model.py", line 92, in optimize_parameters
    total_loss.backward()
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
    self, gradient, retain_graph, create_graph, inputs=inputs
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/autograd/__init__.py", line 190, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/autograd/__init__.py", line 85, in _make_grads
    raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs

anudeepk17 commented 1 year ago

Issue resolved: it was a problem of me using 2 GPU IDs.
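The UserWarning above hints at the likely mechanism: with torch.nn.DataParallel over two GPUs, each replica returns a scalar loss and the gather step stacks them into a length-2 vector, which backward() cannot handle without an explicit reduction. A minimal sketch of that failure mode and the usual workaround, using a placeholder tensor rather than the actual DDM2 loss:

```python
import torch

# Placeholder for the gathered per-GPU losses (one scalar per replica).
per_gpu_losses = torch.tensor([0.8, 0.7], requires_grad=True)

# per_gpu_losses.backward()        # RuntimeError: grad can be implicitly created only for scalar outputs
per_gpu_losses.mean().backward()   # reducing to a scalar first succeeds
```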

anudeepk17 commented 1 year ago

I was able to train all stages; now I want to denoise the whole dataset [320x320x128x28]. I set

dataset_opt['val_volume_idx']='all'

But in the .json used during training my validation mask was [10,28], due to which I believe I was getting denoised data of size [320x320x128x18].

I wanted to denoise the whole dataset, so I changed the validation mask to [0,28], but I got this error:

2303 done 3584 to go!!
Traceback (most recent call last):
  File "denoise.py", line 76, in <module>
    for step, val_data in enumerate(val_loader):
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/srv/home/kumar256/DDM2/DDM2/data/mri_dataset.py", line 133, in __getitem__
    ret['matched_state'] = torch.zeros(1,) + self.matched_state[volume_idx][slice_idx]
KeyError: 18

Am I missing any step for denoising, or any changes I need to make in the .json? I have kept resume_state in the path section as the path to the stage 3 model, resume_state in the noise_model section as the stage 1 model, and stage2file as the path to the stage 2 file.

tiangexiang commented 1 year ago

Hi! Thanks for your interest in our work! The error simply indicates that the matched state for index 18 (19th slice) cannot be found in the stage2 processed file. This is an expected error since you trained on only 18 slices for all three stages (including stage2). A quick fix could be to rerun stage2 with the correct validation mask [0, 28] instead of [10, 28]; however, the denoising quality for the first 10 slices may not be guaranteed (since they were not trained on in stages 1 and 3). Another solution is to train everything from scratch again, with the correct validation mask of course.
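A simplified illustration of the situation described above, assuming (from the traceback) that matched_state behaves like a mapping that only contains entries for the indices processed in stage 2; the dictionary below is a stand-in, not the real stage 2 file:

```python
# With a [10, 28] validation mask, stage 2 produced 18 matched states.
# If those end up keyed 0..17, any index outside that set is simply absent.
matched_state = {i: 0.5 for i in range(18)}  # placeholder values

print(matched_state[17])   # fine: this index was processed in stage 2
try:
    matched_state[18]      # absent: this index never went through stage 2
except KeyError as e:
    print('KeyError:', e)  # KeyError: 18, exactly as in the DataLoader traceback
```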

anudeepk17 commented 1 year ago

So I would just like to confirm:

  1. That we cannot use the model obtained in stage 3 for denoising any data other than what we trained it on. For any new data we need to train all three stages.
  2. Should I change resume_state in the noise_model section to the stage 3 model?
  3. For stage 3 we need to keep resume_state as null in the path section.

Thank you for your great work.
tiangexiang commented 1 year ago
  1. Yes. Our algorithm is an optimization-based method; it cannot generalize (or generalizes only poorly) to unseen data points.
  2. For stage3 training, you need to specify the stage1 resume_state. For inference, you don't need to.
  3. For initiating stage3 training, you can keep resume_state as null (which means training the stage3 model from scratch). For resumed training or inference, you need to change resume_state to your checkpoint (see the sketch below).
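As a minimal, self-contained illustration of points 2 and 3: the config dictionary, checkpoint layout, and stand-in network here are placeholders for illustration only, not the actual DDM2 code.

```python
import torch

opt = {'path': {'resume_state': None}}   # hypothetical excerpt of the stage3 .json

model = torch.nn.Linear(4, 4)            # stand-in for the stage3 network
optG = torch.optim.Adam(model.parameters(), lr=1e-4)

resume_state = opt['path']['resume_state']
if resume_state is None:
    # Fresh stage3 training: nothing to restore, train from scratch.
    pass
else:
    # Resumed training or inference: restore from the checkpoint path.
    ckpt = torch.load(resume_state)            # checkpoint layout is assumed
    model.load_state_dict(ckpt['network'])     # 'network' key is a placeholder name
    optG.load_state_dict(ckpt['optimizer'])    # mirrors the load at model.py line 223
```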
anudeepk17 commented 1 year ago

Thank you again for your time and help.