MIC-DKFZ / nnUNet


Weight shape and input mismatch #327

Closed sbajpai2 closed 4 years ago

sbajpai2 commented 4 years ago

Hi Fabian,

I ran the 3d_cascade_fullres architecture for the liver task (Task003_Liver). All of the folds ran successfully except for fold 4. When I continued training it with -c, it gave me this error:

RuntimeError: Given groups=1, weight of size [32, 3, 3, 3, 3], expected input[2, 2, 128, 128, 128] to have 3 channels, but got 2 channels instead

I also pulled the nnunet repo before continuing; it is since that pull that I have been running into this error.

Can you please help me out with it?

Best, Shivam Bajpai

FabianIsensee commented 4 years ago

Hi, if there was a problem I would not be able to load the weights of my pretrained models, and I just verified that this works without problems. Have you made any modifications to nnU-Net? Best, Fabian

sbajpai2 commented 4 years ago

To cross-check, I did a fresh installation of nnUNet by pulling the repository and then started training from scratch. I am still getting this error:

2020-09-17 18:29:24.720177: epoch: 0
Traceback (most recent call last):
  File "/pylon5/bc5phlp/sbajpai2/environments/myenv/bin/nnUNet_train", line 33, in <module>
    sys.exit(load_entry_point('nnunet', 'console_scripts', 'nnUNet_train')())
  File "/pylon5/bc5phlp/sbajpai2/nnUNet/nnunet/run/run_training.py", line 142, in main
    trainer.run_training()
  File "/pylon5/bc5phlp/sbajpai2/nnUNet/nnunet/training/network_training/nnUNetTrainerV2.py", line 420, in run_training
    ret = super().run_training()
  File "/pylon5/bc5phlp/sbajpai2/nnUNet/nnunet/training/network_training/nnUNetTrainer.py", line 316, in run_training
    super(nnUNetTrainer, self).run_training()
  File "/pylon5/bc5phlp/sbajpai2/nnUNet/nnunet/training/network_training/network_trainer.py", line 440, in run_training
    l = self.run_iteration(self.tr_gen, True)
  File "/pylon5/bc5phlp/sbajpai2/nnUNet/nnunet/training/network_training/nnUNetTrainerV2.py", line 238, in run_iteration
    output = self.network(data)
  File "/pylon5/bc5phlp/sbajpai2/environments/myenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/pylon5/bc5phlp/sbajpai2/nnUNet/nnunet/network_architecture/generic_UNet.py", line 391, in forward
    x = self.conv_blocks_context[d](x)
  File "/pylon5/bc5phlp/sbajpai2/environments/myenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/pylon5/bc5phlp/sbajpai2/nnUNet/nnunet/network_architecture/generic_UNet.py", line 142, in forward
    return self.blocks(x)
  File "/pylon5/bc5phlp/sbajpai2/environments/myenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/pylon5/bc5phlp/sbajpai2/environments/myenv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/pylon5/bc5phlp/sbajpai2/environments/myenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/pylon5/bc5phlp/sbajpai2/nnUNet/nnunet/network_architecture/generic_UNet.py", line 65, in forward
    x = self.conv(x)
  File "/pylon5/bc5phlp/sbajpai2/environments/myenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/pylon5/bc5phlp/sbajpai2/environments/myenv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 567, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [32, 3, 3, 3, 3], expected input[2, 2, 128, 128, 128] to have 3 channels, but got 2 channels instead

I also looked into the pred_next_stage output from 3d_lowres; its shape has only one channel.

So 3d_cascade_fullres should take 2 channels as input, right? One from the pred_next_stage of 3d_lowres and one from the liver scan (stage 1). If that's the case, then the weight size should be [32, 2, 3, 3, 3], not [32, 3, 3, 3, 3].

Best, Shivam Bajpai

FabianIsensee commented 4 years ago

Hi, indeed there was a bug that I did not find for whatever reason. Please pip install --upgrade nnunet (or install the most recent master). Best, Fabian
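For context on why the first convolution expects 3 input channels here: in the cascade, the full-resolution network receives the CT channel plus one binary channel per foreground class of the previous-stage segmentation (background is dropped), so Task003_Liver with its two foreground classes (liver, tumour) needs 1 + 2 = 3 channels. Below is a rough, hypothetical sketch of that channel arithmetic (small patch size for illustration; this is not the actual nnU-Net data loader code):

import torch
import torch.nn.functional as F

# Task003_Liver: 1 CT modality, labels 0 = background, 1 = liver, 2 = tumour
ct = torch.randn(2, 1, 32, 32, 32)                # batch of CT patches, 1 channel
prev_seg = torch.randint(0, 3, (2, 32, 32, 32))   # previous-stage (3d_lowres) segmentation, integer labels

# one-hot encode the segmentation, move classes to the channel axis, drop the background channel
prev_onehot = F.one_hot(prev_seg, num_classes=3).permute(0, 4, 1, 2, 3)[:, 1:].float()

net_input = torch.cat([ct, prev_onehot], dim=1)
print(net_input.shape)  # torch.Size([2, 3, 32, 32, 32]); with the real 128^3 patch this is [2, 3, 128, 128, 128]

# the first convolution of the network accordingly has a weight of shape [32, 3, 3, 3, 3]
conv = torch.nn.Conv3d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
print(conv.weight.shape)  # torch.Size([32, 3, 3, 3, 3])
out = conv(net_input)     # works; feeding only 2 channels raises the RuntimeError shown above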

FabianIsensee commented 4 years ago

(pytorch_build) fabian@e230-AMDworkstation:~$ nnUNet_train 3d_cascade_fullres nnUNetTrainerV2CascadeFullRes 3 0

Please cite the following paper when using nnUNet: Fabian Isensee, Paul F. Jäger, Simon A. A. Kohl, Jens Petersen, Klaus H. Maier-Hein "Automated Design of Deep Learning Methods for Biomedical Image Segmentation" arXiv preprint arXiv:1904.08128 (2020). If you have questions or suggestions, feel free to open an issue at https://github.com/MIC-DKFZ/nnUNet

###############################################
I am running the following nnUNet: 3d_cascade_fullres
My trainer class is: <class 'nnunet.training.network_training.nnUNetTrainerV2_CascadeFullRes.nnUNetTrainerV2CascadeFullRes'>
For that I will be using the following configuration:
num_classes: 2
modalities: {0: 'CT'}
use_mask_for_norm OrderedDict([(0, False)])
keep_only_largest_region None
min_region_size_per_class None
min_size_per_class None
normalization_schemes OrderedDict([(0, 'CT')])
stages...

stage: 0 {'batch_size': 2, 'num_pool_per_axis': [5, 5, 5], 'patch_size': array([128, 128, 128]), 'median_patient_size_in_voxels': array([195, 207, 207]), 'current_spacing': array([2.473119 , 1.89831205, 1.89831205]), 'original_spacing': array([1. , 0.76757812, 0.76757812]), 'do_dummy_2D_data_aug': False, 'pool_op_kernel_sizes': [[2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]}

stage: 1 {'batch_size': 2, 'num_pool_per_axis': [5, 5, 5], 'patch_size': array([128, 128, 128]), 'median_patient_size_in_voxels': array([482, 512, 512]), 'current_spacing': array([1. , 0.76757812, 0.76757812]), 'original_spacing': array([1. , 0.76757812, 0.76757812]), 'do_dummy_2D_data_aug': False, 'pool_op_kernel_sizes': [[2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]}

I am using stage 1 from these plans
I am using batch dice + CE loss

I am using data from this folder: /home/fabian/data/nnUNet_preprocessed/Task003_Liver/nnUNetData_plans_v2.1
###############################################
loading dataset
loading all case properties
unpacking dataset
done
2020-09-18 16:14:27.579287: lr was set to: 0.01
using pin_memory on device 0
using pin_memory on device 0
2020-09-18 16:14:40.946933: Unable to plot network architecture:
2020-09-18 16:14:40.947266: No module named 'hiddenlayer'
2020-09-18 16:14:40.947313: printing the network instead:
[...]

2020-09-18 16:14:40.951343: epoch: 0
2020-09-18 16:17:06.532563: train loss : -0.3209
2020-09-18 16:17:17.107053: validation loss: -0.5771
2020-09-18 16:17:17.107497: Average global foreground Dice: [0.961544559158761, 0.8243685444972272]
2020-09-18 16:17:17.107538: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-09-18 16:17:17.397450: lr was set to: 0.009991
2020-09-18 16:17:17.397569: This epoch took 156.446099 s

2020-09-18 16:17:17.397602: epoch: 1
2020-09-18 16:19:32.113349: train loss : -0.4897
2020-09-18 16:19:42.707690: validation loss: -0.5771
2020-09-18 16:19:42.708007: Average global foreground Dice: [0.9679650378973282, 0.8088953765940735]
2020-09-18 16:19:42.708044: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-09-18 16:19:43.117329: lr was set to: 0.009982
2020-09-18 16:19:43.117451: This epoch took 145.719821 s

2020-09-18 16:19:43.117483: epoch: 2
2020-09-18 16:21:58.111564: train loss : -0.5459
2020-09-18 16:22:09.410944: validation loss: -0.6545
2020-09-18 16:22:09.411342: Average global foreground Dice: [0.9624385802566687, 0.8355435323758208]
2020-09-18 16:22:09.411381: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-09-18 16:22:09.760079: lr was set to: 0.009973
2020-09-18 16:22:09.798120: saving checkpoint...
2020-09-18 16:22:10.108958: done, saving took 0.35 seconds
2020-09-18 16:22:10.118228: This epoch took 147.000715 s

This is on a 2080 Ti; you should hopefully get similar epoch times. If not, make sure to use cuDNN 8.0.2 (either build PyTorch yourself or use the newest NGC docker), and make sure you have the I/O and CPU throughput you need.
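If you are unsure which cuDNN build your PyTorch installation is actually using, you can query it directly, for example:

python -c "import torch; print(torch.backends.cudnn.version())"
# prints e.g. 8002 for cuDNN 8.0.2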

sbajpai2 commented 4 years ago

Thanks Fabian. I will cross-check it.

I have one small query: do I have to run the validation with --npz for all the architectures in order to find the best configuration? When I ran: nnUNet_find_best_configuration -m 3d_fullres 3d_lowres 3d_cascade_fullres -t 003 --strict

I ended up with an AssertionError saying the npz files seem to be missing.

Best, Shivam

FabianIsensee commented 4 years ago

Hi, yes you need to run validation with --npz to do that. Best, Fabian
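For reference, the overall workflow would look roughly like the following, assuming the nnU-Net v1 flag names of that time (check nnUNet_train -h on your installation, as flags may differ between versions):

# re-run only the validation of already trained folds, exporting the softmax probabilities as .npz
nnUNet_train 3d_fullres nnUNetTrainerV2 3 0 -val --npz
nnUNet_train 3d_lowres nnUNetTrainerV2 3 0 -val --npz
nnUNet_train 3d_cascade_fullres nnUNetTrainerV2CascadeFullRes 3 0 -val --npz
# ... repeat for folds 1-4 of each configuration ...

# once the .npz files exist for all folds, run the configuration/ensemble selection
nnUNet_find_best_configuration -m 3d_fullres 3d_lowres 3d_cascade_fullres -t 003 --strict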