Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0
5.86k stars 1.08k forks

DynUNet training pipeline fails on custom dataset (StopIteration exception) #7264

Closed — ckolluru closed this 11 months ago

ckolluru commented 11 months ago

Describe the bug I'm trying to run the DynUNet pipeline from the tutorials on a custom CT dataset. I've set it up as task 11, in a format similar to the Medical Segmentation Decathlon datasets. A StopIteration exception is raised (the underlying error is a KeyError in a DataLoader worker). Could you let me know if I'm missing a step in the training pipeline (described below) or passing incorrect arguments to the train.py script? Thanks.

Exception has occurred: KeyError
Caught KeyError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/home/paperspace/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 1032, in _run_once_on_dataset_as_gen
    self.state.batch = next(self._dataloader_iter)
  File "/home/paperspace/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/paperspace/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1318, in _next_data
    raise StopIteration
StopIteration

To Reproduce Steps to reproduce the behavior:

  1. Set up the data into imagesTr, labelsTr and imagesTs subfolders inside a "Task11_XYZ" folder in the root data directory. The CT data type is float32 and the label data type is uint8. Provide a dataset.json file describing the dataset.
  2. Run create_datalist.py to generate dataset_task11.json with the training/val splits for each of the five folds.
  3. Run calculate_task_params.py to get the normalization and clipping values for the custom dataset. Add those parameters to task_params.py
  4. Run train.py with the following arguments: "--fold", "0", "--train_num_workers", "4", "--interval", "2", "--num_samples", "3", "--learning_rate", "1e-2", "--max_epochs", "10", "--task_id", "11", "--pos_sample_num", "2", "--expr_name", "baseline", "--tta_val", "True", "--determinism_flag", "True", "--determinism_seed", "0"
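For reference, here is a minimal, hedged sketch of what step 2 is expected to produce: a decathlon-style datalist in which every training entry carries an "image" key, a "label" key, and a fold assignment. The file names, the round-robin fold split, and the `build_datalist` helper are illustrative assumptions, not the actual contents of create_datalist.py.

```python
import json

def build_datalist(image_label_pairs, num_folds=5):
    """Assemble a decathlon-style datalist with a fold index per entry."""
    datalist = []
    for idx, (img, lbl) in enumerate(image_label_pairs):
        datalist.append({
            "image": img,
            "label": lbl,
            "fold": idx % num_folds,  # simple round-robin fold assignment
        })
    return {"training": datalist}

# Placeholder file names mirroring the imagesTr/labelsTr layout from step 1.
pairs = [(f"imagesTr/case_{i:03d}.nii.gz", f"labelsTr/case_{i:03d}.nii.gz")
         for i in range(10)]
datalist = build_datalist(pairs)

# Persist next to the dataset, mirroring dataset_task11.json from step 2.
with open("dataset_task11.json", "w") as f:
    json.dump(datalist, f, indent=2)
```

If any entry in the real dataset_task11.json lacks one of the keys the training transforms index (e.g. "label"), the dictionary transforms will raise exactly the kind of KeyError seen in the traceback above.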

Expected behavior The training script should load the datasets and start training on the dataset.

Environment

Printing MONAI config...
MONAI version: 1.2.0
Numpy version: 1.23.2
Pytorch version: 2.0.1+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: c33f1ba588ee00229a309000e888f9817b4f1934
MONAI __file__: /home/paperspace/.local/lib/python3.9/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.1.0
scikit-image version: 0.19.3
Pillow version: 9.2.0
Tensorboard version: 2.9.1
gdown version: 4.5.1
TorchVision version: 0.15.2+cu117
tqdm version: 4.64.1
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: 5.9.4
pandas version: 1.4.4
einops version: 0.7.0
transformers version: 4.21.3
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
pynrrd version: NOT INSTALLED or UNKNOWN VERSION.

For details about installing the optional dependencies, please visit: https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

Printing system config...
System: Linux
Linux version: Ubuntu 20.04.6 LTS
Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
Processor: x86_64
Machine: x86_64
Python version: 3.9.16
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: [popenfile(path='/home/paperspace/.vscode-server/data/logs/20231128T133856/remoteagent.log', fd=19, position=2219, mode='a', flags=33793), popenfile(path='/home/paperspace/.vscode-server/data/logs/20231128T133856/ptyhost.log', fd=20, position=2013, mode='a', flags=33793), popenfile(path='/home/paperspace/.vscode-server/data/logs/20231128T133856/network.log', fd=25, position=0, mode='a', flags=33793)]
Num physical CPUs: 8
Num logical CPUs: 8
Num usable CPUs: 8
CPU usage (%): [78.8, 57.6, 92.5, 72.5, 97.6, 63.2, 46.4, 43.0]
CPU freq. (MHz): 3200
Load avg. in last 1, 5, 15 mins (%): [62.5, 39.5, 16.4]
Disk usage (%): 82.6
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 44.1
Available memory (GB): 17.9
Used memory (GB): 25.3

Printing GPU config...
Num GPUs: 1
Has CUDA: True
CUDA version: 11.7
cuDNN enabled: True
cuDNN version: 8500
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
GPU 0 Name: NVIDIA RTX A6000
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 84
GPU 0 Total memory (GB): 47.5
GPU 0 CUDA capability (maj.min): 8.6

ckolluru commented 11 months ago

Downgrading the packages to MONAI 1.1.0 and PyTorch 1.13.0+cu117 lets training run for a few epochs, but it then stops with the same StopIteration exception. Possibly unrelated: the validation dice looks wrong (it reports 0.0).
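On the dice question: a val_mean_dice of exactly 0.0 after two epochs is not necessarily a bug. Dice is 2|A∩B| / (|A| + |B|), so if the network still predicts only background, the foreground overlap is empty and the score is exactly zero. A stdlib+numpy sketch of the arithmetic (this is not MONAI's DiceMetric, just the formula it implements):

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Binary Dice score: 2*|intersection| / (|pred| + |target|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

target = np.zeros((8, 8), dtype=np.uint8)
target[2:6, 2:6] = 1                 # ground-truth foreground patch
empty_pred = np.zeros_like(target)   # all-background prediction

print(dice(empty_pred, target))      # -> 0.0 (no foreground predicted)
```

If the score stays pinned at 0.0 long after the training loss has dropped, that would instead point to a label/post-processing mismatch (e.g. wrong label values or channel ordering) rather than early-training behavior.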

python train.py -fold 0 -train_num_workers 1 -interval 2 -num_samples 3 -learning_rate 1e-2 -max_epochs 10 -task_id 11 -pos_sample_num 2 -expr_name baseline -tta_val True -determinism_flag False
MONAI version: 1.1.0
Numpy version: 1.26.2
Pytorch version: 1.13.0+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: a2ec3752f54bfc3b40e7952234fbeb5452ed63e3
MONAI __file__: /home/paperspace/chaitanya/nnunet-monai/dynunet/dynunet-monai/monai-test/lib/python3.9/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.11
Nibabel version: 5.1.0
scikit-image version: 0.22.0
Pillow version: 10.1.0
Tensorboard version: 2.15.1
gdown version: 4.7.1
TorchVision version: 0.14.0+cu117
tqdm version: 4.66.1
lmdb version: 1.4.1
psutil version: 5.9.6
pandas version: 2.1.3
einops version: 0.7.0
transformers version: 4.21.3
mlflow version: 2.8.1
pynrrd version: 1.0.0

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

Using deterministic training.
Loading dataset: 100%|███████████████████████████████████████████████████████████████| 6/6 [01:07<00:00, 11.19s/it]
Loading dataset: 100%|█████████████████████████████████████████████████████████████| 23/23 [02:57<00:00,  7.71s/it]
2023-11-28 16:21:20,580 - Engine run resuming from iteration 0, epoch 0 until 10 epochs
2023-11-28 16:21:27,267 - Epoch: 1/10, Iter: 1/11 -- train_loss: 3.4716 
2023-11-28 16:21:29,121 - Epoch: 1/10, Iter: 2/11 -- train_loss: 3.2662 
2023-11-28 16:21:31,055 - Epoch: 1/10, Iter: 3/11 -- train_loss: 2.9583 
2023-11-28 16:21:34,246 - Epoch: 1/10, Iter: 4/11 -- train_loss: 2.5861 
2023-11-28 16:21:37,004 - Epoch: 1/10, Iter: 5/11 -- train_loss: 2.2946 
2023-11-28 16:21:39,601 - Epoch: 1/10, Iter: 6/11 -- train_loss: 2.0117 
2023-11-28 16:21:42,791 - Epoch: 1/10, Iter: 7/11 -- train_loss: 1.7486 
2023-11-28 16:21:45,789 - Epoch: 1/10, Iter: 8/11 -- train_loss: 1.5167 
2023-11-28 16:21:47,839 - Epoch: 1/10, Iter: 9/11 -- train_loss: 1.3580 
2023-11-28 16:21:50,349 - Epoch: 1/10, Iter: 10/11 -- train_loss: 1.2415 
2023-11-28 16:21:52,977 - Epoch: 1/10, Iter: 11/11 -- train_loss: 1.1498 
2023-11-28 16:21:52,977 - Epoch[1] Complete. Time taken: 00:00:32.346
2023-11-28 16:21:58,698 - Epoch: 2/10, Iter: 1/11 -- train_loss: 1.1017 
2023-11-28 16:22:01,570 - Epoch: 2/10, Iter: 2/11 -- train_loss: 1.0517 
2023-11-28 16:22:04,695 - Epoch: 2/10, Iter: 3/11 -- train_loss: 1.0439 
2023-11-28 16:22:07,892 - Epoch: 2/10, Iter: 4/11 -- train_loss: 1.0185 
2023-11-28 16:22:11,334 - Epoch: 2/10, Iter: 5/11 -- train_loss: 1.0074 
2023-11-28 16:22:13,932 - Epoch: 2/10, Iter: 6/11 -- train_loss: 0.9781 
2023-11-28 16:22:16,629 - Epoch: 2/10, Iter: 7/11 -- train_loss: 1.0450 
2023-11-28 16:22:19,596 - Epoch: 2/10, Iter: 8/11 -- train_loss: 1.0459 
2023-11-28 16:22:22,267 - Epoch: 2/10, Iter: 9/11 -- train_loss: 0.9988 
2023-11-28 16:22:24,623 - Epoch: 2/10, Iter: 10/11 -- train_loss: 0.9807 
2023-11-28 16:22:26,896 - Epoch: 2/10, Iter: 11/11 -- train_loss: 0.9819 
2023-11-28 16:22:26,897 - Engine run resuming from iteration 0, epoch 1 until 2 epochs
2023-11-28 16:33:43,937 - Got new best metric of val_mean_dice: 0.0
2023-11-28 16:33:43,938 - Epoch[2] Metrics -- val_mean_dice: 0.0000 
2023-11-28 16:33:43,938 - Key metric: val_mean_dice best value: 0.0 at epoch: 2
2023-11-28 16:33:44,428 - Epoch[2] Complete. Time taken: 00:11:17.470
2023-11-28 16:33:44,428 - Engine run complete. Time taken: 00:11:17.531
2023-11-28 16:33:44,514 - Epoch[2] Complete. Time taken: 00:11:51.537
2023-11-28 16:33:49,351 - Epoch: 3/10, Iter: 1/11 -- train_loss: 1.0017 
2023-11-28 16:33:51,770 - Epoch: 3/10, Iter: 2/11 -- train_loss: 1.0032 
2023-11-28 16:33:55,352 - Epoch: 3/10, Iter: 3/11 -- train_loss: 1.0049 
2023-11-28 16:33:56,569 - Current run is terminating due to exception: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/paperspace/chaitanya/nnunet-monai/dynunet/dynunet-monai/monai-test/lib/python3.9/site-packages/ignite/engine/engine.py", line 1032, in _run_once_on_dataset_as_gen
    self.state.batch = next(self._dataloader_iter)
  File "/home/paperspace/chaitanya/nnunet-monai/dynunet/dynunet-monai/monai-test/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/paperspace/chaitanya/nnunet-monai/dynunet/dynunet-monai/monai-test/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1306, in _next_data
    raise StopIteration
StopIteration
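A useful debugging step here (a stdlib-only sketch of the idea, no torch required): iterate the dataset in the main process, which is what setting num_workers=0 on the DataLoader effectively does, so the real KeyError and the index of the offending sample surface directly instead of being wrapped in a worker traceback. The sample dicts and the `transform` stand-in below are illustrative.

```python
# Toy datalist entries; the second deliberately lacks the "label" key.
samples = [
    {"image": "case_000.nii.gz", "label": "case_000.nii.gz"},
    {"image": "case_001.nii.gz"},
]

def transform(sample):
    # Stand-in for a dictionary transform that indexes both keys.
    return sample["image"], sample["label"]

bad = []
for i, s in enumerate(samples):
    try:
        transform(s)
    except KeyError as e:
        bad.append((i, str(e)))

print(bad)  # -> [(1, "'label'")]
```

With multiprocessing workers, a failure like this can also manifest as the StopIteration above, which is why the wrapped KeyError in the traceback is the more informative signal.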
KumoLiu commented 11 months ago

Hi @ckolluru, the KeyError in the DataLoader means a dictionary transform tried to access a key that doesn't exist in one of your data samples. Please make sure every entry in your datalist contains the keys the transforms expect; if a key is missing, either add it to the dictionary or change the code to use a key that does exist. Could you please check your data? Thanks!
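Concretely, a check along these lines (a hedged sketch; the key names "image"/"label" and the paths are assumptions about your task 11 setup) will flag both missing keys and broken file references in the datalist before training starts:

```python
import os

REQUIRED_KEYS = {"image", "label"}

def validate_entries(entries, root="."):
    """Return (index, reason) pairs for datalist entries that would fail."""
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
            continue
        for key in REQUIRED_KEYS:
            path = os.path.join(root, entry[key])
            if not os.path.exists(path):
                problems.append((i, f"file not found: {path}"))
    return problems

# Toy entries; in practice, load the "training" list from dataset_task11.json.
entries = [
    {"image": "imagesTr/a.nii.gz", "label": "labelsTr/a.nii.gz"},
    {"image": "imagesTr/b.nii.gz"},  # missing "label" -> the KeyError
]
print(validate_entries(entries))
```

Running this over the fold-0 split from dataset_task11.json should pinpoint which case is breaking worker 0/3 above.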