ckolluru closed this issue 11 months ago
Downgrading the packages to the following versions (MONAI version: 1.1.0, Pytorch version: 1.13.0+cu117) lets training run for a few epochs, but it still stops with the same StopIteration exception. Possibly unrelated, but the validation Dice also looks wrong: it reports 0.0 (a quick label sanity check is sketched after the traceback below).
python train.py -fold 0 -train_num_workers 1 -interval 2 -num_samples 3 -learning_rate 1e-2 -max_epochs 10 -task_id 11 -pos_sample_num 2 -expr_name baseline -tta_val True -determinism_flag False
MONAI version: 1.1.0
Numpy version: 1.26.2
Pytorch version: 1.13.0+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: a2ec3752f54bfc3b40e7952234fbeb5452ed63e3
MONAI __file__: /home/paperspace/chaitanya/nnunet-monai/dynunet/dynunet-monai/monai-test/lib/python3.9/site-packages/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: 0.4.11
Nibabel version: 5.1.0
scikit-image version: 0.22.0
Pillow version: 10.1.0
Tensorboard version: 2.15.1
gdown version: 4.7.1
TorchVision version: 0.14.0+cu117
tqdm version: 4.66.1
lmdb version: 1.4.1
psutil version: 5.9.6
pandas version: 2.1.3
einops version: 0.7.0
transformers version: 4.21.3
mlflow version: 2.8.1
pynrrd version: 1.0.0
For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
Using deterministic training.
Loading dataset: 100%|███████████████████████████████████████████████████████████████| 6/6 [01:07<00:00, 11.19s/it]
Loading dataset: 100%|█████████████████████████████████████████████████████████████| 23/23 [02:57<00:00, 7.71s/it]
2023-11-28 16:21:20,580 - Engine run resuming from iteration 0, epoch 0 until 10 epochs
2023-11-28 16:21:27,267 - Epoch: 1/10, Iter: 1/11 -- train_loss: 3.4716
2023-11-28 16:21:29,121 - Epoch: 1/10, Iter: 2/11 -- train_loss: 3.2662
2023-11-28 16:21:31,055 - Epoch: 1/10, Iter: 3/11 -- train_loss: 2.9583
2023-11-28 16:21:34,246 - Epoch: 1/10, Iter: 4/11 -- train_loss: 2.5861
2023-11-28 16:21:37,004 - Epoch: 1/10, Iter: 5/11 -- train_loss: 2.2946
2023-11-28 16:21:39,601 - Epoch: 1/10, Iter: 6/11 -- train_loss: 2.0117
2023-11-28 16:21:42,791 - Epoch: 1/10, Iter: 7/11 -- train_loss: 1.7486
2023-11-28 16:21:45,789 - Epoch: 1/10, Iter: 8/11 -- train_loss: 1.5167
2023-11-28 16:21:47,839 - Epoch: 1/10, Iter: 9/11 -- train_loss: 1.3580
2023-11-28 16:21:50,349 - Epoch: 1/10, Iter: 10/11 -- train_loss: 1.2415
2023-11-28 16:21:52,977 - Epoch: 1/10, Iter: 11/11 -- train_loss: 1.1498
2023-11-28 16:21:52,977 - Epoch[1] Complete. Time taken: 00:00:32.346
2023-11-28 16:21:58,698 - Epoch: 2/10, Iter: 1/11 -- train_loss: 1.1017
2023-11-28 16:22:01,570 - Epoch: 2/10, Iter: 2/11 -- train_loss: 1.0517
2023-11-28 16:22:04,695 - Epoch: 2/10, Iter: 3/11 -- train_loss: 1.0439
2023-11-28 16:22:07,892 - Epoch: 2/10, Iter: 4/11 -- train_loss: 1.0185
2023-11-28 16:22:11,334 - Epoch: 2/10, Iter: 5/11 -- train_loss: 1.0074
2023-11-28 16:22:13,932 - Epoch: 2/10, Iter: 6/11 -- train_loss: 0.9781
2023-11-28 16:22:16,629 - Epoch: 2/10, Iter: 7/11 -- train_loss: 1.0450
2023-11-28 16:22:19,596 - Epoch: 2/10, Iter: 8/11 -- train_loss: 1.0459
2023-11-28 16:22:22,267 - Epoch: 2/10, Iter: 9/11 -- train_loss: 0.9988
2023-11-28 16:22:24,623 - Epoch: 2/10, Iter: 10/11 -- train_loss: 0.9807
2023-11-28 16:22:26,896 - Epoch: 2/10, Iter: 11/11 -- train_loss: 0.9819
2023-11-28 16:22:26,897 - Engine run resuming from iteration 0, epoch 1 until 2 epochs
2023-11-28 16:33:43,937 - Got new best metric of val_mean_dice: 0.0
2023-11-28 16:33:43,938 - Epoch[2] Metrics -- val_mean_dice: 0.0000
2023-11-28 16:33:43,938 - Key metric: val_mean_dice best value: 0.0 at epoch: 2
2023-11-28 16:33:44,428 - Epoch[2] Complete. Time taken: 00:11:17.470
2023-11-28 16:33:44,428 - Engine run complete. Time taken: 00:11:17.531
2023-11-28 16:33:44,514 - Epoch[2] Complete. Time taken: 00:11:51.537
2023-11-28 16:33:49,351 - Epoch: 3/10, Iter: 1/11 -- train_loss: 1.0017
2023-11-28 16:33:51,770 - Epoch: 3/10, Iter: 2/11 -- train_loss: 1.0032
2023-11-28 16:33:55,352 - Epoch: 3/10, Iter: 3/11 -- train_loss: 1.0049
2023-11-28 16:33:56,569 - Current run is terminating due to exception: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/paperspace/chaitanya/nnunet-monai/dynunet/dynunet-monai/monai-test/lib/python3.9/site-packages/ignite/engine/engine.py", line 1032, in _run_once_on_dataset_as_gen
self.state.batch = next(self._dataloader_iter)
File "/home/paperspace/chaitanya/nnunet-monai/dynunet/dynunet-monai/monai-test/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/home/paperspace/chaitanya/nnunet-monai/dynunet/dynunet-monai/monai-test/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1306, in _next_data
raise StopIteration
StopIteration
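In case it helps with the zero val_mean_dice, here is the kind of label sanity check I would try first (a sketch only; the transforms and paths below are placeholders, not taken from train.py). If the label volumes read back with no foreground, the 0.0 Dice would be expected.

```python
# Minimal label sanity check (a sketch; the paths are placeholders for this dataset).
import numpy as np
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

check = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])

sample = check({
    "image": "imagesTr/case_000.nii.gz",
    "label": "labelsTr/case_000.nii.gz",
})

label = np.asarray(sample["label"])
print("label shape:", label.shape)
# Only 0 here would mean the labels carry no foreground voxels.
print("unique label values:", np.unique(label))
```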
Hi @ckolluru, for the KeyError in the DataLoader, you need to ensure that the key you're trying to access in the data dictionary actually exists. If the key is missing, either add it to the dictionary or change the code to use a key that does exist. Could you please check your data? Thanks!
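For example, a scan like the following can flag datalist entries with missing keys before training starts (a sketch; the filename `dataset_task11.json` and the `image`/`label` key names are assumptions based on the decathlon-style layout, so please adapt them to whatever datalist your task actually loads):

```python
# Scan a decathlon-style datalist for entries with missing keys
# (a sketch; the filename below is an assumption).
import json

with open("dataset_task11.json") as f:
    datalist = json.load(f)

for section in ("training", "validation"):
    for i, item in enumerate(datalist.get(section, [])):
        if not isinstance(item, dict):
            continue  # skip entries that are plain path strings
        missing = {"image", "label"} - set(item)
        if missing:
            print(f"{section}[{i}] is missing keys {missing}: {item}")
```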
Describe the bug I'm trying to run the DynUNet pipeline from the tutorials on a custom CT dataset. I've set it up as task 11, in a format similar to the Medical Segmentation Decathlon datasets. A StopIteration exception is raised. Could you let me know if I'm missing a step in the training pipeline (described below) or passing incorrect arguments to the train.py script? Thanks.
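The layout I'm mimicking looks roughly like the sketch below (every name, path, and label here is a placeholder, not the actual task 11 data):

```python
# Sketch of a decathlon-style dataset.json for a custom task
# (all names, paths, and labels are placeholders).
import json

dataset = {
    "name": "Task11_Custom",
    "tensorImageSize": "3D",
    "modality": {"0": "CT"},
    "labels": {"0": "background", "1": "foreground"},
    "numTraining": 2,
    "numTest": 1,
    "training": [
        {"image": "./imagesTr/case_000.nii.gz", "label": "./labelsTr/case_000.nii.gz"},
        {"image": "./imagesTr/case_001.nii.gz", "label": "./labelsTr/case_001.nii.gz"},
    ],
    "test": ["./imagesTs/case_002.nii.gz"],
}

with open("dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```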
To Reproduce Steps to reproduce the behavior:
Expected behavior The training script should load the dataset and start training.
Environment
Printing MONAI config..
MONAI version: 1.2.0
Numpy version: 1.23.2
Pytorch version: 2.0.1+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: c33f1ba588ee00229a309000e888f9817b4f1934
MONAI __file__: /home/paperspace/.local/lib/python3.9/site-packages/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.1.0
scikit-image version: 0.19.3
Pillow version: 9.2.0
Tensorboard version: 2.9.1
gdown version: 4.5.1
TorchVision version: 0.15.2+cu117
tqdm version: 4.64.1
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: 5.9.4
pandas version: 1.4.4
einops version: 0.7.0
transformers version: 4.21.3
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
pynrrd version: NOT INSTALLED or UNKNOWN VERSION.
For details about installing the optional dependencies, please visit: https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
Printing system config..
System: Linux
Linux version: Ubuntu 20.04.6 LTS
Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
Processor: x86_64
Machine: x86_64
Python version: 3.9.16
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: [popenfile(path='/home/paperspace/.vscode-server/data/logs/20231128T133856/remoteagent.log', fd=19, position=2219, mode='a', flags=33793), popenfile(path='/home/paperspace/.vscode-server/data/logs/20231128T133856/ptyhost.log', fd=20, position=2013, mode='a', flags=33793), popenfile(path='/home/paperspace/.vscode-server/data/logs/20231128T133856/network.log', fd=25, position=0, mode='a', flags=33793)]
Num physical CPUs: 8
Num logical CPUs: 8
Num usable CPUs: 8
CPU usage (%): [78.8, 57.6, 92.5, 72.5, 97.6, 63.2, 46.4, 43.0]
CPU freq. (MHz): 3200
Load avg. in last 1, 5, 15 mins (%): [62.5, 39.5, 16.4]
Disk usage (%): 82.6
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 44.1
Available memory (GB): 17.9
Used memory (GB): 25.3
Printing GPU config..
Num GPUs: 1
Has CUDA: True
CUDA version: 11.7
cuDNN enabled: True
cuDNN version: 8500
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
GPU 0 Name: NVIDIA RTX A6000
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 84
GPU 0 Total memory (GB): 47.5
GPU 0 CUDA capability (maj.min): 8.6