ellisdg / 3DUnetCNN

Pytorch 3D U-Net Convolution Neural Network (CNN) designed for medical image segmentation
MIT License

Runtime error #335

Open 1alebra2 opened 9 months ago

1alebra2 commented 9 months ago

Thanks for rewriting the program; the data input is much better now. The first version was my workhorse over the last few years. I installed the new version and started the program as suggested. After some time I get the following message:

Validation: [71/73] Time 0.352 ( 0.359) Loss 5.7630e-01 (3.0972e-01)
Validation: [72/73] Time 0.353 ( 0.358) Loss 3.5241e-01 (3.1031e-01)
Validation: [73/73] Time 0.361 ( 0.359) Loss 7.5337e-01 (3.1638e-01)
Epoch: [4][ 1/296] Time 1.281 ( 1.281) Data 0.479 ( 0.479) Loss 1.9667e-01 (1.9667e-01)
Epoch: [4][ 2/296] Time 1.934 ( 1.607) Data 0.033 ( 0.256) Loss 2.9035e-01 (2.4351e-01)
  File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 177, in <module>
    main()
  File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 173, in main
    run(config_filename, output_dir, namespace)
  File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 76, in run
    run(_config_filename, work_dir, namespace)
  File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 131, in run
    run_training(model=model.train(), optimizer=optimizer, criterion=criterion,
  File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/train/train.py", line 55, in run_training
    losses.append(epoch_training(training_loader, model, criterion, optimizer=optimizer, epoch=epoch,
  File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/train/training_utils.py", line 40, in epoch_training
    for i, item in enumerate(train_loader):
  File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
    data = self._next_data()
  File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data
    idx, data = self._get_data()
  File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1296, in _get_data
    success, data = self._try_get_data()
  File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home//anaconda3/envs/ellise_new/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 297, in rebuild_storage_fd
    fd = df.detach()
  File "/home//anaconda3/envs/ellise_new/lib/python3.10/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home//anaconda3/envs/ellise_new/lib/python3.10/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/home//anaconda3/envs/ellise_new/lib/python3.10/multiprocessing/reduction.py", line 164, in recvfds
    raise RuntimeError('received %d items of ancdata' %

I tried another run with less data. After some time the following popped up:

Epoch: [250][7/8] Time 1.849 ( 1.829) Data 0.008 ( 0.107) Loss 1.1240e-01 (1.4461e-01)
Epoch: [250][8/8] Time 1.847 ( 1.831) Data 0.009 ( 0.095) Loss 1.1719e-01 (1.4118e-01)
Validation: [1/2] Time 0.715 ( 0.715) Loss 1.6420e-01 (1.6420e-01)
Validation: [2/2] Time 0.355 ( 0.535) Loss 2.2418e-01 (1.9419e-01)
2024-02-16 12:19:10,167 - root - DEBUG - Could not find value for key 'validation'; default to {}
2024-02-16 12:19:10,167 - root - DEBUG - Found value '1' for key 'validation_batch_size'
2024-02-16 12:19:10,167 - root - DEBUG - Could not find value for key 'prefetch_factor'; default to None
2024-02-16 12:19:10,167 - root - INFO - Found inference filenames: bratsvalidation (n=10)
2024-02-16 12:19:10,167 - root - DEBUG - Found value '12' for key 'n_workers'
2024-02-16 12:19:10,167 - root - DEBUG - Found value 'False' for key 'pin_memory'
Traceback (most recent call last):
  File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 177, in <module>
    main()
  File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 173, in main
    run(config_filename, output_dir, namespace)
  File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 76, in run
    run(_config_filename, work_dir, namespace)
  File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 149, in run
    for _dataloader, _name in build_inference_loaders_from_config(config,
  File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/script_utils.py", line 168, in build_inference_loaders_from_config
    inference_dataloaders.append([build_inference_loader(filenames=config[key],
  File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/script_utils.py", line 189, in build_inference_loader
    _loader = DataLoader(_dataset,
  File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/monai/data/dataloader.py", line 106, in __init__
    super().__init__(dataset=dataset, num_workers=num_workers, **kwargs)
  File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 232, in __init__
    assert prefetch_factor > 0
TypeError: '>' not supported between instances of 'NoneType' and 'int'

The starting command line was: python /home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py --nthreads 12 --config_filename test_brats2020_config.json --output_dir ./out

This seems to be related not to the "net software" itself but to the Python version. I have 3.10 in this conda env. Which version are you using? Any other ideas? Thanks for the help, Alex
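
A note on the first traceback (the RuntimeError about "items of ancdata"): in other PyTorch setups this message from a multi-worker DataLoader has often been linked to the per-process limit on open file descriptors, since the traceback shows worker results being received as file descriptors (rebuild_storage_fd, recvfds). That cause is not confirmed for this run. A minimal sketch, assuming a Linux system; the helper name raise_open_file_limit and the target of 4096 are made up for illustration:

```python
import resource


def raise_open_file_limit(target=4096):
    """Raise the soft limit on open files toward `target`, capped by the hard limit.

    Hypothetical helper for illustration; it is not part of 3DUnetCNN.
    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        # Already unlimited, nothing to do.
        return soft, hard
    # An unprivileged process may raise its soft limit up to the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


print(raise_open_file_limit())
```

Something like this would have to run before the data loaders are built. Another option that is commonly tried for the same message is torch.multiprocessing.set_sharing_strategy("file_system"), which avoids passing file descriptors between processes.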

ellisdg commented 9 months ago

It sounds like you found a solution. What was the issue?

I'm glad to hear this repository has been useful for you. David

1alebra2 commented 9 months ago

Thanks for your answer. I added some obscure lines to train.py that helped. I will send them as soon as I know more. Yes, the program is running, but unfortunately there are other things I do not understand. Everything below is based on the following command line: python /home/abra/aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py --nthreads 12 --config_filename test_brats2020_config.json --output_dir ./out. The dataset is BraTS 2020, reduced to 10 training and 10 validation sets. The test_brats2020_config.json file was generated with create_config_lowmem.ipynb.

  1. When I change the number of folds to 1, e.g. in the brats2020.json file, the reader goes back to the default of 5 because it claims not to find any n_folds. So this is obviously not read properly.
  2. In the bratsvalidation folder of the folds there are only single MRI sequences of the respective cases, not 4 as I would expect?
  3. I tried to do predictions with the command line you proposed in the issue with the Chinese text. I do not find a config.json file within the fold folders. In the main directory of the fold folders I only have files like fold1.json, folds2.json. When I use them with the following command "python /home/abra/aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/predict.py --output_directory test_seg --config_filename /home/abra/aktuelle_NETS/3DUnetCNN_new/examples/brats2020/out/test_brats2020_config/fold5.json --model_filename /home/abra/aktuelle_NETS/3DUnetCNN_new/examples/brats2020/out/test_brats2020_config/fold5/model_best.pth --group validation", this does something, but it uses only the FLAIR sequence, which is in the bratsvalidation folder. The result is a weird 64 MB file that reminds me of a probability density calculation, not the usual segmentation file.

I changed the default prefetch_factor to 2, because lib/python3.10/site-packages/torch/utils/data/dataloader.py asserts prefetch_factor > 0 and fails when the value is None. This seems to do the job.
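
The workaround above can be sketched without touching the installed torch package. This is a minimal sketch, not the repository's code: build_loader_kwargs is a hypothetical helper, and the config keys are the ones that appear in the debug log (n_workers, pin_memory, prefetch_factor, validation_batch_size). It assumes the torch version in the traceback needs an integer prefetch_factor whenever worker processes are used.

```python
def build_loader_kwargs(config, default_prefetch=2):
    """Collect DataLoader keyword arguments from a config dict.

    Hypothetical helper for illustration. prefetch_factor is only passed
    when worker processes are used, and None is replaced by an integer so
    that `assert prefetch_factor > 0` cannot fail on a None value.
    """
    n_workers = config.get("n_workers", 0)
    kwargs = {
        "batch_size": config.get("validation_batch_size", 1),
        "num_workers": n_workers,
        "pin_memory": config.get("pin_memory", False),
    }
    prefetch = config.get("prefetch_factor")
    if n_workers > 0:
        kwargs["prefetch_factor"] = default_prefetch if prefetch is None else prefetch
    return kwargs


# Values taken from the log above; prefetch_factor is missing from the config.
print(build_loader_kwargs({"n_workers": 12, "pin_memory": False, "validation_batch_size": 1}))
```

The resulting dictionary could then be unpacked into the loader call, e.g. DataLoader(dataset, **kwargs). Leaving prefetch_factor out when n_workers is 0 is deliberate, since the option only applies to multi-process loading.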

I would be happy if I could get that up and running.

ellisdg commented 9 months ago

  1. When I change the number of folds to 1, e.g. in the brats2020.json file, the reader goes back to the default of 5 because it claims not to find any n_folds. So this is obviously not read properly.

There was a bug in the example configuration file and the example jupyter notebook: they used "folds" instead of "n_folds", so the number of folds was not being set properly. PR #337 fixed this issue for me. That being said, using "n_folds" of 1 causes a different error. So if you do not want to run cross-validation, just take the "cross_validation" key out of the configuration file.
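
Taking the key out can also be done programmatically. A minimal sketch, assuming "cross_validation" sits at the top level of the configuration and holds "n_folds"; that layout, the helper name drop_cross_validation, and the "n_epochs" entry are assumptions for illustration only:

```python
import json


def drop_cross_validation(config):
    """Return a copy of the config without the "cross_validation" key."""
    cleaned = dict(config)  # shallow copy, the input dict is left untouched
    cleaned.pop("cross_validation", None)  # no error if the key is absent
    return cleaned


# Made-up example config; a real one would be read with json.load().
example = {"cross_validation": {"n_folds": 5}, "n_epochs": 250}
print(json.dumps(drop_cross_validation(example), indent=2))
```

The cleaned dictionary would then be written back with json.dump and passed to train.py via --config_filename.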

  2. In the bratsvalidation folder of the folds there are only single MRI sequences of the respective cases, not 4 as I would expect?

The input and output images are 4D with the channels in the last image dimension. If you load the image using nibabel and print the shape for the brats2020 example the shape will be (128, 128, 128, 4), with the "4" referring to the 4 input images.
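
The shape check can be sketched as follows. The two helper names are made up for illustration; nibabel's load() and the image's shape attribute are the only library calls used, and the import sits inside the function so the shape helper also works where nibabel is not installed.

```python
def split_channel_axis(shape):
    """Split a channels-last shape into (spatial_shape, n_channels)."""
    if len(shape) == 4:
        return tuple(shape[:3]), shape[3]
    # A 3D image has no channel axis, so treat it as a single channel.
    return tuple(shape), 1


def image_shape(filename):
    """Read only the shape of a NIfTI file from disk."""
    import nibabel as nib  # imported lazily, see the note above

    return nib.load(filename).shape


# Shape reported above for the brats2020 example inputs.
print(split_channel_axis((128, 128, 128, 4)))  # ((128, 128, 128), 4)
```

Running image_shape on one of the files in the bratsvalidation folder should show whether it really holds a single sequence or all four stacked in the last dimension.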

1alebra2 commented 9 months ago

Dear Ellis, thank you very much for your help, which I do not take for granted.

The runtime error seems to be a GPU memory issue: the memory is sometimes just enough and sometimes not (2080 Ti). I will check all that and keep you posted.

1alebra2 commented 9 months ago

excuse the typos :)