Open 1alebra2 opened 9 months ago
It sounds like you found a solution. What was the issue?
I'm glad to hear this repository has been useful for you.

David
Thanks for your answer. I added some obscure lines to train.py that helped; I will send them as soon as I know more. Yes, the program is running, but unfortunately there are other things I do not understand. All of the below is based on using the following command line:

python /home/abra/aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py --nthreads 12 --config_filename test_brats2020_config.json --output_dir ./out

The dataset is BraTS 2020, but reduced to 10 training and 10 validation sets. The test_brats2020_config.json file was generated with create_config_lowmem.ipynb.
I changed the prefetch default to 2, because lib/python3.10/site-packages/torch/utils/data/dataloader.py was meddling around with a lower prefetch value. This seems to do the job.
- when I change the number of folds to 1, e.g. in the brats2020.json file, the reader goes back to the default of 5 because it claims it cannot find any n_folds. So this is obviously not being read properly.
There was a bug in the example configuration file and the example Jupyter notebook that used "folds" instead of "n_folds", so the number of folds was not being set properly. PR #337 fixed this issue for me. That being said, using an "n_folds" of 1 causes a different error. So if you do not want to run cross-validation, just take the "cross_validation" key out of the configuration file.
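For reference, this is roughly what the relevant fragment of the configuration file might look like after the PR #337 fix (only the "cross_validation" and "n_folds" key names are confirmed above; the surrounding structure is illustrative):

```json
{
  "cross_validation": {
    "n_folds": 5
  }
}
```

To skip cross-validation entirely, delete the whole "cross_validation" key rather than setting "n_folds" to 1.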
- in the bratsvalidation folder of the folds there are only single MRI sequences for the respective cases, not 4 as I would expect?
The input and output images are 4D with the channels in the last image dimension. If you load the image using nibabel and print the shape for the brats2020 example the shape will be (128, 128, 128, 4), with the "4" referring to the 4 input images.
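To make the channels-last layout concrete, here is a small sketch with a synthetic array standing in for a loaded NIfTI volume (with nibabel you would get the same tuple from the loaded image's shape; treating channel 0 as "one modality" is just for illustration):

```python
import numpy as np

# Synthetic stand-in for one brats2020 example volume:
# the 4 input modalities live in the LAST dimension (channels-last).
image = np.zeros((128, 128, 128, 4), dtype=np.float32)
print(image.shape)  # (128, 128, 128, 4)

# Slicing out one channel gives a single 3D volume, which is why a
# quick look at the data can appear to show only "one" sequence.
first_modality = image[..., 0]
print(first_modality.shape)  # (128, 128, 128)
```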
Dear Ellis, thank you very much for your help, which I do not take for granted.
The runtime error seems to be an issue of GPU memory, which sometimes is just enough and sometimes not (2080 Ti). I will check all that and keep you posted.
excuse the typos :)
Thanks for rewriting the program, which has a much better data input; the first version was my workhorse over the last years. I installed the new version and started the program as suggested. After some time I get the following message:

Validation: [71/73] Time 0.352 ( 0.359) Loss 5.7630e-01 (3.0972e-01)
Validation: [72/73] Time 0.353 ( 0.358) Loss 3.5241e-01 (3.1031e-01)
Validation: [73/73] Time 0.361 ( 0.359) Loss 7.5337e-01 (3.1638e-01)
Epoch: [4][ 1/296] Time 1.281 ( 1.281) Data 0.479 ( 0.479) Loss 1.9667e-01 (1.9667e-01)
Epoch: [4][ 2/296] Time 1.934 ( 1.607) Data 0.033 ( 0.256) Loss 2.9035e-01 (2.4351e-01)

Traceback (most recent call last):
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 177, in <module>
main()
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 173, in main
run(config_filename, output_dir, namespace)
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 76, in run
run(_config_filename, work_dir, namespace)
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 131, in run
run_training(model=model.train(), optimizer=optimizer, criterion=criterion,
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/train/train.py", line 55, in run_training
losses.append(epoch_training(training_loader, model, criterion, optimizer=optimizer, epoch=epoch,
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/train/training_utils.py", line 40, in epoch_training
for i, item in enumerate(train_loader):
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
data = self._next_data()
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data
idx, data = self._get_data()
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1296, in _get_data
success, data = self._try_get_data()
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home//anaconda3/envs/ellise_new/lib/python3.10/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 297, in rebuild_storage_fd
fd = df.detach()
File "/home//anaconda3/envs/ellise_new/lib/python3.10/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home//anaconda3/envs/ellise_new/lib/python3.10/multiprocessing/reduction.py", line 189, in recv_handle
return recvfds(s, 1)[0]
File "/home//anaconda3/envs/ellise_new/lib/python3.10/multiprocessing/reduction.py", line 164, in recvfds
raise RuntimeError('received %d items of ancdata' %
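The truncated RuntimeError above ("received %d items of ancdata") from recvfds() usually means the process ran out of file descriptors while DataLoader workers were passing tensors over sockets. Two commonly suggested workarounds, sketched here on the assumption of a Linux host, are raising the soft open-file limit and switching torch's tensor-sharing strategy:

```python
import resource

def raise_open_file_limit():
    """Raise the soft RLIMIT_NOFILE up to the hard limit (Linux).

    Running out of file descriptors in DataLoader worker processes
    is a common cause of the 'received %d items of ancdata' error.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    except (ValueError, OSError):
        pass  # the OS refused; keep the old limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]

raise_open_file_limit()

# Alternatively, before creating any DataLoader:
#   import torch.multiprocessing
#   torch.multiprocessing.set_sharing_strategy("file_system")
```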
I tried another run with less data; after some time the following popped up:

Epoch: [250][7/8] Time 1.849 ( 1.829) Data 0.008 ( 0.107) Loss 1.1240e-01 (1.4461e-01)
Epoch: [250][8/8] Time 1.847 ( 1.831) Data 0.009 ( 0.095) Loss 1.1719e-01 (1.4118e-01)
Validation: [1/2] Time 0.715 ( 0.715) Loss 1.6420e-01 (1.6420e-01)
Validation: [2/2] Time 0.355 ( 0.535) Loss 2.2418e-01 (1.9419e-01)
2024-02-16 12:19:10,167 - root - DEBUG - Could not find value for key 'validation'; default to {}
2024-02-16 12:19:10,167 - root - DEBUG - Found value '1' for key 'validation_batch_size'
2024-02-16 12:19:10,167 - root - DEBUG - Could not find value for key 'prefetch_factor'; default to None
2024-02-16 12:19:10,167 - root - INFO - Found inference filenames: bratsvalidation (n=10)
2024-02-16 12:19:10,167 - root - DEBUG - Found value '12' for key 'n_workers'
2024-02-16 12:19:10,167 - root - DEBUG - Found value 'False' for key 'pin_memory'

Traceback (most recent call last):
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 177, in <module>
main()
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 173, in main
run(config_filename, output_dir, namespace)
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 76, in run
run(_config_filename, work_dir, namespace)
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 149, in run
for _dataloader, _name in build_inference_loaders_from_config(config,
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/script_utils.py", line 168, in build_inference_loaders_from_config
inference_dataloaders.append([build_inference_loader(filenames=config[key],
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/script_utils.py", line 189, in build_inference_loader
_loader = DataLoader(_dataset,
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/monai/data/dataloader.py", line 106, in __init__
super().__init__(dataset=dataset, num_workers=num_workers, **kwargs)
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 232, in __init__
assert prefetch_factor > 0
TypeError: '>' not supported between instances of 'NoneType' and 'int'
The starting command line was: python /home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py --nthreads 12 --config_filename test_brats2020_config.json --output_dir ./out
This seems not to be related to the net software itself but to the Python version. I have 3.10 in this conda env. What is the one you are using? Any other ideas? Thanks for the help. Alex
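Regarding the TypeError: in the torch version shown in the traceback, DataLoader asserts prefetch_factor > 0 whenever the argument is passed, so forwarding the unset config value None triggers exactly this failure. A hypothetical guard (the helper name and the fallback behavior are assumptions, not the repository's actual code) is to drop the key entirely when it is None and let torch apply its own default:

```python
def dataloader_kwargs(num_workers, prefetch_factor=None):
    """Build keyword arguments for torch's DataLoader.

    Hypothetical helper: older torch versions assert
    `prefetch_factor > 0` whenever the argument is given, so a
    config value of None must be omitted rather than forwarded.
    """
    kwargs = {"num_workers": num_workers}
    if prefetch_factor is not None:
        kwargs["prefetch_factor"] = prefetch_factor  # e.g. 2
    return kwargs

print(dataloader_kwargs(12))                     # {'num_workers': 12}
print(dataloader_kwargs(12, prefetch_factor=2))  # {'num_workers': 12, 'prefetch_factor': 2}
```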