MIC-DKFZ / nnUNet

Apache License 2.0

RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message #2162

Closed reza-akbari-movahed closed 3 weeks ago

reza-akbari-movahed commented 4 months ago

Hi,

I hope you are well. I am trying to run the nnUNet model. I could run it on Google without any problem. However, when I try to run it on a cluster, I get the "RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message" error.

To train the model on the cluster, I run the following commands in sequence in the cluster's Linux terminal:

conda create -n nnunet python=3.10
conda activate nnunet
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
git clone https://github.com/MIC-DKFZ/nnUNet.git
cd nnUNet
pip install -e .
pip install --upgrade git+https://github.com/FabianIsensee/hiddenlayer.git
export nnUNet_raw="/nfs/primary/NNUNET Test 1/nnUNet_raw"
export nnUNet_preprocessed="/nfs/primary/NNUNET Test 1/nnUNet_preprocessed"
export nnUNet_results="/nfs/primary/NNUNET Test 1/nnUNet_trained_models"
export nnUNet_n_proc_DA=1
python "Dataset027_ACDC.py" -i "ACDC dataset/ACDC/database"
nnUNetv2_plan_and_preprocess -d 027 --verify_dataset_integrity --verbose -c 2d -np 1
nnUNetv2_train Dataset027_ACDC 2d 4 --npz -device cuda

When I run the nnUNetv2_train command, I get the error below:

############################ INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md ############################

Using device: cuda:0

####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. #######################################################################

2024-05-06 14:02:06.735503: do_dummy_2d_data_aug: False
2024-05-06 14:02:06.807714: Using splits from existing split file: /nfs/primary/NNUNET Test 1/nnUNet_preprocessed/Dataset027_ACDC/splits_final.json
2024-05-06 14:02:06.808884: The split file contains 5 splits.
2024-05-06 14:02:06.809508: Desired fold for training: 4
2024-05-06 14:02:06.810055: This split has 160 training and 40 validation cases.
using pin_memory on device 0
using pin_memory on device 0
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
  File "/opt/miniconda3/envs/nnunet/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/miniconda3/envs/nnunet/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/opt/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Traceback (most recent call last):
  File "/opt/miniconda3/envs/nnunet/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/nfs/primary/NNUNET Test 1/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/nfs/primary/NNUNET Test 1/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "/nfs/primary/NNUNET Test 1/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1338, in run_training
    self.on_train_start()
  File "/nfs/primary/NNUNET Test 1/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 882, in on_train_start
    self.dataloader_train, self.dataloader_val = self.get_dataloaders()
  File "/nfs/primary/NNUNET Test 1/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 676, in get_dataloaders
    _ = next(mt_gen_val)
  File "/opt/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "/opt/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):
  [same call stack as Thread-2, cut off in the paste at multiprocessing/queues.py, line 122, in get]

Re-running nnUNetv2_train Dataset027_ACDC 2d 4 --npz -device cuda from the same shell printed the same INFO banner, citation notice and tracebacks again, this time cut off at torch/multiprocessing/reductions.py, line 495, in rebuild_storage_fd.

If possible, please help me with how I can solve this issue. I have tested other workarounds, such as setting os.environ['OMP_NUM_THREADS'] = "1" after import os, but they do not work.

seziegler commented 4 months ago

Hi @reza-akbari-movahed , according to the warning it seems like the plans you are using are not the ones you created earlier with the commands you're showing. Could it be that there are already some other plans present in the preprocessed folder that are not overwritten when using the nnUNetv2_plan_and_preprocess command? Another possibility is that you are running out of RAM during training, which usually leads to workers crashing. Can you check that?

Best, Sebastian

reza-akbari-movahed commented 4 months ago

Hi @seziegler. Thank you for your response. No; whenever I want to run nnUNetv2_plan_and_preprocess and nnUNetv2_train, I delete the previous files in the nnUNet_raw and nnUNet_preprocessed folders, and the problem still exists. The details of my cluster's RAM are provided below:

free -h
              total        used        free      shared  buff/cache   available
Mem:          503Gi        39Gi       293Gi        36Mi       170Gi       461Gi
Swap:            0B          0B          0B

seziegler commented 4 months ago

Hi @reza-akbari-movahed , can you monitor the RAM usage during your nnunet training?
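
For reference, one simple way to watch memory while the training runs is a sketch like the following (assuming a standard Linux shell; the 5-second interval and the log file name are arbitrary choices):

# live view of total/used/free memory, refreshed every 5 seconds
watch -n 5 free -h

# or append samples to a file so the values are still there after a crash
while true; do date; free -h; sleep 5; done >> ram_usage.log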

reza-akbari-movahed commented 4 months ago

Yeah, I checked that. When I run nnUNetv2_train, the used RAM reached 32538.9 out of 384453.8 free. What is the minimum amount of RAM required for training an nnU-Net model?

The text below is the error I get:

############################ INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md ############################

Using device: cuda:0

####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. #######################################################################

2024-05-07 12:53:37.255147: do_dummy_2d_data_aug: False
2024-05-07 12:53:37.258357: Using splits from existing split file: /nfs/primary/NNUNET Test 1/nnUNet_preprocessed/Dataset027_ACDC/splits_final.json
2024-05-07 12:53:37.260889: The split file contains 5 splits.
2024-05-07 12:53:37.262305: Desired fold for training: 4
2024-05-07 12:53:37.263711: This split has 160 training and 40 validation cases.
using pin_memory on device 0
Traceback (most recent call last):
  File "/opt/miniconda3/envs/nnUnet_scratch/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/miniconda3/envs/nnUnet_scratch/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/miniconda3/envs/nnUnet_scratch/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 568, in reduce_storage
    fd, size = storage._share_fd_cpu_()
  File "/opt/miniconda3/envs/nnUnet_scratch/lib/python3.10/site-packages/torch/storage.py", line 304, in wrapper
    return fn(self, *args, **kwargs)
  File "/opt/miniconda3/envs/nnUnet_scratch/lib/python3.10/site-packages/torch/storage.py", line 374, in _share_fd_cpu_
    return super()._share_fd_cpu_(*args, **kwargs)
RuntimeError: unable to write to file : No space left on device (28)

[The identical "No space left on device (28)" traceback is printed many more times, once per data loading worker.]

Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):
  File "/opt/miniconda3/envs/nnUnet_scratch/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/miniconda3/envs/nnUnet_scratch/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/miniconda3/envs/nnUnet_scratch/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/opt/miniconda3/envs/nnUnet_scratch/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
using pin_memory on device 0

[More "No space left on device (28)" tracebacks follow, then the same RuntimeError is raised in Thread-2 (results_loop).]

2024-05-07 12:53:40.653679: Using torch.compile...
/opt/miniconda3/envs/nnUnet_scratch/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:28: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
  warnings.warn("The verbose parameter is deprecated. Please use get_last_lr() "

This is the configuration used by this training: Configuration name: 2d {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 56, 'patch_size': [256, 224], 'median_image_size_in_voxels': [237.0, 208.0], 'spacing': [1.5625, 1.5625], 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [False], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.unet.PlainConvUNet', 'arch_kwargs': {'n_stages': 6, 'features_per_stage': [32, 64, 128, 256, 512, 512], 'conv_op': 'torch.nn.modules.conv.Conv2d', 'kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'strides': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'n_conv_per_stage': [2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2], 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm2d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}, 'deep_supervision': True}, '_kw_requires_import': ['conv_op', 'norm_op', 'dropout_op', 'nonlin']}, 'batch_dice': True}

These are the global plan.json settings: {'dataset_name': 'Dataset027_ACDC', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [10.0, 1.5625, 1.5625], 'original_median_shape_after_transp': [9, 256, 216], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 1488.0, 'mean': 123.30044555664062, 'median': 99.0, 'min': 0.0, 'percentile_00_5': 24.0, 'percentile_99_5': 615.0, 'std': 92.96476745605469}}}

2024-05-07 12:53:47.293146: unpacking dataset...
2024-05-07 12:53:58.636277: unpacking done...
2024-05-07 12:53:58.722734: Unable to plot network architecture: nnUNet_compile is enabled!
2024-05-07 12:53:59.240499:
2024-05-07 12:53:59.242498: Epoch 0
2024-05-07 12:53:59.244107: Current learning rate: 0.01
Traceback (most recent call last):
  File "/opt/miniconda3/envs/nnUnet_scratch/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/nfs/primary/NNUNET Test 1/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/nfs/primary/NNUNET Test 1/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "/nfs/primary/NNUNET Test 1/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1346, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/opt/miniconda3/envs/nnUnet_scratch/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "/opt/miniconda3/envs/nnUnet_scratch/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

reza-akbari-movahed commented 4 months ago

This is also a screenshot of my resource monitor while nnUNetv2_train is running: [screenshot: RAM monitoring]

thangngoc89 commented 4 months ago

@reza-akbari-movahed you could try to reduce the number of background workers by changing the nnUNet_n_proc_DA environment variable. I rarely train 2d models so I have no idea about those, but for 3d models the whole system consumes less than 25 GB of system RAM.

seziegler commented 4 months ago

Hi @reza-akbari-movahed , the error message you've posted says:

RuntimeError: unable to write to file </torch_8865_3411536562_0>: No space left on device (28)

It seems like you're running out of disk space on the cluster.
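
The file PyTorch fails to write here is a shared-memory file used to pass batches between worker processes, so the full filesystem is not necessarily the main data disk; it can also be the shared-memory mount (usually /dev/shm) or the temp directory. A quick way to check all of them on a typical Linux cluster (standard tools; the exact mounts may differ on your system):

df -h /dev/shm   # shared memory used by the data loader workers
df -h /tmp       # temp directory
df -h .          # the filesystem holding the nnUNet folders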

reza-akbari-movahed commented 4 months ago

@thangngoc89 Should I declare it like the example below in my .bashrc file?

export nnUNet_n_proc_DA="1"

thangngoc89 commented 4 months ago

@reza-akbari-movahed Yes, like that if you want it to be permanent. Or just run the export command in the same shell before running nnUNetv2_train. Also, it should be at least 8 workers; 1 would be too slow. See the example below.
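
For example, using the dataset and fold from earlier in this thread (adjust the worker count to your machine):

export nnUNet_n_proc_DA=8
nnUNetv2_train Dataset027_ACDC 2d 4 --npz -device cuda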

simonansm commented 4 months ago

Hi, I have set nnUNet_n_proc_DA and still get the following error:

####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. #######################################################################

2024-05-19 04:27:07.832701: do_dummy_2d_data_aug: False
2024-05-19 04:27:07.834357: Using splits from existing split file: /Users/simonansm/nnUNet/nnUNetFrame/DATASET/nnUNet_preprocessed/Dataset001_BrainTumour/splits_final.json
2024-05-19 04:27:07.834540: The split file contains 5 splits.
2024-05-19 04:27:07.834577: Desired fold for training: 0
2024-05-19 04:27:07.834609: This split has 387 training and 97 validation cases.
/Users/simonansm/nnUNet/nnunetv2/training/dataloading/data_loader_2d.py:107: UserWarning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1711403251597/work/aten/src/ATen/ParallelNative.cpp:228.)
  torch.set_num_threads(torch_nthreads)
[the same UserWarning is printed a second time]
Traceback (most recent call last):
  File "/opt/anaconda3/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/Users/simonansm/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/Users/simonansm/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "/Users/simonansm/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1338, in run_training
    self.on_train_start()
  File "/Users/simonansm/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 885, in on_train_start
    self.initialize()
  File "/Users/simonansm/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 217, in initialize
    ).to(self.device)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/opt/anaconda3/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/opt/anaconda3/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

Could you please help me with that? Thanks @thangngoc89

thangngoc89 commented 4 months ago

@simonansm the error is plainly visible in your log: your torch installation doesn't support CUDA.

Please see how to install torch with CUDA for your platform here https://pytorch.org/get-started/locally/
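
As a rough example, on Linux with CUDA 11.8 drivers the command generated by that page looks like the following; the index URL changes with the CUDA version, so copy the exact command the selector gives you for your platform:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# verify that the new build actually sees a CUDA device
python -c "import torch; print(torch.cuda.is_available())"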

simonansm commented 4 months ago

Your torch installation doesn't support CUDA.

@thangngoc89 (Mac M1, Anaconda, Python) I have re-installed torch with conda install pytorch::pytorch torchvision torchaudio -c pytorch from the website and still get the same error: "AssertionError: Torch not compiled with CUDA enabled".

thangngoc89 commented 4 months ago

@simonansm are you running this on Mac M1? No CUDA there for you. You could try

import torch

torch.cuda.is_available()

simonansm commented 4 months ago

@simonansm are you running this on Mac M1? No CUDA there for you. You could try

import torch

torch.cuda.is_available()

@thangngoc89 I tried it and it returned False, so no CUDA on the M1. From what I have found, -device mps should be used instead, but it is not supported by the current torch, according to https://github.com/pytorch/pytorch/issues/125254.
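
If you want to double-check what your Apple-silicon build of PyTorch reports, a quick sketch is the one-liner below (torch.backends.mps is only present in recent PyTorch releases):

python -c "import torch; print('MPS built:', torch.backends.mps.is_built()); print('MPS available:', torch.backends.mps.is_available())"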

thangngoc89 commented 4 months ago

@simonansm I'm using an M2. Honestly, don't bother with training on it. You can run inference there, but training is dominated by Nvidia (and CUDA).

claudiab98 commented 4 months ago

Hello, I hope you are well. I am trying to run the nnUNet model on a Windows system (Intel(R) UHD Graphics 620, which does not support CUDA), and when I run it from Anaconda I get the "RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message" error.

To train the model in Anaconda I used the following commands. Everything works fine until I start the training. I would really appreciate it if you could help me.

(nnunet) C:\Users\claud\nnUNet\nnUNet>nnUNetv2_plan_and_preprocess -d 12 --verify_dataset_integrity
Fingerprint extraction...
Dataset012_BVSG
Using <class 'nnunetv2.imageio.natural_image_reader_writer.NaturalImage2DIO'> as reader/writer

#################### verify_dataset_integrity Done. If you didn't see any error messages then your dataset is most likely OK! ####################

Using <class 'nnunetv2.imageio.natural_image_reader_writer.NaturalImage2DIO'> as reader/writer
100%|████████████████████████| 236/236 [00:06<00:00, 38.17it/s]
Experiment planning...

############################ INFO: You are using the old nnU-Net default planner. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md ############################

2D U-Net configuration: {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 7, 'patch_size': (640, 768), 'median_image_size_in_voxels': array([631., 663.]), 'spacing': array([1., 1.]), 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [False], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.unet.PlainConvUNet', 'arch_kwargs': {'n_stages': 8, 'features_per_stage': (32, 64, 128, 256, 512, 512, 512, 512), 'conv_op': 'torch.nn.modules.conv.Conv2d', 'kernel_sizes': ((3, 3), (3, 3), (3, 3), (3, 3), (3, 3), (3, 3), (3, 3), (3, 3)), 'strides': ((1, 1), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2)), 'n_conv_per_stage': (2, 2, 2, 2, 2, 2, 2, 2), 'n_conv_per_stage_decoder': (2, 2, 2, 2, 2, 2, 2), 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm2d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}}, '_kw_requires_import': ('conv_op', 'norm_op', 'dropout_op', 'nonlin')}, 'batch_dice': True}

Using <class 'nnunetv2.imageio.natural_image_reader_writer.NaturalImage2DIO'> as reader/writer
Plans were saved to C:/Users/claud/nnUNet/nnUNet_preprocessed\Dataset012_BVSG\nnUNetPlans.json
Preprocessing...
Preprocessing dataset Dataset012_BVSG
Configuration: 2d...
100%|████████████████████████| 236/236 [00:26<00:00, 9.05it/s]
Configuration: 3d_fullres...
INFO: Configuration 3d_fullres not found in plans file nnUNetPlans.json of dataset Dataset012_BVSG. Skipping.
Configuration: 3d_lowres...
INFO: Configuration 3d_lowres not found in plans file nnUNetPlans.json of dataset Dataset012_BVSG. Skipping.

(nnunet) C:\Users\claud\nnUNet\nnUNet>nnUNetv2_train 12 2d 0 --npz

############################ INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md ############################

Using device: cuda:0
C:\Users\claud\anacondaneu\envs\nnunet\lib\site-packages\torch\amp\grad_scaler.py:131: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
  warnings.warn(

####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. #######################################################################

2024-05-19 17:39:24.014512: do_dummy_2d_data_aug: False
2024-05-19 17:39:24.023368: Creating new 5-fold cross-validation split...
2024-05-19 17:39:24.044249: Desired fold for training: 0
2024-05-19 17:39:24.051214: This split has 188 training and 48 validation cases.
Traceback (most recent call last):
  File "C:\Users\claud\anacondaneu\envs\nnunet\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\claud\anacondaneu\envs\nnunet\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\claud\anacondaneu\envs\nnunet\Scripts\nnUNetv2_train.exe\__main__.py", line 7, in <module>
  File "C:\Users\claud\nnUNet\nnUNet\nnunetv2\run\run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "C:\Users\claud\nnUNet\nnUNet\nnunetv2\run\run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "C:\Users\claud\nnUNet\nnUNet\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 1338, in run_training
    self.on_train_start()
  File "C:\Users\claud\nnUNet\nnUNet\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 885, in on_train_start
    self.initialize()
  File "C:\Users\claud\nnUNet\nnUNet\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 210, in initialize
    self.network = self.build_network_architecture(
  File "C:\Users\claud\anacondaneu\envs\nnunet\lib\site-packages\torch\nn\modules\module.py", line 1173, in to
    return self._apply(convert)
  File "C:\Users\claud\anacondaneu\envs\nnunet\lib\site-packages\torch\nn\modules\module.py", line 779, in _apply
    module._apply(fn)
  File "C:\Users\claud\anacondaneu\envs\nnunet\lib\site-packages\torch\nn\modules\module.py", line 779, in _apply
    module._apply(fn)
  File "C:\Users\claud\anacondaneu\envs\nnunet\lib\site-packages\torch\nn\modules\module.py", line 779, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "C:\Users\claud\anacondaneu\envs\nnunet\lib\site-packages\torch\nn\modules\module.py", line 804, in _apply
    param_applied = fn(param)
  File "C:\Users\claud\anacondaneu\envs\nnunet\lib\site-packages\torch\nn\modules\module.py", line 1159, in convert
    return t.to(
  File "C:\Users\claud\anacondaneu\envs\nnunet\lib\site-packages\torch\cuda\__init__.py", line 284, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
Exception in thread Thread-1 / Thread-2 (results_loop) [the two tracebacks are interleaved in the original output and are otherwise identical]:
Traceback (most recent call last):
  File "C:\Users\claud\anacondaneu\envs\nnunet\lib\threading.py", line 980, in _bootstrap_inner
    self.run()
  File "C:\Users\claud\anacondaneu\envs\nnunet\lib\threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\claud\anacondaneu\envs\nnunet\lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "C:\Users\claud\anacondaneu\envs\nnunet\lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

seziegler commented 4 months ago

Hi @claudiabuci , you need a CUDA-capable device to train on GPU. If you want to train on CPU (which takes significantly longer), you can do that by passing -device cpu to the train command, as in the example below.
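
For the dataset shown above, that would look something like this (same arguments as before, only the device changed):

nnUNetv2_train 12 2d 0 --npz -device cpu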

NastaranVB commented 3 months ago

@thangngoc89 Should I declare it like the example below in my .bashrc file?

export nnUNet_n_proc_DA="1"

Hi @reza-akbari-movahed , were you able to overcome the issue and solve the error?

HuiLin0220 commented 2 months ago

When I run testing via docker run, I get this issue: No space left on device (28); but when I run the testing directly with Python, it is fine. Can anyone kindly help me? Thank you! (num_processes_preprocessing=1, num_processes_segmentation_export=1)
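
A common cause when this happens only inside Docker is the container's default shared-memory size of 64 MB, which the PyTorch/nnU-Net worker processes quickly exhaust. A hedged sketch of a run command that raises it; the image name, mounts and nnU-Net arguments below are placeholders for your setup:

docker run --rm --gpus all --shm-size=8g \
    -v /path/to/data:/data \
    your_nnunet_image \
    nnUNetv2_predict -i /data/input -o /data/output -d 12 -c 2d

Passing --ipc=host instead of --shm-size has a similar effect.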

sahikabetul commented 2 months ago

Hi,

This will solve it. At least in my case, I could run the nnUNet training with this.