Hello @FabianIsensee,
Thank you for your work. nnU-Net has been instrumental in establishing a baseline for semantic segmentation.
The most recent version of nnunetv2 (v2.5.1) exhibits a multi-threading issue, possibly originating from the batchgenerators package. Interestingly, the previous release of nnunetv2 (v2.4.2) does NOT exhibit this error.
The following is information about the bug.
I am doing training in a Docker container built from the following Dockerfile:
FROM 763104351884.dkr.ecr.ca-central-1.amazonaws.com/pytorch-inference:2.3.0-gpu-py311-cu121-ubuntu20.04-ec2
RUN pip install nnunetv2==2.5.1
# location of nnUNet training data
WORKDIR /home/ubuntu/nnunet_data
ENTRYPOINT ["/bin/bash"]
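The image is built from the directory containing the Dockerfile with a command along these lines (the build context path below is an assumption):
# Build the training image; the nnunet:train tag matches the docker run command below.
docker build -t nnunet:train .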
I then start a container from the resulting nnunet:train image with the following command:
docker run --gpus all --ipc=host -it -v /home/ubuntu/nnunet_data:/home/ubuntu/nnunet_data nnunet:train
Inside the container, I run the following command:
nnUNetv2_train 4 3d_fullres 0
where 4 is my dataset ID, created following the Dataset Format guide. Running training produces an error message. The following is the stack trace:
############################
INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md
############################
Using device: cuda:0
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
2024-10-16 20:05:58.225141: do_dummy_2d_data_aug: True
2024-10-16 20:05:58.225521: Using splits from existing split file: /home/ubuntu/nnunet_data/preprocessed/Dataset004_FDAWholeBody/splits_final.json
2024-10-16 20:05:58.225675: The split file contains 5 splits.
2024-10-16 20:05:58.225729: Desired fold for training: 0
2024-10-16 20:05:58.225777: This split has 8 training and 2 validation cases.
using pin_memory on device 0
free(): corrupted unsorted chunks
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Traceback (most recent call last):
File "/opt/conda/bin/nnUNetv2_train", line 8, in <module>
sys.exit(run_training_entry())
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/opt/conda/lib/python3.11/site-packages/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/opt/conda/lib/python3.11/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1362, in run_training
self.on_train_start()
File "/opt/conda/lib/python3.11/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 903, in on_train_start
self.dataloader_train, self.dataloader_val = self.get_dataloaders()
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 696, in get_dataloaders
_ = next(mt_gen_train)
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
item = self.__get_next_item()
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
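For comparison, this is roughly how I confirmed that the previous release does not show the problem: rebuild the same image with only the pinned nnunetv2 version changed (the train-2.4.2 tag below is just a placeholder name).
# Rebuild with the previous release pinned; the only change to the Dockerfile
# is nnunetv2==2.5.1 -> nnunetv2==2.4.2.
docker build -t nnunet:train-2.4.2 .
docker run --gpus all --ipc=host -it -v /home/ubuntu/nnunet_data:/home/ubuntu/nnunet_data nnunet:train-2.4.2
# Inside this container, the same command trains without the worker crash:
nnUNetv2_train 4 3d_fullres 0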
I am working on creating a minimal (synthetic) dataset that exhibits this issue. Is there somewhere that I can upload this dataset so you can reproduce the bug?
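Once the synthetic dataset is ready, I could archive the raw dataset folder along these lines (the raw/ subfolder and the dataset/archive names below are placeholders based on my mount point):
# Package the minimal synthetic dataset for sharing; names are illustrative only.
cd /home/ubuntu/nnunet_data/raw
tar czf DatasetXXX_Synthetic.tar.gz DatasetXXX_Synthetic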