MIC-DKFZ / nnUNet

free(): corrupted unsorted chunks #2557

Open siavashk opened 1 month ago

siavashk commented 1 month ago

Hello @FabianIsensee,

Thank you for your work. nnU-Net has been instrumental in establishing a baseline for semantic segmentation. The most recent version of nnunetv2 (v2.5.1) exhibits a multi-threading issue, possibly originating in the batchgenerators package. Interestingly, the previous release of nnunetv2 (v2.4.2) does NOT exhibit this error.
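
For reference, here is a quick sketch of how I check which versions are actually installed inside the container (the suspicion that batchgenerators is involved is mine and is not confirmed):

import importlib.metadata

# versions installed inside the container; v2.5.1 crashes for me, v2.4.2 does not
for pkg in ("nnunetv2", "batchgenerators"):
    print(pkg, importlib.metadata.version(pkg))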

The following is information about the bug.

I am running training in a Docker container built from the following Dockerfile:

FROM 763104351884.dkr.ecr.ca-central-1.amazonaws.com/pytorch-inference:2.3.0-gpu-py311-cu121-ubuntu20.04-ec2
RUN pip install nnunetv2==2.5.1
# location of the nnUNet training data mounted into the container
WORKDIR /home/ubuntu/nnunet_data
ENTRYPOINT ["/bin/bash"]

Building this Dockerfile produces an image with nnunetv2 2.5.1 installed, tagged nnunet:train. I start the container with the following command:

docker run --gpus all --ipc=host -it -v /home/ubuntu/nnunet_data:/home/ubuntu/nnunet_data nnunet:train

Inside the docker container I run the following command:

nnUNetv2_train 4 3d_fullres 0

where 4 is the ID of my dataset, created by following the Dataset Format guide. Running training fails; the following is the full output and stack trace:

############################
INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md
############################

Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-10-16 20:05:58.225141: do_dummy_2d_data_aug: True
2024-10-16 20:05:58.225521: Using splits from existing split file: /home/ubuntu/nnunet_data/preprocessed/Dataset004_FDAWholeBody/splits_final.json
2024-10-16 20:05:58.225675: The split file contains 5 splits.
2024-10-16 20:05:58.225729: Desired fold for training: 0
2024-10-16 20:05:58.225777: This split has 8 training and 2 validation cases.
using pin_memory on device 0
free(): corrupted unsorted chunks
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Traceback (most recent call last):
  File "/opt/conda/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
             ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nnunetv2/run/run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/opt/conda/lib/python3.11/site-packages/nnunetv2/run/run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "/opt/conda/lib/python3.11/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1362, in run_training
    self.on_train_start()
  File "/opt/conda/lib/python3.11/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 903, in on_train_start
    self.dataloader_train, self.dataloader_val = self.get_dataloaders()
                                                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 696, in get_dataloaders
    _ = next(mt_gen_train)
        ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
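
In case it helps with triage, below is a minimal sketch of how I plan to rerun training with data augmentation forced onto a single thread, assuming nnUNet_n_proc_DA=0 still makes nnunetv2 fall back to single-threaded augmentation, so that the actual worker exception is printed instead of the generic "background workers are no longer alive" message:

import os
import subprocess

# force single-threaded data augmentation so the underlying error surfaces
env = dict(os.environ, nnUNet_n_proc_DA="0")
subprocess.run(["nnUNetv2_train", "4", "3d_fullres", "0"], env=env, check=True)
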
siavashk commented 1 month ago

I am working on creating a minimal (synthetic) dataset that exhibits this issue. Is there somewhere that I can upload this dataset so you can reproduce the bug?
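
In the meantime, this is roughly what I have in mind for the synthetic dataset. It is only a sketch assuming the standard nnU-Net v2 raw layout from the Dataset Format guide; the dataset name (Dataset099_Synthetic), image shape, case count and label names are placeholders:

import json
import os

import numpy as np
import SimpleITK as sitk

# hypothetical dataset name; nnUNet_raw must point at the raw data folder
base = os.path.join(os.environ["nnUNet_raw"], "Dataset099_Synthetic")
images_tr = os.path.join(base, "imagesTr")
labels_tr = os.path.join(base, "labelsTr")
os.makedirs(images_tr, exist_ok=True)
os.makedirs(labels_tr, exist_ok=True)

num_cases = 10
shape = (64, 64, 64)  # small volumes, just enough to exercise the pipeline
rng = np.random.default_rng(0)

for i in range(num_cases):
    case = f"case_{i:03d}"
    image = rng.normal(0.0, 1.0, shape).astype(np.float32)
    label = (rng.random(shape) > 0.95).astype(np.uint8)  # sparse random foreground
    sitk.WriteImage(sitk.GetImageFromArray(image),
                    os.path.join(images_tr, f"{case}_0000.nii.gz"))
    sitk.WriteImage(sitk.GetImageFromArray(label),
                    os.path.join(labels_tr, f"{case}.nii.gz"))

# minimal dataset.json as described in the Dataset Format guide
dataset_json = {
    "channel_names": {"0": "CT"},
    "labels": {"background": 0, "foreground": 1},
    "numTraining": num_cases,
    "file_ending": ".nii.gz",
}
with open(os.path.join(base, "dataset.json"), "w") as f:
    json.dump(dataset_json, f, indent=2)

After generating it I would run nnUNetv2_plan_and_preprocess on the new dataset and check whether nnUNetv2_train reproduces the crash.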

siavashk commented 1 month ago

Possible duplicate of #2523.