bowang-lab / U-Mamba

U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation
https://arxiv.org/abs/2401.04722
Apache License 2.0

backgroundWorker keeps dying at epoch 0 #56

Open AndrewForresterGit opened 1 month ago

AndrewForresterGit commented 1 month ago

I keep getting the same error at epoch 0. I've tried debugging, and the only candidate cause I can find is that dist.is_initialized() returns False. Could this be the cause? If so, how would I fix it, and if not, what else could be going wrong?
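
For context, dist.is_initialized() returning False is expected on a single-GPU run: nnU-Net v2 only creates a DDP process group when training with multiple GPUs. A quick sanity check (generic PyTorch, nothing U-Mamba-specific):

```bash
# Prints False on single-GPU runs because no DDP process group was ever
# created; that alone should not kill the background augmentation workers.
python -c "import torch.distributed as dist; print(dist.is_available() and dist.is_initialized())"
```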

2024-08-05 11:23:25.063796: unpacking dataset...
2024-08-05 11:23:35.208095: unpacking done...
2024-08-05 11:23:35.208965: do_dummy_2d_data_aug: False
2024-08-05 11:23:35.220142: Unable to plot network architecture:
2024-08-05 11:23:35.220288: No module named 'hiddenlayer'
2024-08-05 11:23:35.227630:
2024-08-05 11:23:35.227772: Epoch 0
2024-08-05 11:23:35.227940: Current learning rate: 0.01
using pin_memory on device 0
Exception in thread Thread-4 (results_loop):
Traceback (most recent call last):
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/python/3.10.13/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/python/3.10.13/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/anfor306/venvs/projet-Umamba/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/home/anfor306/venvs/projet-Umamba/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Traceback (most recent call last):
  File "/home/anfor306/venvs/projet-Umamba/bin/nnUNetv2_train", line 33, in <module>
    sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
  File "/lustre06/project/6092638/anfor306/U-Mamba/umamba/nnunetv2/run/run_training.py", line 268, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/lustre06/project/6092638/anfor306/U-Mamba/umamba/nnunetv2/run/run_training.py", line 204, in run_training
    nnunet_trainer.run_training()
  File "/lustre06/project/6092638/anfor306/U-Mamba/umamba/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1258, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/home/anfor306/venvs/projet-Umamba/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "/home/anfor306/venvs/projet-Umamba/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
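
Note that this RuntimeError only reports that a worker died; the underlying exception is printed earlier or lost. A common way to surface it in nnU-Net v2 (assuming the standard nnUNet_n_proc_DA environment variable; DATASET_ID, CONFIGURATION, and FOLD are placeholders) is to disable multiprocessing augmentation:

```bash
# With nnUNet_n_proc_DA=0, augmentation runs single-threaded in the main
# process, so the real error (OOM, CUDA failure, corrupt file, ...) is
# printed directly instead of the generic "workers are no longer alive".
nnUNet_n_proc_DA=0 nnUNetv2_train DATASET_ID CONFIGURATION FOLD -tr nnUNetTrainerUMambaEnc
```
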
AyacodeYa commented 1 month ago

Hi, I have solved a similar problem to yours, but mine was caused by the version of causal_conv1d. I fixed it with the following commands:

git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d
git checkout v1.1.1.post2
CAUSAL_CONV1D_FORCE_BUILD=TRUE pip install .
nnUNetv2_train your_dataset_ID 2d all -tr nnUNetTrainerUMambaEnc -num_gpus 1
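
If useful, you can double-check that the pinned build actually landed in the training environment before relaunching (generic pip/Python tooling, nothing U-Mamba-specific):

```bash
# Confirm the installed causal-conv1d version matches the checked-out tag
pip show causal-conv1d
# and that it imports cleanly inside the venv used for training
python -c "import causal_conv1d; print('causal_conv1d import OK')"
```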