bdzyubak / torch-control

A top-level repo for evaluating natively available models
MIT License

nnUNet Training Crashes on Random Batches #3

Closed bdzyubak closed 4 months ago

bdzyubak commented 4 months ago

Training crashes during the first epoch, during validation, or during the second epoch with an unclear message: one or more background workers are no longer alive. Batch sampling is random, so the crash is hard to reproduce. Further, data loading is parallelized, so a debugger breakpoint cannot be used to inspect the offending data. The crash happens at the following call, which could be in either data loading or the training step itself:

nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py
self.train_step(next(self.dataloader_train))

Resolve the crash, and add tools for debugging failures like this in the future.
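
One possible debugging tool, as a minimal sketch only (not the project's actual implementation): pull batches through batchgenerators' SingleThreadedAugmenter in the main process, so the real exception surfaces directly and a breakpoint can land on the failing case. The `dataloader_train` / `tr_transforms` arguments and the `'keys'` / `'data'` batch fields are assumptions based on the usual nnU-Net batch layout.

```python
# Sketch: iterate the training dataloader single-threaded so the real error
# (rather than "worker is no longer alive") reaches the debugger.
from batchgenerators.dataloading.single_threaded_augmenter import SingleThreadedAugmenter

def debug_iterate(dataloader_train, tr_transforms, num_batches=250):
    gen = SingleThreadedAugmenter(dataloader_train, tr_transforms)
    for i in range(num_batches):
        try:
            batch = next(gen)
        except Exception:
            print(f"Batch {i} raised; set a breakpoint here to inspect the loader state.")
            raise
        # Assumed nnU-Net batch layout: 'keys' lists the sampled cases, 'data'
        # is the image array. Log both so a crash can be tied to specific cases.
        print(i, batch.get('keys'), tuple(batch['data'].shape))
```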

bdzyubak commented 4 months ago

Training is not gated by data loading: one epoch takes just as long with one loading process as with 8. Unfortunately, I am still experiencing the crash with parallel loading (error trace below). Switching to non-parallel loading as the workaround, since it avoids the crash with no observed slowdown. Changing the default number of processes to 0 (which in practice means a single, in-process loader) in

run_training.py 
parser.add_argument('--num_proc', type=int, default=0,
                    help="Select the number of parallel data loading processes. Use 0 for non-parallel debug.")

Exception ignored in tp_clear of: <class 'memoryview'>
Traceback (most recent call last):
  File "D:\Programs\anaconda3\envs\nnunet\Lib\threading.py", line 1483, in current_thread
    def current_thread():
BufferError: memoryview has 1 exported buffer

Exception in thread Thread-12 (results_loop):
Traceback (most recent call last):
  File "D:\Programs\anaconda3\envs\nnunet\Lib\threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "D:\Programs\anaconda3\envs\nnunet\Lib\threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Programs\anaconda3\envs\nnunet\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "D:\Programs\anaconda3\envs\nnunet\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

Exception ignored in tp_clear of: <class 'memoryview'>
BufferError: memoryview has 1 exported buffer

Traceback (most recent call last):
  File "D:\Programs\PyCharm Community Edition 2023.3.4\plugins\python-ce\helpers\pydev\pydevd.py", line 1534, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Programs\PyCharm Community Edition 2023.3.4\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "D:\Source\torch-control\nnUNet\run_training.py", line 74, in <module>
    run_training_entry()
  File "D:\Source\torch-control\nnUNet\run_training.py", line 67, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "D:\Source\torch-control\nnUNet\nnunetv2\run\run_training_api.py", line 204, in run_training
    nnunet_trainer.run_training()
  File "D:\Source\torch-control\nnUNet\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 1276, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Programs\anaconda3\envs\nnunet\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Programs\anaconda3\envs\nnunet\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message