Closed by bdzyubak 4 months ago
Training is not gated by data loading: one epoch takes just as long with one worker as with 8. Unfortunately, I am still experiencing the crash with the error trace below. Switching to non-parallel loading as a workaround, since it causes no observed slowdown. Changing the default number of processes to 0 (which effectively means loading in the main process) in run_training.py:
parser.add_argument('--num_proc', type=int, default=0,
                    help="Select the number of parallel data loading processes. Use 0 for non-parallel debugging.")
Exception ignored in tp_clear of: <class 'memoryview'>
Traceback (most recent call last):
  File "D:\Programs\anaconda3\envs\nnunet\Lib\threading.py", line 1483, in current_thread
    def current_thread():
BufferError: memoryview has 1 exported buffer

Exception in thread Thread-12 (results_loop):
Traceback (most recent call last):
  File "D:\Programs\anaconda3\envs\nnunet\Lib\threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "D:\Programs\anaconda3\envs\nnunet\Lib\threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Programs\anaconda3\envs\nnunet\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "D:\Programs\anaconda3\envs\nnunet\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

Exception ignored in tp_clear of: <class 'memoryview'>
BufferError: memoryview has 1 exported buffer

Traceback (most recent call last):
  File "D:\Programs\PyCharm Community Edition 2023.3.4\plugins\python-ce\helpers\pydev\pydevd.py", line 1534, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "D:\Programs\PyCharm Community Edition 2023.3.4\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "D:\Source\torch-control\nnUNet\run_training.py", line 74, in <module>
    run_training_entry()
  File "D:\Source\torch-control\nnUNet\run_training.py", line 67, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "D:\Source\torch-control\nnUNet\nnunetv2\run\run_training_api.py", line 204, in run_training
    nnunet_trainer.run_training()
  File "D:\Source\torch-control\nnUNet\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 1276, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "D:\Programs\anaconda3\envs\nnunet\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "D:\Programs\anaconda3\envs\nnunet\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Training crashes during the first epoch, validation, or the second epoch with an unclear message: a background worker is no longer alive. Batch sampling is random, so the crash is tricky to reproduce. Further, data loading is parallelized, so a debug breakpoint cannot be used to see what is wrong with the data. The crash happens at this call in nnUNetTrainer.run_training(), which covers both data loading and the training step itself: `train_outputs.append(self.train_step(next(self.dataloader_train)))`.
Resolve this, and add tools to make such crashes easier to debug in the future.
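One possible tool, to make the failure diagnosable despite the parallel workers, is a thin wrapper around the dataloader that records the exception before the worker dies. This is a sketch only: LoggingDataLoaderWrapper is a hypothetical name, not existing nnUNet code, and it assumes the wrapped loader follows the batchgenerators convention of exposing generate_train_batch() (nnUNet's dataloaders do); the 'indices' attribute it logs is likewise an assumption.

```python
import os
import traceback


class LoggingDataLoaderWrapper:
    """Hypothetical debugging aid: wraps a batchgenerators-style dataloader and
    dumps the full traceback of any exception to a per-process file before
    re-raising, so the failing sample can be identified even when the crash
    happens inside a background worker where breakpoints and stdout are unavailable."""

    def __init__(self, dataloader, log_dir="."):
        self._dataloader = dataloader
        self._log_dir = log_dir

    def generate_train_batch(self):
        try:
            return self._dataloader.generate_train_batch()
        except Exception:
            # One file per worker process, since several workers may die at once.
            log_path = os.path.join(self._log_dir, f"dataloader_crash_{os.getpid()}.log")
            with open(log_path, "a") as f:
                f.write(traceback.format_exc())
                # 'indices' is an assumption about the wrapped loader; fall back to None.
                f.write(f"\nloader indices: {getattr(self._dataloader, 'indices', None)}\n")
            raise

    def __next__(self):
        return self.generate_train_batch()

    def __getattr__(self, name):
        # Delegate everything else (batch size, thread id setup, etc.) to the wrapped loader.
        return getattr(self._dataloader, name)
```

It would be applied where the dataloaders are built, e.g. `dl_tr = LoggingDataLoaderWrapper(dl_tr)`, before they are handed to the augmenter. Setting `--num_proc 0` remains the simplest way to get a breakpoint onto the failing line.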