MIC-DKFZ / nnUNet


Issue with autocast leading to problem in the background workers #2031

Closed: rooskraaijveld closed this issue 5 months ago

rooskraaijveld commented 5 months ago

Hi there!

I keep getting this error and I'm unsure how to solve it. I know other issues also address this problem, but the solutions there do not work for me (https://github.com/MIC-DKFZ/nnUNet/issues/1999). Could someone help out? Here is the full output:

```
Singularity> nnUNetv2_train 800 3d_fullres 0
Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring
method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

This is the configuration used by this training:
Configuration name: 3d_fullres
{'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [112, 112, 192], 'median_image_size_in_voxels': [512.0, 518.5, 973.5], 'spacing': [0.7939453125, 0.7835781276226044, 0.699999988079071], 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2], 'num_pool_per_axis': [4, 4, 5], 'pool_op_kernel_sizes': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [1, 1, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'unet_max_num_features': 320, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True}

These are the global plan.json settings:
{'dataset_name': 'Dataset800_TRIPPwholebody', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [0.7939453125, 0.7835781276226044, 0.699999988079071], 'original_median_shape_after_transp': [512, 512, 972], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [1, 0, 2], 'transpose_backward': [1, 0, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 3071.0, 'mean': 48.972293610094, 'median': 58.0, 'min': -1024.0, 'percentile_00_5': -485.0, 'percentile_99_5': 597.0, 'std': 125.88474284618155}}}

2024-03-22 11:59:27.459956: unpacking dataset...
2024-03-22 11:59:34.318358: unpacking done...
2024-03-22 11:59:34.323149: do_dummy_2d_data_aug: False
2024-03-22 11:59:34.330559: Using splits from existing split file: /BIP/_Roos/PhD/Data/Segmentation/nnUNet_TRIPP/nnUNet_raw_data_base/nnUNet_preprocessed/Dataset800_TRIPPwholebody/splits_final.json
2024-03-22 11:59:34.333823: The split file contains 5 splits.
2024-03-22 11:59:34.335809: Desired fold for training: 0
2024-03-22 11:59:34.337726: This split has 57 training and 15 validation cases.
2024-03-22 11:59:34.483436: Unable to plot network architecture:
2024-03-22 11:59:34.487802: No module named 'hiddenlayer'
2024-03-22 11:59:34.588846:
2024-03-22 11:59:34.595683: Epoch 0
2024-03-22 11:59:34.599586: Current learning rate: 0.01
using pin_memory on device 0
Traceback (most recent call last):
  File "/usr/local/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/usr/local/lib/python3.8/site-packages/nnunetv2/run/run_training.py", line 247, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/usr/local/lib/python3.8/site-packages/nnunetv2/run/run_training.py", line 190, in run_training
    nnunet_trainer.run_training()
  File "/usr/local/lib/python3.8/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1211, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/usr/local/lib/python3.8/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 848, in train_step
    with autocast(self.device.type, enabled=True) if self.device.type == 'cuda' else dummy_context():
TypeError: __init__() got multiple values for argument 'enabled'
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/user/roos/.local/lib/python3.8/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/user/roos/.local/lib/python3.8/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

```
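The key failure in this log is the `TypeError` raised from the `autocast(...)` call in `train_step`; the dead background workers reported afterwards are a follow-on symptom. A minimal sketch of how that kind of collision arises, using stand-in classes rather than the real PyTorch ones (this assumes an older `autocast` whose first parameter is `enabled` instead of `device_type`):

```python
# Stand-in for a torch >= 2.0-style autocast: first parameter is the device type.
class AutocastNew:
    def __init__(self, device_type, dtype=None, enabled=True, cache_enabled=None):
        self.device_type, self.enabled = device_type, enabled

# Stand-in for an older autocast-style signature: first parameter is `enabled`.
class AutocastOld:
    def __init__(self, enabled=True, dtype=None, cache_enabled=True):
        self.enabled = enabled

AutocastNew('cuda', enabled=True)      # fine: 'cuda' binds to device_type
try:
    AutocastOld('cuda', enabled=True)  # 'cuda' binds to `enabled`, then enabled=True collides with it
except TypeError as e:
    print(e)  # __init__() got multiple values for argument 'enabled'
```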

ancestor-mithril commented 5 months ago

You are using an old version of pytorch. nnUNet supports torch>=2.0.
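A quick way to confirm this, assuming the check is run in the same Python environment inside the Singularity container that launches `nnUNetv2_train` (a sketch, not an official diagnostic):

```python
# Run inside the container to see which PyTorch the training actually picks up.
import torch

print(torch.__version__)           # nnU-Net v2 expects torch >= 2.0
print(torch.cuda.is_available())   # sanity check that CUDA is still visible
```

If the reported version is below 2.0, rebuilding the container with an upgraded PyTorch (for example a `pip install "torch>=2.0"` step matching your CUDA setup) should let the `autocast(self.device.type, enabled=True)` call in the trainer run as intended.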