MIC-DKFZ / nnUNet

Apache License 2.0
5.6k stars 1.7k forks

RuntimeError #1343

Closed CoderJackZhu closed 1 year ago

CoderJackZhu commented 1 year ago

When I train this model, I always get the following error, and I don't know why or how to solve it. Thank you.

(/data/ailab/2022/ZYJ/nnunet) [stu0301@gpu03 nnUNet]$ nnUNetv2_train 137 3d_fullres 0 --npz
Using device: cuda:0

####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. #######################################################################

This is the configuration used by this training: Configuration name: 3d_fullres {'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [128, 128, 128], 'median_image_size_in_voxels': [140.0, 171.0, 137.0], 'spacing': [1.0, 1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization'], 'use_mask_for_norm': [True, True, True, True], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2], 'num_pool_per_axis': [5, 5, 5], 'pool_op_kernel_sizes': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'unet_max_num_features': 320, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': False}

These are the global plan.json settings: {'dataset_name': 'Dataset137_BraTS2021', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.0, 1.0, 1.0], 'original_median_shape_after_transp': [140, 171, 137], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 95242.25, 'mean': 871.816650390625, 'median': 407.0, 'min': 0.10992202162742615, 'percentile_00_5': 55.0, 'percentile_99_5': 5825.0, 'std': 2023.5313720703125}, '1': {'max': 1905559.25, 'mean': 1698.2144775390625, 'median': 552.0, 'min': 0.0, 'percentile_00_5': 47.0, 'percentile_99_5': 8322.0, 'std': 18787.4140625}, '2': {'max': 4438107.0, 'mean': 2141.349365234375, 'median': 738.0, 'min': 0.0, 'percentile_00_5': 110.0, 'percentile_99_5': 10396.0, 'std': 45159.37890625}, '3': {'max': 580014.3125, 'mean': 995.436279296875, 'median': 512.3143920898438, 'min': 0.0, 'percentile_00_5': 108.0, 'percentile_99_5': 11925.0, 'std': 4629.87939453125}}}

2023-03-23 22:53:42.012139: unpacking dataset... 2023-03-23 22:57:15.189137: unpacking done... 2023-03-23 22:57:15.190897: do_dummy_2d_data_aug: False 2023-03-23 22:57:15.205438: Using splits from existing split file: /data/ailab/2022/ZYJ/Dataset/nnUNet_preprocessed/Dataset137_BraTS2021/splits_final.json 2023-03-23 22:57:15.207055: The split file contains 5 splits. 2023-03-23 22:57:15.207163: Desired fold for training: 0 2023-03-23 22:57:15.207259: This split has 1000 training and 251 validation cases. 2023-03-23 22:57:15.395872: Unable to plot network architecture: 2023-03-23 22:57:15.396091: No module named 'hiddenlayer' 2023-03-23 22:57:23.167642: 2023-03-23 22:57:23.167933: Epoch 0 2023-03-23 22:57:23.168380: Current learning rate: 0.01 using pin_memory on device 0 OpenBLAS blas_thread_init: pthread_create failed for thread 20 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 21 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 22 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 23 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 24 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 25 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 26 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 27 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 28 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 29 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 30 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 31 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 32 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 33 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 34 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 35 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 36 of 64: Resource 
temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 37 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 38 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 39 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 40 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 41 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 42 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 43 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 44 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 45 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 46 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 47 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 48 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 49 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 50 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 51 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 52 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 53 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 54 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 55 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 56 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 57 of 64: Resource temporarily unavailable 
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 58 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 59 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 60 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 61 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 62 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 63 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max Process Process-21: Traceback (most recent call last): File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 41, in producer with threadpool_limits(1, None): File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/threadpoolctl.py", line 373, in init super().init(ThreadpoolController(), limits=limits, user_api=user_api) File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/threadpoolctl.py", line 166, in init self._set_threadpool_limits() File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/threadpoolctl.py", line 299, in _set_threadpool_limits lib_controller.set_num_threads(num_threads) File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/threadpoolctl.py", line 865, in set_num_threads return set_func(num_threads) KeyboardInterrupt using pin_memory on device 0 OpenBLAS blas_thread_init: pthread_create failed for thread 19 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 20 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 21 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 22 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 23 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 24 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 25 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 26 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max 
OpenBLAS blas_thread_init: pthread_create failed for thread 27 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 28 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 29 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 30 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 31 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 32 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 33 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 34 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 35 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 36 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 37 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 38 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 39 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 40 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 41 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 42 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 43 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 44 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 45 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 46 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 47 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: 
pthread_create failed for thread 48 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 49 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 50 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 51 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 52 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 53 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 54 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 55 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 56 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 57 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 58 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 59 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 60 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 61 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 62 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max OpenBLAS blas_thread_init: pthread_create failed for thread 63 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 513046 max Process Process-22: Traceback (most recent call last): File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, self._kwargs) File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 41, in producer with threadpool_limits(1, None): File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/threadpoolctl.py", line 373, in init super().init(ThreadpoolController(), limits=limits, user_api=user_api) File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/threadpoolctl.py", line 166, in init self._set_threadpool_limits() File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/threadpoolctl.py", line 299, in 
_set_threadpool_limits lib_controller.set_num_threads(num_threads) File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/threadpoolctl.py", line 865, in set_num_threads return set_func(num_threads) KeyboardInterrupt Exception in thread Thread-5: Traceback (most recent call last): File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/threading.py", line 980, in _bootstrap_inner self.run() File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/threading.py", line 917, in run self._target(*self._args, *self._kwargs) File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop raise e File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the " RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message Traceback (most recent call last): File "/data/ailab/2022/ZYJ/nnunet/bin/nnUNetv2_train", line 33, in sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')()) File "/data/ailab/2022/ZYJ/nnUNet/nnunetv2/run/run_training.py", line 247, in run_training_entry run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights, File "/data/ailab/2022/ZYJ/nnUNet/nnunetv2/run/run_training.py", line 190, in run_training nnunet_trainer.run_training() File "/data/ailab/2022/ZYJ/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1217, in run_training val_outputs.append(self.validation_step(next(self.dataloader_val))) File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in next item = self.__get_next_item() File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the " RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message Exception in thread Thread-4: Traceback (most recent call last): File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/threading.py", line 980, in _bootstrap_inner self.run() File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/threading.py", line 917, in run self._target(self._args, self._kwargs) File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop raise e File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the " RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

FabianIsensee commented 1 year ago

Hi there, interesting stuff. I have never had that problem. Let's see. Google tells me that this error message mostly appears if thread limits are exceeded but that does not appear to be your problem. Can you please try the following: OMP_NUM_THREADS=1 nnUNetv2_train 137 3d_fullres 0 --npz

Can you please also confirm that this appears on other hardware (if possible)? If you are running this on a compute cluster, it might make sense to even try an entirely different setup (such as a local workstation) to make sure it's not some configuration problem of the operating system
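For reference, a minimal diagnostic sketch (not part of nnU-Net, Linux-only) that prints the per-user process/thread limit the OpenBLAS messages above are running into; ulimit -u in the shell reports the same soft limit:

import resource

# Print the RLIMIT_NPROC values that the "RLIMIT_NPROC 4096 current, 513046 max"
# OpenBLAS messages refer to. A low soft limit combined with many worker processes,
# each trying to start a full set of BLAS threads, produces exactly these failures.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft}, hard={hard}")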

FabianIsensee commented 1 year ago

The error appears in an external library that nnU-Net (or rather batchgenerators) is using. Maybe it would help to also open an issue there and ask for advice: https://github.com/joblib/threadpoolctl
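As a diagnostic, threadpoolctl itself can report which BLAS/OpenMP runtimes are loaded and how many threads each would spawn; a minimal sketch, assuming it is run in the same environment as the training:

from threadpoolctl import threadpool_info

# List every native thread pool threadpoolctl can see. If OpenBLAS reports a
# num_threads near the machine's 64 cores, every data-loading worker will try to
# spawn that many BLAS threads, which together can exhaust the RLIMIT_NPROC of
# 4096 shown in the log above.
for module in threadpool_info():
    print(module.get("internal_api"), module.get("num_threads"), module.get("filepath"))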

CoderJackZhu commented 1 year ago

Thank you very much. After following your instruction OMP_NUM_THREADS=1 nnUNetv2_train 137 3d_fullres 0 --npz, I succeeded in solving this problem. I have not tried running this code on another computer.

FabianIsensee commented 1 year ago

OK, thanks for the feedback. It appears that we still need OMP_NUM_THREADS. I was hoping we could ignore that 🙈 On our systems, at least, it works without it. Too bad...

FabianIsensee commented 1 year ago

Can I ask you to test something for me? It doesn't take long.

CoderJackZhu commented 1 year ago

I'm willing to help you, but something seems to be wrong with my machine and it crashed; it may take until tomorrow or later. The crash may not have been caused by this program. I will let you know when the machine is back to normal.

CoderJackZhu commented 1 year ago

Is there any test I need to do? The machine seems to be back to normal.

FabianIsensee commented 1 year ago

Can you please

Thanks!

CoderJackZhu commented 1 year ago

(/data/ailab/2022/ZYJ/nnunet) [stu0301@gpu03 nnUNet]$ nnUNetv2_train 137 3d_fullres 3 --npz
Traceback (most recent call last):
  File "/data/ailab/2022/ZYJ/nnunet/bin/nnUNetv2_train", line 33, in <module>
    sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
  File "/data/ailab/2022/ZYJ/nnunet/bin/nnUNetv2_train", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/importlib/metadata.py", line 86, in load
    module = import_module(match.group('module'))
  File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "", line 1030, in _gcd_import
  File "", line 1007, in _find_and_load
  File "", line 972, in _find_and_load_unlocked
  File "", line 228, in _call_with_frames_removed
  File "", line 1030, in _gcd_import
  File "", line 1007, in _find_and_load
  File "", line 972, in _find_and_load_unlocked
  File "", line 228, in _call_with_frames_removed
  File "", line 1030, in _gcd_import
  File "", line 1007, in _find_and_load
  File "", line 986, in _find_and_load_unlocked
  File "", line 680, in _load_unlocked
  File "", line 850, in exec_module
  File "", line 228, in _call_with_frames_removed
  File "/data/ailab/2022/ZYJ/nnUNet/nnunetv2/__init__.py", line 2, in <module>
    os.environ['OMP_NUM_THREADS']=1
  File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/os.py", line 684, in __setitem__
    value = self.encodevalue(value)
  File "/data/ailab/2022/ZYJ/nnunet/lib/python3.9/os.py", line 756, in encode
    raise TypeError("str expected, not %s" % type(value).__name__)
TypeError: str expected, not int

FabianIsensee commented 1 year ago

My bad, the code I sent you is wrong. It should be:

import os
os.environ['OMP_NUM_THREADS']="1"
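As a side note, the variable has to be set before the libraries that read it (NumPy/OpenBLAS, MKL) are imported, and the value must be a string, which is exactly what the TypeError above complained about. A minimal sketch of the intended ordering:

import os

# Set thread-related environment variables before any import that loads
# OpenBLAS/MKL; otherwise the limit is silently ignored. The value must be a str.
os.environ['OMP_NUM_THREADS'] = "1"

import numpy as np  # numpy's BLAS backend is initialised here and picks up the limit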
CoderJackZhu commented 1 year ago

The code works correctly

FabianIsensee commented 1 year ago

Fantastic :-) Thanks! I will need to wait a bit to see if more people have the same problem, and if so I will have to reintroduce OMP_NUM_THREADS (I dropped this when moving from v1 to v2 to make the installation simpler).

wujingweb commented 1 year ago

I ran into the same error. Setting os.environ['OMP_NUM_THREADS']="1" did not help. It may be that too many processes were left running; nvidia-smi does not show them. Killing them all with fuser -v /dev/nvidia* | awk '{for(i=1;i<=NF;i++)print "kill -9 " $i;}' | sh made it run again.

FabianIsensee commented 1 year ago

?

akeebatra commented 1 year ago

I am still getting the same error even after running with OMP_NUM_THREADS=1. My hardware configuration: MacBook Pro (early 2015), macOS 10.14, RAM: 16 GB, GPU: Intel Iris Graphics 6100 1536 MB.

(base) Akshays-MacBook-Pro:~ akshay$ OMP_NUM_THREADS=1 nnUNetv2_train 1 2d 0 -device cpu
/opt/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.25.0)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Using device: cpu

####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. #######################################################################

This is the configuration used by this training: Configuration name: 2d {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 13, 'patch_size': [768, 320], 'median_image_size_in_voxels': [3180.0, 1498.5], 'spacing': [1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [True], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [7, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 1]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True}

These are the global plan.json settings: {'dataset_name': 'Dataset001_InBreast', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [999.0, 1.0, 1.0], 'original_median_shape_after_transp': [1, 3180, 1498], 'image_reader_writer': 'NaturalImage2DIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 255.0, 'mean': 171.5933231709939, 'median': 176.0, 'min': 69.0, 'percentile_00_5': 101.0, 'percentile_99_5': 230.0, 'std': 30.106854637129793}}}

2023-06-19 20:05:54.976702: unpacking dataset... /opt/anaconda3/lib/python3.9/site-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.25.0 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" /opt/anaconda3/lib/python3.9/site-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.25.0 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" 2023-06-19 20:06:02.614134: unpacking done... 2023-06-19 20:06:02.616596: do_dummy_2d_data_aug: False 2023-06-19 20:06:02.618724: Using splits from existing split file: /Users/akshay/Documents/Master Project/Breast segmenatation Unet/nnUNet_preprocessed/Dataset001_InBreast/splits_final.json 2023-06-19 20:06:02.619395: The split file contains 5 splits. 2023-06-19 20:06:02.619575: Desired fold for training: 0 2023-06-19 20:06:02.619777: This split has 68 training and 18 validation cases. 2023-06-19 20:06:02.789835: Unable to plot network architecture: 2023-06-19 20:06:02.790218: No module named 'hiddenlayer' 2023-06-19 20:06:02.818363: 2023-06-19 20:06:02.818888: Epoch 0 2023-06-19 20:06:02.819498: Current learning rate: 0.01 /opt/anaconda3/lib/python3.9/site-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.25.0 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" /opt/anaconda3/lib/python3.9/site-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.25.0 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" /opt/anaconda3/lib/python3.9/site-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.25.0 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" /opt/anaconda3/lib/python3.9/site-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.25.0 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" Process Process-6: Process Process-4: Process Process-3: Traceback (most recent call last): File "/opt/anaconda3/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/opt/anaconda3/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, self._kwargs) File "/opt/anaconda3/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 41, in producer with threadpool_limits(1, None): File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 171, in init self._original_info = self._set_threadpool_limits() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 268, in _set_threadpool_limits modules = _ThreadpoolInfo(prefixes=self._prefixes, File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 340, in init self._load_modules() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 371, in _load_modules self._find_modules_with_dyld() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 428, in _find_modules_with_dyld self._make_module_from_path(filepath) File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 515, in _make_module_from_path module = module_class(filepath, 
prefix, user_api, internal_api) File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 606, in init self.version = self.get_version() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 646, in get_version config = get_config().split() AttributeError: 'NoneType' object has no attribute 'split' Traceback (most recent call last): File "/opt/anaconda3/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/opt/anaconda3/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, *self._kwargs) File "/opt/anaconda3/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 41, in producer with threadpool_limits(1, None): File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 171, in init self._original_info = self._set_threadpool_limits() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 268, in _set_threadpool_limits modules = _ThreadpoolInfo(prefixes=self._prefixes, File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 340, in init self._load_modules() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 371, in _load_modules self._find_modules_with_dyld() File "/opt/anaconda3/lib/python3.9/site-paTraceback (most recent call last): File "/opt/anaconda3/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/opt/anaconda3/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(self._args, self._kwargs) File "/opt/anaconda3/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 41, in producer with threadpool_limits(1, None): File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 171, in init self._original_info = self._set_threadpool_limits() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 268, in _set_threadpool_limits modules = _ThreadpoolInfo(prefixes=self._prefixes, File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 340, in init self._load_modules() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 371, in _load_modules self._find_modules_with_dyld() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 428, in _find_modules_with_dyld self._make_module_from_path(filepath) File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 515, in _make_module_from_path module = module_class(filepath, prefix, user_api, internal_api) File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 606, in init self.version = self.get_version() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 646, in get_version config = get_config().split() AttributeError: 'NoneType' object has no attribute 'split' ckages/threadpoolctl.py", line 428, in _find_modules_with_dyld self._make_module_from_path(filepath) File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 515, in _make_module_from_path module = module_class(filepath, prefix, user_api, internal_api) File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 606, in init self.version = self.get_version() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 646, in get_version config = get_config().split() AttributeError: 'NoneType' object has no attribute 'split' Process Process-5: Traceback (most recent call last): File 
"/opt/anaconda3/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/opt/anaconda3/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, *self._kwargs) File "/opt/anaconda3/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 41, in producer with threadpool_limits(1, None): File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 171, in init self._original_info = self._set_threadpool_limits() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 268, in _set_threadpool_limits modules = _ThreadpoolInfo(prefixes=self._prefixes, File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 340, in init self._load_modules() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 371, in _load_modules self._find_modules_with_dyld() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 428, in _find_modules_with_dyld self._make_module_from_path(filepath) File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 515, in _make_module_from_path module = module_class(filepath, prefix, user_api, internal_api) File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 606, in init self.version = self.get_version() File "/opt/anaconda3/lib/python3.9/site-packages/threadpoolctl.py", line 646, in get_version config = get_config().split() AttributeError: 'NoneType' object has no attribute 'split' Exception in thread Thread-4: Traceback (most recent call last): File "/opt/anaconda3/lib/python3.9/threading.py", line 973, in _bootstrap_inner self.run() File "/opt/anaconda3/lib/python3.9/threading.py", line 910, in run self._target(self._args, **self._kwargs) File "/opt/anaconda3/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop raise e File "/opt/anaconda3/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the " RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message Traceback (most recent call last): File "/opt/anaconda3/bin/nnUNetv2_train", line 8, in sys.exit(run_training_entry()) File "/opt/anaconda3/lib/python3.9/site-packages/nnunetv2/run/run_training.py", line 252, in run_training_entry run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights, File "/opt/anaconda3/lib/python3.9/site-packages/nnunetv2/run/run_training.py", line 195, in run_training nnunet_trainer.run_training() File "/opt/anaconda3/lib/python3.9/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1211, in run_training train_outputs.append(self.train_step(next(self.dataloader_train))) File "/opt/anaconda3/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in next item = self.__get_next_item() File "/opt/anaconda3/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the " RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

ancestor-mithril commented 1 year ago

For single-threaded nnUNet, use

nnUNet_n_proc_DA=0 nnUNetv2_train 1 2d 0 -device cpu
giuliarubiu commented 1 year ago

hi I also get the following error: Traceback (most recent call last): File "/usr/local/bin/nnUNetv2_train", line 33, in sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')()) File "/data/users/giulia/scripts/nnUNet/nnunetv2/run/run_training.py", line 253, in run_training_entry run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights, File "/data/users/giulia/scripts/nnUNet/nnunetv2/run/run_training.py", line 196, in run_training nnunet_trainer.run_training() File "/data/users/giulia/scripts/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1227, in run_training train_outputs.append(self.train_step(next(self.dataloader_train))) File "/data/users/giulia/scripts/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 867, in train_step output = self.network(data) File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 82, in forward return self.dynamo_ctx(self._orig_mod.forward)(*args, *kwargs) File "/usr/local/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn return fn(args, kwargs) File "/usr/local/lib/python3.9/site-packages/dynamic_network_architectures/architectures/unet.py", line 58, in forward def forward(self, x): File "/usr/local/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn return fn(*args, kwargs) File "/usr/local/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 2819, in forward return compiled_fn(full_args) File "/usr/local/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1222, in g return f(args) File "/usr/local/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 2386, in debug_compiled_function return compiled_function(args) File "/usr/local/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1898, in runtime_wrapper all_outs = call_func_with_args( File "/usr/local/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1247, in call_func_with_args out = normalize_as_list(f(args)) File "/usr/local/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1222, in g return f(args) File "/usr/local/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply return super().apply(args, kwargs) # type: ignore[misc] File "/usr/local/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 2151, in forward fw_outs = call_func_with_args( File "/usr/local/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1247, in call_func_with_args out = normalize_as_list(f(args)) File "/usr/local/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 248, in run return model(new_inputs) File "/tmp/torchinductor_root/76/c76z3d2vbytq5iohj6abm6ohfqvzfklu3vdqtdyy3dmloqxapdjk.py", line 2126, in call triton__0.run(primals_1, buf0, 864, grid=grid(864), stream=stream0) File "/usr/local/lib/python3.9/site-packages/torch/_inductor/triton_ops/autotune.py", line 190, in run result = launcher( File "", line 6, in launcher File "/usr/local/lib/python3.9/site-packages/triton/compiler.py", line 1679, in getattribute self._init_handles() File "/usr/local/lib/python3.9/site-packages/triton/compiler.py", line 1672, in _init_handles mod, func, n_regs, n_spills = cuda_utils.load_binary(self.metadata["name"], self.asm["cubin"], self.shared, device) RuntimeError: Triton Error 
[CUDA]: device kernel image is invalid Exception in thread Thread-4: Traceback (most recent call last): File "/usr/local/lib/python3.9/threading.py", line 980, in _bootstrap_inner self.run() File "/usr/local/lib/python3.9/threading.py", line 917, in run self._target(*self._args, **self._kwargs) File "/usr/local/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop raise e File "/usr/local/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the " RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

I tried setting the following environment variables: os.environ['OMP_NUM_THREADS'] = "1" and os.environ['nnunet_proc_DA'] = "6".

Neither of them seems to work. It's strange, because I ran other trainings before and everything went fine. Does anyone have a suggestion?

julclu commented 1 year ago

Hello,

Same problem as @giuliarubiu above. I did other trainings previously with the exact same code and it was totally fine. Now I am suddenly getting this error, even when setting the thread number to 1 (OMP_NUM_THREADS=1).

rfrs commented 1 year ago

I am also having the same error while running on an Azure compute instance (12 vCPU cores, 220 GB RAM and 1x Nvidia V100 16 GB). Training runs for a few epochs and then freezes with zero GPU activity. I tried OMP_NUM_THREADS=1 but it still does not work... any further suggestions?

ljestaciocerquin commented 1 year ago

Hello, I have the same problem as @giuliarubiu. I tried OMP_NUM_THREADS=1 and also nnUNet_n_proc_DA=0, but neither worked. I've tried it on an RTX 8000 48 GB. Can you please help me solve this problem?

rfrs commented 1 year ago

I have also tried both of those suggestions, and neither of them worked for me either :(

MOMOANNIE commented 1 year ago

@FabianIsensee

Hi FabianIsensee, I use nnUNetv2. When I add --c after the training command to continue training, the above problem occurs: RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

None of the methods above solved it for me. Is there any other way to solve this problem?

I found that if I don't add --c and use CUDA_VISIBLE_DEVICES=7 nnUNetv2_train 2 3d_fullres 4 --npz, training runs normally. But after running 1000 epochs, if I change the number of epochs to 2000 and continue training with --c (CUDA_VISIBLE_DEVICES=7 nnUNetv2_train 2 3d_fullres 4 --npz --c), the problem occurs: RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

lhz1209 commented 1 year ago

I also ran into a similar problem. I traced it to the fact that the disk mounted inside Docker was not the same disk that stored the model training data; once I put both on the same disk, the problem was solved.

Update (2024-03-07):

If you hit this same type of error, you can also check whether the preprocessed data inside the nnUNet_preprocessed folder is correct. (I usually just inspect the file sizes manually; if some files are only a few hundred kB, or 0 bytes, it likely means the preprocessing did not run to completion.)
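A minimal sketch of such a check (a hypothetical helper, not part of nnU-Net; the folder path and size threshold below are placeholders to adjust):

import os

def find_suspicious_files(preprocessed_dir, min_bytes=500 * 1024):
    """Return (path, size) pairs for files smaller than min_bytes (default ~500 kB)."""
    suspicious = []
    for root, _, files in os.walk(preprocessed_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if size < min_bytes:
                suspicious.append((path, size))
    return suspicious

# Example: flag possibly incomplete preprocessing outputs for one dataset.
for path, size in find_suspicious_files("/path/to/nnUNet_preprocessed/Dataset137_BraTS2021"):
    print(f"{size:>12d} B  {path}")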

YazdanSalimi commented 12 months ago

Thank you for your great project. The problem is still there when doing inference on a large number of datasets. I solved it by adding --c and running the code again. Please let me know in case of an update. Thank you.

Awayah commented 9 months ago

Facing the same problem

sevgikafali commented 8 months ago

Hey, is there a solution to this? I am receiving the same error.

rooskraaijveld commented 5 months ago

Same here! I have the same problem

rfrs commented 5 months ago

I still have the same issue from time to time when training a model, even when using the latest version of nnUNet. As a workaround, I run only a few epochs at a time, either 50 or 100, since the framework saves a checkpoint every 50 epochs and you can continue the training from it.

Just change self.num_epochs = 1000 in the script https://github.com/MIC-DKFZ/nnUNet/blob/master/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py to, for example, self.num_epochs = 100.

You can then run the code for 100 epochs. Afterwards, change it to self.num_epochs = 200 and continue the training with --c; it will keep using the model developed so far.

A more elegant way is to have: self.num_epochs = int(input())
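Another option, sketched below under the assumption that subclassing works on your installed version (the class name is made up for illustration), is a tiny custom trainer that overrides the epoch count instead of editing nnUNetTrainer.py in place; such a trainer would be selected by name via -tr, as in the commands shown elsewhere in this thread.

from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer

# Hedged sketch (hypothetical class): train in shorter chunks without touching
# the installed nnUNetTrainer.py; resume across runs with --c.
class nnUNetTrainer_100epochs(nnUNetTrainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_epochs = 100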

Hope this helps.

Best

endlesscodinggg commented 3 months ago

If you are running nnUNet in Docker, you may need to set the --shm-size parameter, because the default shm-size is 64 MB, which may not be enough. For example: nvidia-docker run -it --name xxx --shm-size 32g image_id bash.
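A minimal diagnostic sketch (assuming a Linux container with /dev/shm mounted) to see how much shared memory the training process actually has:

import shutil

# Docker's default /dev/shm is 64 MB; multi-worker data loading can depend on
# shared memory, and running out of it is a known cause of dying background workers.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")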

goodgoodstudy233 commented 3 months ago

> I ran into the same error. Setting os.environ['OMP_NUM_THREADS']="1" did not help. It may be that too many processes were left running; nvidia-smi does not show them. Killing them all with fuser -v /dev/nvidia* | awk '{for(i=1;i<=NF;i++)print "kill -9 " $i;}' | sh made it run again.

I also ran into this problem but could not solve it. Could you please give more details?

kaident-tr commented 1 month ago

Hi all, may I ask how to set OMP_NUM_THREADS=1 on Windows? After I run this: OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=0 nnUNetv2_train Dataset701_AbdomenCT 2d all -tr nnUNetTrainerUMambaBot -device cuda

I get the following error: 'OMP_NUM_THREADS' is not recognized as an internal or external command, operable program or batch file.

niubihonghong12345 commented 1 month ago

> I ran into the same error. Setting os.environ['OMP_NUM_THREADS']="1" did not help. It may be that too many processes were left running; nvidia-smi does not show them. Killing them all with fuser -v /dev/nvidia* | awk '{for(i=1;i<=NF;i++)print "kill -9 " $i;}' | sh made it run again.

> I also ran into this problem but could not solve it. Could you please give more details?

Did you solve it? I ran into the same problem as well.