bowang-lab / U-Mamba

U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation
https://arxiv.org/abs/2401.04722
Apache License 2.0

RuntimeError: One or more background workers are no longer alive. #54

Open kaident-tr opened 1 month ago

kaident-tr commented 1 month ago

Hi all, when I start training in a Windows environment, I get the error below. Even though I have tried the solution from https://github.com/MIC-DKFZ/nnUNet/issues/1343 in the original nnUNet repository, setting the environment variable OMP_NUM_THREADS=1, the problem is still not solved. Thank you in advance for your help!
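For reference, a minimal sketch of that workaround applied from a small Python wrapper (the `run_training_entry` entry point is the one shown in the traceback below; whether the variable is set from a wrapper like this or with a shell-level `set OMP_NUM_THREADS=1` should not matter):

```python
# Hedged sketch: force single-threaded OpenMP before torch/numpy are imported,
# then hand off to nnU-Net's CLI entry point. Invoke this script with the same
# arguments normally passed to nnUNetv2_train (e.g. "701 2d 0").
import os
os.environ["OMP_NUM_THREADS"] = "1"

from nnunetv2.run.run_training import run_training_entry

if __name__ == "__main__":  # the __main__ guard is required for multiprocessing on Windows
    run_training_entry()
```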

```
This is the configuration used by this training:
Configuration name: 2d
{'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 14, 'patch_size': [512, 448], 'median_image_size_in_voxels': [512.0, 512.0], 'spacing': [0.7958984971046448, 0.7958984971046448], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 1, 1, 1], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 1, 1], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True}

These are the global plan.json settings: {'dataset_name': 'Dataset701_AbdomenCT', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [2.5, 0.7958984971046448, 0.7958984971046448], 'original_median_shape_after_transp': [97, 512, 512], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 3071.0, 'mean': 97.29691314697266, 'median': 118.0, 'min': -1024.0, 'percentile_00_5': -958.0, 'percentile_99_5': 270.0, 'std': 137.85003662109375}}}

2024-07-24 17:20:43.049483: unpacking dataset...
2024-07-24 17:20:43.598747: unpacking done...
2024-07-24 17:20:43.599747: do_dummy_2d_data_aug: False
2024-07-24 17:20:43.666747: Unable to plot network architecture:
2024-07-24 17:20:43.666747: No module named 'hiddenlayer'
2024-07-24 17:20:43.759725:
2024-07-24 17:20:43.760716: Epoch 0
2024-07-24 17:20:43.761715: Current learning rate: 0.01
using pin_memory on device 0
Traceback (most recent call last):
  File "\\?\C:\ProgramData\Anaconda3\envs\umamba\Scripts\nnUNetv2_train-script.py", line 33, in <module>
    sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
  File "f:\u-mamba-main\umamba\nnunetv2\run\run_training.py", line 268, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "f:\u-mamba-main\umamba\nnunetv2\run\run_training.py", line 204, in run_training
    nnunet_trainer.run_training()
  File "f:\u-mamba-main\umamba\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 1258, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "f:\u-mamba-main\umamba\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 900, in train_step
    output = self.network(data)
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "f:\u-mamba-main\umamba\nnunetv2\nets\UMambaBot_2d.py", line 432, in forward
    skips[-1] = self.mamba_layer(skips[-1])
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\amp\autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "f:\u-mamba-main\umamba\nnunetv2\nets\UMambaBot_2d.py", line 61, in forward
    x_mamba = self.mamba(x_norm)
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\mamba_ssm\modules\mamba_simple.py", line 146, in forward
    out = mamba_inner_fn(
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\mamba_ssm\ops\selective_scan_interface.py", line 317, in mamba_inner_fn
    return MambaInnerFn.apply(xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\autograd\function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\cuda\amp\autocast_mode.py", line 113, in decorate_fwd
    return fwd(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\mamba_ssm\ops\selective_scan_interface.py", line 187, in forward
    conv1d_out = causal_conv1d_cuda.causal_conv1d_fwd(
TypeError: causal_conv1d_fwd(): incompatible function arguments. The following argument types are supported:

  1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: Optional[torch.Tensor], arg3: Optional[torch.Tensor], arg4: bool) -> torch.Tensor

Invoked with: tensor([[[-0.3531, -0.3256, -0.5120,  ..., -0.3845, -0.3780, -0.2731],
     [-0.1226,  0.0515,  0.0443,  ..., -0.0484, -0.0954,  0.2243],
     [ 0.2591,  0.4765,  0.4899,  ...,  0.2762,  0.2085,  0.1601],
     ...,
     [-0.4706,  0.0122, -0.0670,  ..., -0.6855, -1.0694, -0.7547],
     [ 0.2710,  0.6020,  0.5813,  ...,  0.0339,  0.0822,  0.5069],
     [-0.0817,  0.1549,  0.1879,  ..., -0.1216, -0.4358, -0.3873]],

    [[-0.7350, -0.6563, -0.6970,  ..., -0.5548, -0.2491, -0.3194],
     [-0.3465, -0.6268, -0.4854,  ...,  0.2556,  0.1076,  0.1940],
     [ 0.0645,  0.5889,  0.7408,  ...,  0.4412,  0.1118,  0.2022],
     ...,
     [-0.7669, -0.8219, -0.9606,  ..., -0.6517, -0.6021, -0.7447],
     [ 0.6877,  0.3808,  0.4204,  ...,  0.2805,  0.3491,  0.3867],
     [ 0.1577,  0.0902,  0.0191,  ..., -0.5127, -0.3992, -0.4217]],

    [[-0.6899, -0.6800, -0.7939,  ..., -0.2452, -0.2823, -0.2156],
     [-0.2452, -0.2569, -0.4180,  ...,  0.2565,  0.3105,  0.2020],
     [ 0.4328,  0.6825,  0.6242,  ...,  0.2382,  0.2548,  0.2945],
     ...,
     [-0.5348, -0.4934, -0.6218,  ..., -0.8466, -0.8843, -0.9299],
     [ 0.1885,  0.4097,  0.3503,  ...,  0.5430,  0.5202,  0.5581],
     [-0.4576, -0.3852, -0.5572,  ..., -0.4343, -0.5026, -0.4852]],

    ...,

    [[-0.3982, -0.6243, -0.6702,  ..., -0.2997, -0.0544, -0.6496],
     [-0.3635, -0.3576, -0.4177,  ...,  0.1261,  0.1114,  0.0181],
     [ 0.3839,  0.7153,  0.7155,  ...,  0.2303,  0.1457, -0.1998],
     ...,
     [-0.6408, -0.5035, -0.6167,  ..., -0.6473, -0.4699, -0.2966],
     [ 0.3132,  0.4346,  0.4209,  ...,  0.0756,  0.2835,  0.2599],
     [-0.2990, -0.3384, -0.4100,  ...,  0.0843, -0.1040, -0.0645]],

    [[-0.4619, -0.7534, -0.7760,  ..., -0.5952, -0.3705, -0.3551],
     [-0.1528, -0.3495, -0.3650,  ...,  0.0889,  0.2627,  0.0885],
     [ 0.5250,  0.7301,  0.7312,  ...,  0.2815,  0.2979,  0.2394],
     ...,
     [-0.6124, -0.5625, -0.6515,  ..., -0.4177, -0.9805, -0.9586],
     [ 0.3327,  0.3848,  0.4037,  ...,  0.0295,  0.4747,  0.5617],
     [-0.3875, -0.3905, -0.4910,  ..., -0.0437, -0.5517, -0.5322]],

    [[-0.5744, -0.5597, -0.6744,  ..., -0.4591, -0.5266, -0.3234],
     [-0.2457, -0.3103, -0.3841,  ...,  0.0146,  0.0279,  0.0058],
     [ 0.5145,  0.6709,  0.6334,  ...,  0.0854,  0.1010,  0.3496],
     ...,
     [-0.6111, -0.6036, -0.6492,  ..., -0.6807, -0.6825, -0.8804],
     [ 0.2965,  0.4934,  0.4702,  ...,  0.5427,  0.5108,  0.7819],
     [-0.3857, -0.3858, -0.3655,  ..., -0.4994, -0.5220, -0.0722]]],
   device='cuda:0', requires_grad=True), tensor([[ 0.2771, -0.4502,  0.2234,  0.4393],
    [-0.2371,  0.0904,  0.3013,  0.2585],
    [-0.2705,  0.0695,  0.4170, -0.1234],
    ...,
    [ 0.3458, -0.2377, -0.4476,  0.1447],
    [ 0.4869,  0.3001, -0.4930,  0.0575],
    [ 0.4755, -0.2672,  0.3849, -0.0855]], device='cuda:0',
   requires_grad=True), Parameter containing:

tensor([-0.0066, -0.3897,  0.1920,  ...,  0.1256, -0.0983, -0.4903], device='cuda:0', requires_grad=True), None, None, None, True
Exception in thread Thread-4 (results_loop):
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
```
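The final `TypeError: causal_conv1d_fwd(): incompatible function arguments` (seven arguments passed, five supported) points to a version/build mismatch between `mamba-ssm` and `causal-conv1d` rather than to the data loading itself; the "background workers are no longer alive" message is only the downstream symptom. A hedged diagnostic, using only the package and module names that appear in the traceback above:

```python
# Hedged diagnostic: report the installed versions of the packages involved and
# the signature of the compiled causal_conv1d_fwd binding that mamba_ssm calls.
# If its argument list differs from what mamba_ssm passes (see the TypeError
# above), the two packages expect different APIs and need matching versions.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("mamba-ssm", "causal-conv1d", "torch"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")

import causal_conv1d_cuda  # the compiled extension named in the traceback
print(causal_conv1d_cuda.causal_conv1d_fwd.__doc__)  # pybind11 exposes the binding's signature in the docstring
```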

Saul62 commented 1 month ago

I'm running into the same problem. Have you managed to solve it yet?

qxxfd commented 1 month ago

I'm running into the same problem as well. Has anyone solved it? My command and log:

```
CUDA_VISIBLE_DEVICES=1 nnUNetv2_train 11 3d_fullres 0

############################ INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md ############################

Using device: cuda:0

####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. #######################################################################

2024-07-28 01:12:55.969214: do_dummy_2d_data_aug: True
2024-07-28 01:12:55.970464: Using splits from existing split file: /gpfs/share/home/2301210659/tools/nnunet_v2/dataset/nnUNet_preprocessed/Dataset011_T-tubule/splits_final.json
2024-07-28 01:12:55.971022: The split file contains 5 splits.
2024-07-28 01:12:55.971232: Desired fold for training: 0
2024-07-28 01:12:55.971411: This split has 4 training and 1 validation cases.
using pin_memory on device 0
Exception in background worker 3: local variable 'region_labels' referenced before assignment
Traceback (most recent call last):
  File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer
    item = next(data_loader)
  File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__
    return self.generate_train_batch()
  File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/training/dataloading/data_loader_3d.py", line 61, in generate_train_batch
    tmp = self.transforms({'image': data_all[b], 'segmentation': seg_all[b]})
  File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/base/basic_transform.py", line 18, in __call__
    return self.apply(data_dict, **params)
  File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/utils/compose.py", line 13, in apply
    data_dict = t(**data_dict)
  File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/base/basic_transform.py", line 18, in __call__
    return self.apply(data_dict, **params)
  File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/base/basic_transform.py", line 67, in apply
    data_dict['segmentation'] = self._apply_to_segmentation(data_dict['segmentation'], **params)
  File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/utils/seg_to_regions.py", line 17, in _apply_to_segmentation
    if isinstance(region_labels, int) or len(region_labels) == 1:
UnboundLocalError: local variable 'region_labels' referenced before assignment
Traceback (most recent call last):
  File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1362, in run_training
    self.on_train_start()
  File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 903, in on_train_start
    self.dataloader_train, self.dataloader_val = self.get_dataloaders()
  File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 696, in get_dataloaders
    _ = next(mt_gen_train)
  File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
```

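The `UnboundLocalError: local variable 'region_labels' referenced before assignment` above is raised inside `batchgeneratorsv2`, not in U-Mamba/nnU-Net code, so a mismatch between the installed `nnunetv2` and `batchgeneratorsv2` versions is one plausible cause. A small, hedged check of what the environment actually contains:

```python
# Hedged check: print the versions of the packages appearing in the traceback above.
# Which combinations are compatible should be taken from the U-Mamba / nnU-Net
# requirements; this snippet only reports the current environment.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("nnunetv2", "batchgenerators", "batchgeneratorsv2"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```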
kaident-tr commented 1 month ago

> I'm running into the same problem as well. Has anyone solved it?


I have solved my problem. I think many of these issues ultimately end up at the generic "One or more background workers are no longer alive" message, so you need to trace upward in the log to the actual error in the traceback. In my case, the fix was reinstalling the required packages. You could also try Python 3.10 (I saw that the authors recommend 3.10).
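A hedged sketch of that sanity check: confirm the interpreter version and that the packages named in this thread's tracebacks import at all before re-running training (the exact versions to reinstall should follow the U-Mamba README):

```python
# Hedged environment check: report the Python version (3.10 is suggested above)
# and try importing the packages that failed earlier in this thread. Any import
# error printed here is the thing to fix before training again.
import sys
print("Python", sys.version.split()[0])

for name in ("torch", "mamba_ssm", "causal_conv1d", "batchgenerators", "batchgeneratorsv2"):
    try:
        mod = __import__(name)
        print(name, getattr(mod, "__version__", "imported OK"))
    except Exception as exc:  # deliberately broad: this is only a report
        print(name, "FAILED:", exc)
```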

AyacodeYa commented 1 month ago

Hi everyone, maybe this can help you. https://github.com/bowang-lab/U-Mamba/issues/56#issuecomment-2288217162