MIC-DKFZ / nnUNet

Apache License 2.0

Azure training #1594

Closed: rfrs closed this issue 5 months ago

rfrs commented 1 year ago

Dear all,

I have been testing the nnU-Net pipeline in Azure. I am performing preliminary tests with N=25 3D images, on a compute instance with 200 GB of RAM and an Nvidia V100 with 16 GB of vRAM. Nonetheless, during the training step it runs for a few epochs and then stops without providing any error message. It does not seem to be a vRAM issue, as usage stays steady at about 50-60%, nor does it seem to be a RAM issue. Any ideas on how to troubleshoot this?

I am getting the error: RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

I tried setting OMP_NUM_THREADS=1 but I still keep getting the error. Any ideas?

Thanks. Best Rui

ykirchhoff commented 1 year ago

Hi Rui,

could you please post the lines before that error message? You can usually find the problem somewhere in there.

Best, Yannick

rfrs commented 1 year ago

Hi Yannick, thanks for reaching out. The message is as follows:

File "/anaconda/envs/nnunet2_py39/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/nnunetv2/run/run_training.py", line 252, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/nnunetv2/run/run_training.py", line 195, in run_training
    nnunet_trainer.run_training()
  File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1211, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 188, in __get_next_item
    sleep(self.wait_time)
KeyboardInterrupt
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/anaconda/envs/nnunet2_py39/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/anaconda/envs/nnunet2_py39/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

Thanks. Best Rui

ykirchhoff commented 1 year ago

Hi Rui,

thanks for the complete error. There is a KeyboardInterrupt in your traceback which, if you didn't trigger it yourself, was probably triggered by Azure for some reason. As RAM and vRAM shouldn't be a problem, it might be due to CPU usage - did you check that? You can adjust the number of processes used for data augmentation with nnUNet_n_proc_DA, which defaults to 12. Setting nnUNet_n_proc_DA=0 gives you single-threaded data augmentation, which is generally slower but uses much less CPU. If that works, try increasing nnUNet_n_proc_DA again.

Hope this helps. Best, Yannick

PS: nnUNet_n_proc_DA can be adjusted by simply doing nnUNet_n_proc_DA=0 nnUNetv2_train ...
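
As a rough rule of thumb (my own sketch, not an official nnU-Net utility), a starting value for nnUNet_n_proc_DA can be derived from the number of CPU cores the instance exposes, leaving a couple of cores free for the main training process:

```python
import os

# Hypothetical helper: suggest a starting value for nnUNet_n_proc_DA based on
# the CPU cores visible to this process, keeping a couple of cores free for
# the main training loop and the OS.
def suggest_n_proc_da(reserve: int = 2) -> int:
    n_cores = os.cpu_count() or 1
    return max(1, n_cores - reserve)

if __name__ == "__main__":
    print(f"Suggested nnUNet_n_proc_DA: {suggest_n_proc_da()}")
```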

rfrs commented 1 year ago

Hi Yannick, thanks for the answer. Unfortunately it did not work: the code runs for a few epochs and then stops again. I print the log below. Since I have 12 vCPUs available in the environment I set nnUNet_n_proc_DA=10 and it failed. It was really slow when the thread count was set to 0 or 1.


$ nnUNet_n_proc_DA=10 nnUNetv2_train 200 2d 0 --npz
Using device: cuda:0

####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. #######################################################################

This is the configuration used by this training: Configuration name: 2d {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 13, 'patch_size': [512, 448], 'median_image_size_in_voxels': [512.0, 512.0], 'spacing': [0.767578125, 0.767578125], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True}

These are the global plan.json settings: {'dataset_name': 'Dataset200_spine', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.0, 0.767578125, 0.767578125], 'original_median_shape_after_transp': [541, 512, 512], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 2663.0, 'mean': 329.7226257324219, 'median': 261.0, 'min': -941.0, 'percentile_00_5': -2.0, 'percentile_99_5': 1206.0, 'std': 241.9765625}}}

2023-08-04 13:28:13.893867: unpacking dataset... 2023-08-04 13:28:17.474627: unpacking done... 2023-08-04 13:28:17.509105: do_dummy_2d_data_aug: False 2023-08-04 13:28:17.551622: Using splits from existing split file: nnUNet_models/spine1K_n25_2xV100_test/nnUNet_preprocessed/Dataset200_spine/splits_final.json 2023-08-04 13:28:17.568009: The split file contains 5 splits. 2023-08-04 13:28:17.576300: Desired fold for training: 0 2023-08-04 13:28:17.584742: This split has 20 training and 5 validation cases. 2023-08-04 13:28:19.093326: Unable to plot network architecture: 2023-08-04 13:28:19.168411: module 'torch.jit' has no attribute 'get_trace_graph' 2023-08-04 13:28:19.315055: 2023-08-04 13:28:19.323275: Epoch 0 2023-08-04 13:28:19.332131: Current learning rate: 0.01 using pin_memory on device 0 using pin_memory on device 0 /anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:970: RuntimeWarning: invalid value encountered in scalar divide global_dc_per_class = [i for i in [2 i / (2 i + j + k) for i, j, k in 2023-08-04 13:30:11.100131: train_loss 0.2669 2023-08-04 13:30:11.193105: val_loss 0.0517 2023-08-04 13:30:11.205193: Pseudo dice [0.0, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 2023-08-04 13:30:11.225502: Epoch time: 111.79 s 2023-08-04 13:30:11.235616: Yayy! New best EMA pseudo Dice: 0.0 2023-08-04 13:30:14.811116: 2023-08-04 13:30:14.833648: Epoch 1 2023-08-04 13:30:14.842434: Current learning rate: 0.00999 2023-08-04 13:31:42.493429: train_loss 0.0481 2023-08-04 13:31:42.638597: val_loss 0.035 2023-08-04 13:31:42.673699: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0603, 0.0, 0.0, 0.0, 0.0] 2023-08-04 13:31:42.690906: Epoch time: 87.68 s 2023-08-04 13:31:42.706916: Yayy! New best EMA pseudo Dice: 0.0003 2023-08-04 13:31:46.506501: 2023-08-04 13:31:46.524667: Epoch 2 2023-08-04 13:31:46.532792: Current learning rate: 0.00998 2023-08-04 13:33:14.203712: train_loss 0.0312 2023-08-04 13:33:14.317371: val_loss 0.0204 2023-08-04 13:33:14.334751: Pseudo dice [nan, nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0084, 0.0, 0.0992, 0.0003, 0.0, 0.0, 0.0] 2023-08-04 13:33:14.359694: Epoch time: 87.7 s 2023-08-04 13:33:14.376047: Yayy! New best EMA pseudo Dice: 0.0009 2023-08-04 13:33:18.812734: 2023-08-04 13:33:18.831534: Epoch 3 2023-08-04 13:33:18.841416: Current learning rate: 0.00997 2023-08-04 13:34:49.909967: train_loss 0.0225 2023-08-04 13:34:50.001015: val_loss 0.0122 2023-08-04 13:34:50.067200: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0391, 0.0, 0.0227, 0.0193, 0.1111, 0.0, 0.0] 2023-08-04 13:34:50.076057: Epoch time: 91.1 s 2023-08-04 13:34:50.103311: Yayy! New best EMA pseudo Dice: 0.0018 2023-08-04 13:34:53.981759: 2023-08-04 13:34:54.005108: Epoch 4 2023-08-04 13:34:54.025370: Current learning rate: 0.00996 2023-08-04 13:36:18.416863: train_loss 0.0137 2023-08-04 13:36:18.559083: val_loss 0.0065 2023-08-04 13:36:18.572116: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1368, 0.0, 0.1498, 0.0, 0.0] 2023-08-04 13:36:18.589587: Epoch time: 84.44 s 2023-08-04 13:36:18.598529: Yayy! 
New best EMA pseudo Dice: 0.0031 2023-08-04 13:36:22.174985: 2023-08-04 13:36:22.198013: Epoch 5 2023-08-04 13:36:22.206127: Current learning rate: 0.00995 2023-08-04 13:37:49.409249: train_loss 0.0064 2023-08-04 13:37:49.512477: val_loss -0.0033 2023-08-04 13:37:49.527299: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1247, 0.0, 0.0, 0.0, 0.2593, 0.1335] 2023-08-04 13:37:49.554250: Epoch time: 87.24 s 2023-08-04 13:37:49.564446: Yayy! New best EMA pseudo Dice: 0.0055 2023-08-04 13:37:52.940592: 2023-08-04 13:37:52.959001: Epoch 6 2023-08-04 13:37:52.967198: Current learning rate: 0.00995 2023-08-04 13:39:19.627892: train_loss -0.0041 2023-08-04 13:39:19.742339: val_loss -0.0134 2023-08-04 13:39:19.760233: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0072, 0.0, 0.0, 0.0002, 0.3249, 0.0, 0.1936] 2023-08-04 13:39:19.773959: Epoch time: 86.69 s 2023-08-04 13:39:19.798073: Yayy! New best EMA pseudo Dice: 0.0078 2023-08-04 13:39:23.230338: 2023-08-04 13:39:23.269459: Epoch 7 2023-08-04 13:39:23.277954: Current learning rate: 0.00994 2023-08-04 13:40:53.870357: train_loss -0.0171 2023-08-04 13:40:53.989840: val_loss -0.0252 2023-08-04 13:40:54.008450: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0125, 0.0024, 0.0155, 0.0, 0.3599, 0.0455, 0.18] 2023-08-04 13:40:54.032427: Epoch time: 90.64 s 2023-08-04 13:40:54.055148: Yayy! New best EMA pseudo Dice: 0.0102 2023-08-04 13:40:57.568206: 2023-08-04 13:40:57.590469: Epoch 8 2023-08-04 13:40:57.598779: Current learning rate: 0.00993 2023-08-04 13:42:24.810461: train_loss -0.0314 2023-08-04 13:42:24.915588: val_loss -0.039 2023-08-04 13:42:24.937008: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.093, 0.011, 0.3005, 0.2939, 0.4449, 0.0002, 0.1842] 2023-08-04 13:42:24.950041: Epoch time: 87.24 s 2023-08-04 13:42:24.966740: Yayy! 
New best EMA pseudo Dice: 0.0162 2023-08-04 13:42:28.827302: 2023-08-04 13:42:28.846198: Epoch 9 2023-08-04 13:42:28.854600: Current learning rate: 0.00992 ^CTraceback (most recent call last): File "/anaconda/envs/nnunet2_py39/bin/nnUNetv2_train", line 8, in sys.exit(run_training_entry()) File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/nnunetv2/run/run_training.py", line 252, in run_training_entry run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights, File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/nnunetv2/run/run_training.py", line 195, in run_training nnunet_trainer.run_training() File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1211, in run_training train_outputs.append(self.train_step(next(self.dataloader_train))) File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in next item = self.get_next_item() File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 188, in get_next_item sleep(self.wait_time) KeyboardInterrupt Exception in thread Thread-4: Traceback (most recent call last): File "/anaconda/envs/nnunet2_py39/lib/python3.9/threading.py", line 980, in _bootstrap_inner self.run() File "/anaconda/envs/nnunet2_py39/lib/python3.9/threading.py", line 917, in run self._target(*self._args, **self._kwargs) File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop raise e File "/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the " RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message


Do you have any further advice? We would really like to have this running in the cloud, especially since we don't have beefy enough GPUs on site.

Thanks Best Rui

ykirchhoff commented 1 year ago

Hi Rui,

yeah, training times become a pain when setting a low nnUNet_n_proc_DA. It is kind of weird that the error appears only after a few epochs. It seems to be triggered by Azure for some reason, so if it is neither RAM, VRAM nor CPU, I am a bit out of ideas here. The only other thing I can think of right now is some kind of timeout. Does it crash consistently after 9 epochs? Do other algorithms work fine on Azure?

Best, Yannick

rfrs commented 1 year ago

Hey Yannick, no, it is not consistent; it has crashed at epoch 1 as well as at epoch 9, etc. Do you have any recommendation for deploying nnUNet in cloud computing, i.e., specs for the compute instance to use?

Thanks for all Best Rui

ykirchhoff commented 1 year ago

Hi Rui,

no problem at all, and sorry that I couldn't help more. I will try to find out more about your problem. Regarding specs, anything with at least 10GB VRAM, preferably 6/12 cores/threads and 32GB or better 64GB of RAM should be more than enough - so nearly everything that is available for cloud computing ;). That is also why I am kind of confused about your problems with Azure. If you just want to try out and test the nnUNet pipeline, you can have a look at the nnUNet workshop, which details how to set up nnUNet in Google Colab. Just take the parts you need and use your own data. With the free option you can run trainings for 12 hours, if I am correct, so probably not enough to train a complete model but good enough for some initial tests without wasting money on compute instances where trainings just fail.

Best, Yannick

ykirchhoff commented 1 year ago

Hey Rui,

something came up in a discussion, are you running nnUNet in a docker container or similar on Azure? With docker you have to set --ipc=host or --shm-size=8GB in your docker run command (8GB just as an example), otherwise you run into issues with transferring data between the main training process and the processes used for dataloading and augmentation.

Best, Yannick
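
For reference, a quick way to check how much shared memory is actually available to the training process (a sketch of mine, not part of nnU-Net; only meaningful on Linux, e.g. inside the container):

```python
import shutil

# /dev/shm is the shared-memory mount that Python multiprocessing uses to pass
# tensors between processes. In a default docker container it is often only
# 64 MB, which is far too small for nnU-Net's background workers.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**20:.0f} MiB, free: {free / 2**20:.0f} MiB")
```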

rfrs commented 1 year ago

Hi Yannick,

Do I set this when I run the training? Meaning:

nnUNetv2_train ... --ipc=host

Thanks

Best Rui

FabianIsensee commented 1 year ago

Hey Rui,

@ykirchhoff is talking about docker containers. If you use nnU-Net inside a docker container you need to give it sufficient shared memory in order for it to run properly. This is because the communication of tensors between Python processes requires it. Exceeding the available shared memory will cause processes to be killed, which manifests itself in the same symptoms you have here. I don't know how Azure handles all of that though, so I cannot say whether this is the same problem or something different.

One thing we did in nnU-Net was changing the start method of workers from fork to spawn. This could potentially cause problems for you. You can search and replace all multiprocessing.get_context("spawn").Pool code with multiprocessing.get_context("fork").Pool and try again. Please also set OMP_NUM_THREADS=1 in your environment.

If you figure out what the problem is, please share it with us so that we can help others (#1343) in the future.

Best, Fabian

PS: Maybe the following helps you out as well: The error you have means that one of the background workers is no longer alive. Since it didn't print you a proper error message (just a KeyboardInterrupt) my hypothesis is that the OS killed the worker for some reason. This can be because of a variety of things (exceeding shared memory is one), and maybe investigating that can help you get to the correct answer
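
To illustrate the change Fabian describes (the actual call sites in nnU-Net are spread over many files; this is just a minimal standalone sketch of the two start methods):

```python
import multiprocessing

def square(x):
    return x * x

if __name__ == "__main__":
    # Current nnU-Net behaviour: "spawn" starts each worker as a fresh interpreter.
    with multiprocessing.get_context("spawn").Pool(4) as pool:
        print(pool.map(square, range(8)))

    # The suggested experiment: "fork" clones the parent process instead.
    # Forked workers start faster and inherit the parent's memory, but they
    # interact differently with threads and CUDA state, which is why nnU-Net
    # switched to spawn in the first place.
    with multiprocessing.get_context("fork").Pool(4) as pool:
        print(pool.map(square, range(8)))
```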

rfrs commented 1 year ago

Hey

Thanks for the suggestions. I am running it on Azure in a conda environment, which should be similar to the environment on Google Cloud. I have not installed it via docker. I doubt there are RAM issues, as I have 220 GB of RAM available and I have been monitoring it closely. What I notice while monitoring is that the CPU cores assigned to nnUNet stop working and become idle, and the GPU also stops.

Also, to test changing multiprocessing.get_context("spawn").Pool to multiprocessing.get_context("fork").Pool, which script do I need to edit?

Thanks, I will keep you posted.

Best Rui

FabianIsensee commented 1 year ago

You need to replace all occurrences of multiprocessing.get_context("spawn").Pool with multiprocessing.get_context("fork").Pool. Many files are affected. It is best to do this via a proper IDE like PyCharm (Ctrl+Shift+R).

rfrs commented 1 year ago

Hi Fabian and Yannick...

Something happened and nnU-Net worked just fine! I was going through the scripts and found configuration.py under utilities. I changed default_num_processes = 8 to default_num_processes = 6 (since I have 6 vCPUs in the current Azure/cloud compute instance) and it ran without a problem.

Also, by using nnUNet_keep_files_open=True nnUNet_compile=True nnUNetv2_train... I could make it run about 3x faster.

ykirchhoff commented 1 year ago

Hi Rui,

that is quite unexpected, default_num_processes should, iirc, only be used during training for unpacking the data and for exporting the segmentations in the final validation run at the end of the training. So there is no obvious reason to me why it should change anything during training itself, and with 6 vCPUs default_num_processes=8 should still be no problem at all. But that might be an issue to investigate further. Anyway, I am glad it works now! Have fun playing around with nnUNet and let us know if you manage to break it again 😜

Regarding nnUNet_keep_files_open=True and nnUNet_compile=True, that improvement is expected: nnUNet_compile just enables torch.compile, and nnUNet_keep_files_open reduces the amount of memory accesses, which for some datasets becomes a major bottleneck.

Best, Yannick
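
For context, nnUNet_compile boils down to wrapping the network with PyTorch 2's torch.compile; roughly like this minimal sketch (not the trainer's actual code):

```python
import torch
import torch.nn as nn

# Toy network standing in for the nnU-Net architecture.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, kernel_size=3, padding=1),
)

# torch.compile (PyTorch >= 2.0) JIT-compiles the forward/backward passes
# into fused kernels; the first call is slow, subsequent calls are faster.
compiled_model = torch.compile(model)

x = torch.randn(2, 1, 64, 64)
print(compiled_model(x).shape)
```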

rfrs commented 1 year ago

Thanks for everything so far - I will keep you posted :) as I am also going to test different compute instances in the cloud.

Best Rui

rfrs commented 1 year ago

By applying this I could do the training, but there is still a big but... at the end of the training, during the immediate fold prediction, there is a huge RAM usage spike and then I get the same "workers have died" issue again... Do you have any suggestions?

Thanks a lot for your help so far.

rfrs commented 1 year ago

Maybe better to summarize... I replaced all multiprocessing.get_context("spawn").Pool with multiprocessing.get_context("fork").Pool and now training completes without workers dying. I also set OMP_NUM_THREADS=1 and nnUNet_n_proc_DA=4 (out of 6 cores) or nnUNet_n_proc_DA=18 (on a 24-core system). The final stage of training - the prediction of the validation fold - is accompanied by a huge spike in RAM usage until the process dies...

The last info nnU-Net gives back is:

2023-08-21 14:22:19.858534: Using splits from existing split file: ...
2023-08-21 14:22:19.908329: The split file contains 5 splits.
2023-08-21 14:22:19.916997: Desired fold for training: 0
2023-08-21 14:22:19.925106: This split has 20 training and 5 validation cases.
2023-08-21 14:22:20.080873: predicting spine1Kreduced_008
2023-08-21 14:22:32.057138: predicting spine1Kreduced_013
2023-08-21 14:22:38.197241: predicting spine1Kreduced_016
2023-08-21 14:22:45.114269: predicting spine1Kreduced_019
2023-08-21 14:22:50.689933: predicting spine1Kreduced_021

And it gets stuck consuming RAM until it bogs down... Suggestions?

ykirchhoff commented 1 year ago

Hey Rui,

glad to hear that at least the training itself is now working. Your training crashes during the final validation that nnUNet does. Do you get any of the final predictions in the validation folder? The problem might be with the segmentation export done here; there can be spikes in RAM usage, especially if your images are large. You could try reducing default_num_processes further. This will not influence anything else like dataloading; it is only used for unpacking the data and exporting the segmentations.

Best, Yannick

rfrs commented 1 year ago

Hi Yannick, once more thanks for the reply.

Of the 5 files to be predicted, I find predictions in the validation folder for 13, 16, 19 and 21, but not for 8 - perhaps something is wrong there. I checked the file and it is not corrupted. All files are about the same size: between 150 and 250 MB.

You recommend changing default_num_processes only for inference/validation? Change it from 3 to 2, for example? Thanks

Best wishes Rui

ykirchhoff commented 1 year ago

Hi Rui,

the issue then probably is case 8. What are the shapes and spacings of the 5 cases? I would assume that case 8 is exceptionally large. You can change default_num_processes in configuration.py as you did before. Just decrease it until it works or until you reach 1 (hopefully it works before that). It won't have any other effects during training. What is your current value for default_num_processes?

Best, Yannick
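
A sketch of the edit in question (the exact module path may differ between nnU-Net versions; Rui found the constant in configuration.py):

```python
# nnunetv2 configuration module (exact location may vary between versions).
# Each export worker can hold a full-resolution segmentation (and possibly
# its softmax probabilities) in RAM, so peak memory during the final
# validation scales roughly with this number.
default_num_processes = 2  # default is 8; lower it stepwise (8 -> 4 -> 2 -> 1) until the export survives
```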

rfrs commented 1 year ago

Before I test that, is there a way to prevent the validation step during training?

I had set nnUNet_n_proc_DA=15 since I have 24 vCPU cores, 220GB of RAM and an A100 with 80GB of vRAM.

rfrs commented 1 year ago

Regarding the image dimensions, I print here the headers of image 8 (not predicted by nnUNet) and image 16... they are about the same size though...

Image 8 header is:
 <class 'nibabel.nifti1.Nifti1Header'> object, endian='<'
sizeof_hdr      : 348
data_type       : b''
db_name         : b''
extents         : 0
session_error   : 0
regular         : b'r'
dim_info        : 0
dim             : [  3 512 512 541   1   1   1   1]
intent_p1       : 0.0
intent_p2       : 0.0
intent_p3       : 0.0
intent_code     : none
datatype        : float32
bitpix          : 32
slice_start     : 0
pixdim          : [1.         0.84765625 0.84765625 1.         0.         0.
 0.         0.        ]
vox_offset      : 0.0
scl_slope       : nan
scl_inter       : nan
slice_end       : 0
slice_code      : unknown
xyzt_units      : 10
cal_max         : 0.0
cal_min         : 0.0
slice_duration  : 0.0
toffset         : 0.0
glmax           : 0
glmin           : 0
descrip         : b'5.0.10'
aux_file        : b'80ml Imeron 400 Ven s G'
qform_code      : scanner
sform_code      : scanner
quatern_b       : 0.0
quatern_c       : 0.0
quatern_d       : 0.0
qoffset_x       : -228.57617
qoffset_y       : -70.57617
qoffset_z       : -596.6
srow_x          : [   0.84765625    0.            0.         -228.57617   ]
srow_y          : [  0.           0.84765625   0.         -70.57617   ]
srow_z          : [   0.     0.     1.  -596.6]
intent_name     : b''
magic           : b'n+1'
Image 16 header is:
 <class 'nibabel.nifti1.Nifti1Header'> object, endian='<'
sizeof_hdr      : 348
data_type       : b''
db_name         : b''
extents         : 0
session_error   : 0
regular         : b'r'
dim_info        : 0
dim             : [  3 512 512 547   1   1   1   1]
intent_p1       : 0.0
intent_p2       : 0.0
intent_p3       : 0.0
intent_code     : none
datatype        : float32
bitpix          : 32
slice_start     : 0
pixdim          : [1.        0.7519531 0.7519531 0.799988  0.        0.        0.
 0.       ]
vox_offset      : 0.0
scl_slope       : nan
scl_inter       : nan
slice_end       : 0
slice_code      : unknown
xyzt_units      : 10
cal_max         : 0.0
cal_min         : 0.0
slice_duration  : 0.0
toffset         : 0.0
glmax           : 0
glmin           : 0
descrip         : b'5.0.10'
aux_file        : b'1 mm pv RoutineOPTAbdom'
qform_code      : scanner
sform_code      : scanner
quatern_b       : 0.0
quatern_c       : 0.0
quatern_d       : 0.0
qoffset_x       : -181.74805
qoffset_y       : -331.74805
qoffset_z       : -542.8
srow_x          : [   0.7519531    0.           0.        -181.74805  ]
srow_y          : [   0.           0.7519531    0.        -331.74805  ]
srow_z          : [   0.          0.          0.799988 -542.8     ]
intent_name     : b''
magic           : b'n+1'

ykirchhoff commented 1 year ago

Hey,

there is no setting to disable the validation step at each epoch; you would have to write your own trainer class and adjust it yourself in the run_training method. The validation at the end can by default not be skipped either; you would again have to modify it yourself here. Also, I meant default_num_processes, not nnUNet_n_proc_DA, as this should not be related to data augmentation but to the processing of the files for prediction/segmentation export.

The files really look similar from the metadata. I am not quite sure what is going on there. Is there a chance that you could send me the dataset and trained model so I can try it out on my machine?

Best, Yannick
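
If skipping the final validation is acceptable, one way to do it without touching the library code is a tiny custom trainer. The method names below follow my reading of nnUNetTrainer at the time and may differ between versions, so treat this as a hedged sketch rather than the project's own solution:

```python
# Hypothetical trainer that skips the final validation pass (and with it the
# RAM-heavy segmentation export). Drop it into a module that nnU-Net scans for
# trainers and select it with: nnUNetv2_train ... -tr nnUNetTrainerNoFinalVal
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer


class nnUNetTrainerNoFinalVal(nnUNetTrainer):
    def perform_actual_validation(self, save_probabilities: bool = False):
        # Do nothing here; the validation cases can still be predicted later
        # with nnUNetv2_predict on a machine with more headroom.
        self.print_to_log_file("Skipping final validation (custom trainer).")
```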

rfrs commented 1 year ago

Hi,

I could "suppress" the final validation step and could then run both the training and the prediction without issues on Azure. Of course I changed multiprocessing.get_context("spawn").Pool to multiprocessing.get_context("fork").Pool throughout the code. This worked well for a test dataset of N=25... but I am having the workers-going-offline issue again with N=200... Any further suggestions?

Best Rui

ykirchhoff commented 1 year ago

Hi Rui,

just so I understand you correctly, the issue appears when you run nnUNetv2_predict ... on the test set with N=200 but it works with N=25? There is probably some problem with predicted files waiting to be saved and filling up your RAM. I will take a more detailed look into that part of the code, hopefully tomorrow, and will come back to you after that.

Best, Yannick

rfrs commented 1 year ago

Dear Yannick, thanks for the reply.

I apologise for the confusion, but I am referring again to nnUNetv2_train..., in this case with a much larger training set, from 25 to 200 images, and once more the workers stop and training halts.

Thanks Best Rui

rfrs commented 1 year ago

As an example, just during the day I could run nnUNet_keep_files_open=True nnUNet_compile=True nnUNetv2_train 251 2d all -device cuda and nnUNet_keep_files_open=True nnUNet_compile=True nnUNetv2_train 251 3d_fullres all -device cuda. Training ran fine for the 500 epochs I had set up. Dataset ID 251 is composed of the first 100 files (about 2.5 GB) from TotalSegmentator and I am training for 24 classes (all vertebrae).

Nonetheless, I have now tried to run the same for the first 250 cases (about 5 GB) of the TotalSegmentator dataset and it failed to start.

This is the configuration used by this training:
Configuration name: 2d
 {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 46, 'patch_size': [256, 256], 'median_image_size_in_voxels': [244.5, 253.0], 'spacing': [1.5, 1.5], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True} 

These are the global plan.json settings:
 {'dataset_name': 'Dataset252_totalsegm250', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.5, 1.5, 1.5], 'original_median_shape_after_transp': [244, 244, 253], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 3378.0, 'mean': 339.5414093623482, 'median': 267.0, 'min': -1168.0, 'percentile_00_5': -75.0, 'percentile_99_5': 1382.0, 'std': 265.89214249231145}}} 

2023-08-30 11:30:30.297750: unpacking dataset...
2023-08-30 11:32:45.645903: unpacking done...
2023-08-30 11:32:45.712827: do_dummy_2d_data_aug: False
2023-08-30 11:32:46.089685: Unable to plot network architecture: nnUNet_compile is enabled!
2023-08-30 11:32:46.259701: 
2023-08-30 11:32:46.267236: Epoch 0
2023-08-30 11:32:46.276101: Current learning rate: 0.001
using pin_memory on device 0

I am using the same compute instance, i.e., the same resources for both trainings: 24 vCPUs, 220 GB RAM and an A100 80GB (only about 10 GB of which is used).

Any ideas why it is failing to train on the larger dataset? The error message is the same as before: workers died.

ykirchhoff commented 1 year ago

Hi Rui,

ah, now I get it, so basically you again have the same problem as in the beginning? If I understand it correctly, your training gets stuck at this stage but doesn't directly crash? I sometimes have issues with A100 GPUs, where they get stuck randomly at the start and I haven't found the reason yet. But that was never with the default nnUNet configuration. It might be that there is a similar issue here but I am not sure. You could maybe give the V100 a try again, it might just magically work :grimacing: I have to admit I am a bit out of ideas but I will let you know if I find something which might help you.

Best, Yannick

rfrs commented 1 year ago

HI Yannick,

Once more, thanks for your reply, and again, I apologize for not being clear at first.

The problem with the V100 compute instance is that it has a very low number of CPU cores, only 6 (weird Azure/Microsoft configurations). It was bogging down each time as well - but I can of course give it another try.

I really do not understand why it fails once the number of files and the size of the dataset grow. Could it be because of keeping the files open?

Thanks for all.

Best Rui

rfrs commented 1 year ago

I was again testing the A100 and it ran for exactly 1 epoch. I copy the log here. Also, once I noticed that the task was dead, I did a keyboard interrupt.

2023-09-04 08:15:32.138661: Compiling network...

This is the configuration used by this training: Configuration name: 2d {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 46, 'patch_size': [256, 256], 'median_image_size_in_voxels': [244.5, 253.0], 'spacing': [1.5, 1.5], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True}

These are the global plan.json settings: {'dataset_name': 'Dataset252_totalsegm250', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.5, 1.5, 1.5], 'original_median_shape_after_transp': [244, 244, 253], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 3378.0, 'mean': 339.5414093623482, 'median': 267.0, 'min': -1168.0, 'percentile_00_5': -75.0, 'percentile_99_5': 1382.0, 'std': 265.89214249231145}}}

2023-09-04 08:15:32.292588: unpacking dataset... 2023-09-04 08:15:38.750538: unpacking done... 2023-09-04 08:15:38.862252: do_dummy_2d_data_aug: False 2023-09-04 08:15:39.385357: Unable to plot network architecture: nnUNet_compile is enabled! 2023-09-04 08:15:39.520333: 2023-09-04 08:15:39.528903: Epoch 0 2023-09-04 08:15:39.539532: Current learning rate: 0.001 using pin_memory on device 0 using pin_memory on device 0 2023-09-04 08:19:03.365998: train_loss 0.6896 2023-09-04 08:19:03.508641: val_loss 0.0616 2023-09-04 08:19:03.550241: Pseudo dice [0.0, 0.0, 0.0, 0.0002, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0001, 0.0, 0.0, 0.0, 0.0] 2023-09-04 08:19:03.597502: Epoch time: 203.85 s 2023-09-04 08:19:03.630496: Yayy! New best EMA pseudo Dice: 0.0 2023-09-04 08:19:07.486230: 2023-09-04 08:19:07.537513: Epoch 1 2023-09-04 08:19:07.557031: Current learning rate: 0.001 ^CProcess ForkProcess-38: Process ForkProcess-31: Process ForkProcess-21: Process ForkProcess-27: Process ForkProcess-34: Process ForkProcess-37: Process ForkProcess-35: Process ForkProcess-28: Process ForkProcess-23: Process ForkProcess-36: Process ForkProcess-40: Process ForkProcess-32: Process ForkProcess-22: Process ForkProcess-41: Process ForkProcess-19: Process ForkProcess-42: Process ForkProcess-26: Process ForkProcess-39: Process ForkProcess-25: Process ForkProcess-33: Process ForkProcess-30: Process ForkProcess-29: Process ForkProcess-24: Process ForkProcess-20: Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/bin/nnUNetv2_train", line 8, in sys.exit(run_training_entry()) File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100/code/Users/aadm.rui.ramos-santos/nnUNet_linux_arch_rs7_noval/nnunetv2/run/run_training.py", line 268, in run_training_entry Traceback (most recent call last): Traceback (most recent call last): Traceback (most recent call last): Traceback (most recent call last): Traceback (most recent call last): Traceback (most recent call last): Traceback (most recent call last): Traceback (most recent call last): Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, *self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, self._kwargs) File 
"/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, *self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) Traceback (most recent call last): Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, *self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return 
self._semlock.enter() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() KeyboardInterrupt File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, *self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) KeyboardInterrupt KeyboardInterrupt KeyboardInterrupt File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) KeyboardInterrupt File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() KeyboardInterrupt File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 103, in get res = self._recv_bytes() KeyboardInterrupt File 
"/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() KeyboardInterrupt File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/connection.py", line 216, in recv_bytes buf = self._recv_bytes(maxlength) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() KeyboardInterrupt KeyboardInterrupt KeyboardInterrupt File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes buf = self._recv(4) KeyboardInterrupt File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/connection.py", line 379, in _recv chunk = read(handle, remaining) KeyboardInterrupt Traceback (most recent call last): Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, *self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() Traceback (most recent call last): KeyboardInterrupt KeyboardInterrupt File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() KeyboardInterrupt Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File 
"/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, *self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() KeyboardInterrupt Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() KeyboardInterrupt File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() KeyboardInterrupt Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, *self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() KeyboardInterrupt File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 
102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() KeyboardInterrupt File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() KeyboardInterrupt File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, *self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/concurrent/futures/process.py", line 240, in _process_worker call_item = call_queue.get(block=True) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/queues.py", line 102, in get with self._rlock: File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/multiprocessing/synchronize.py", line 95, in enter return self._semlock.enter() KeyboardInterrupt run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights, File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100/code/Users/aadm.rui.ramos-santos/nnUNet_linux_arch_rs7_noval/nnunetv2/run/run_training.py", line 204, in run_training nnunet_trainer.run_training() File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100/code/Users/aadm.rui.ramos-santos/nnUNet_linux_arch_rs7_noval/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1237, in run_training train_outputs.append(self.train_step(next(self.dataloader_train))) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in next item = self.__get_next_item() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 188, in __get_next_item sleep(self.wait_time) KeyboardInterrupt Exception in thread Thread-4: Traceback (most recent call last): File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/threading.py", line 980, in _bootstrap_inner self.run() File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/threading.py", line 917, in run self._target(self._args, self._kwargs) File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop raise e File "/anaconda/envs/nnunet_linux_rs7noVal/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the " RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

ykirchhoff commented 1 year ago

Hi Rui,

no problem :) That really seems like a weird configuration, although I think we might have something similar for some of the V100s in our cluster. It might actually be a problem with nnUNet_keep_files_open. I first just thought about RAM, which shouldn't be an issue as it just keeps the memmapped files open and you have a lot of RAM. But you typically also have a limit on open file descriptors - so basically files - per process. For my machine it is 1024; you can check that in a terminal with ulimit -n. This might become a problem when the dataset gets larger. Additionally, nnUNet_keep_files_open probably doesn't help too much anyway, as nnUNet needs to load the data into RAM in any case and - at least if I remember correctly - only saves the reference to the memmapped file.

Best, Yannick
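
A quick way to inspect that limit from inside Python (equivalent to ulimit -n; a sketch, Linux/macOS only):

```python
import resource

# Per-process limit on open file descriptors. With nnUNet_keep_files_open=True
# every training case keeps a file handle open, so a large dataset can run
# into the soft limit well before RAM becomes a problem.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# The soft limit can usually be raised up to the hard limit without root, e.g.:
# resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
```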

ykirchhoff commented 1 year ago

Hi Rui,

just checking if you could solve your issue now?

Best, Yannick

rfrs commented 1 year ago

Hey Yannick, thanks for checking in.

Things are working better since version 2.2. Thanks.