Hi Rui,
could you please post the lines before that error message? You can usually find the problem somewhere in there.
Best, Yannick
Hi Yannick, thanks for reaching out. The message is as follows:
File "/anaconda/envs/nnunet2_py39/bin/nnUNetv2_train", line 8, in
Thanks. Best Rui
Hi Rui,
thanks for the complete error. There is a KeyboardInterrupt in your error, which, if you didn't trigger it yourself, was probably triggered by Azure for some reason. As RAM and VRAM shouldn't be a problem, it might be due to CPU usage. Did you check that? You can adjust the number of processes used for training with nnUNet_n_proc_DA, which defaults to 12. Setting nnUNet_n_proc_DA=0 will give you single-threaded data augmentation, which is generally slower but uses much less CPU. If that works, try increasing nnUNet_n_proc_DA.
Hope this helps. Best, Yannick
PS: nnUNet_n_proc_DA can be adjusted by simply doing nnUNet_n_proc_DA=0 nnUNetv2_train ...
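For anyone launching training from a script rather than a shell, the same environment variable can be set programmatically. This is a minimal sketch, not part of nnU-Net: `build_train_env`, `suggest_n_proc_da` and `launch_training` are hypothetical helper names, and the "leave two cores free" heuristic is only an assumption for illustration.

```python
import os
import subprocess


def build_train_env(n_proc_da, base_env=None):
    """Return a copy of the environment with nnUNet_n_proc_DA set."""
    env = dict(os.environ if base_env is None else base_env)
    env["nnUNet_n_proc_DA"] = str(n_proc_da)
    return env


def suggest_n_proc_da(reserve=2):
    """Hypothetical heuristic: keep `reserve` cores free for the main process."""
    cores = os.cpu_count() or 1
    return max(0, cores - reserve)


def launch_training(args, n_proc_da):
    """Run nnUNetv2_train with the given CLI arguments and worker count."""
    env = build_train_env(n_proc_da)
    return subprocess.run(["nnUNetv2_train", *args], env=env, check=True)
```

Calling `launch_training(["200", "2d", "0", "--npz"], 0)` would correspond to the single-threaded command discussed above.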
Hi Yannick, thanks for the answer. Unfortunately it did not work: the code runs for a few epochs and then stops again. I have pasted the log below. Since I have 12 vCPUs available in the environment, I set nnUNet_n_proc_DA=10 and it failed. It was really slow when the process count was set to 0 or 1.
$ nnUNet_n_proc_DA=10 nnUNetv2_train 200 2d 0 --npz Using device: cuda:0
####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. #######################################################################
This is the configuration used by this training: Configuration name: 2d {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 13, 'patch_size': [512, 448], 'median_image_size_in_voxels': [512.0, 512.0], 'spacing': [0.767578125, 0.767578125], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True}
These are the global plan.json settings: {'dataset_name': 'Dataset200_spine', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.0, 0.767578125, 0.767578125], 'original_median_shape_after_transp': [541, 512, 512], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 2663.0, 'mean': 329.7226257324219, 'median': 261.0, 'min': -941.0, 'percentile_00_5': -2.0, 'percentile_99_5': 1206.0, 'std': 241.9765625}}}
2023-08-04 13:28:13.893867: unpacking dataset...
2023-08-04 13:28:17.474627: unpacking done...
2023-08-04 13:28:17.509105: do_dummy_2d_data_aug: False
2023-08-04 13:28:17.551622: Using splits from existing split file: nnUNet_models/spine1K_n25_2xV100_test/nnUNet_preprocessed/Dataset200_spine/splits_final.json
2023-08-04 13:28:17.568009: The split file contains 5 splits.
2023-08-04 13:28:17.576300: Desired fold for training: 0
2023-08-04 13:28:17.584742: This split has 20 training and 5 validation cases.
2023-08-04 13:28:19.093326: Unable to plot network architecture:
2023-08-04 13:28:19.168411: module 'torch.jit' has no attribute 'get_trace_graph'
2023-08-04 13:28:19.315055:
2023-08-04 13:28:19.323275: Epoch 0
2023-08-04 13:28:19.332131: Current learning rate: 0.01
using pin_memory on device 0
using pin_memory on device 0
/anaconda/envs/nnunet2_py39/lib/python3.9/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:970: RuntimeWarning: invalid value encountered in scalar divide
global_dc_per_class = [i for i in [2 * i / (2 * i + j + k) for i, j, k in
2023-08-04 13:30:11.100131: train_loss 0.2669
2023-08-04 13:30:11.193105: val_loss 0.0517
2023-08-04 13:30:11.205193: Pseudo dice [0.0, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2023-08-04 13:30:11.225502: Epoch time: 111.79 s
2023-08-04 13:30:11.235616: Yayy! New best EMA pseudo Dice: 0.0
2023-08-04 13:30:14.811116:
2023-08-04 13:30:14.833648: Epoch 1
2023-08-04 13:30:14.842434: Current learning rate: 0.00999
2023-08-04 13:31:42.493429: train_loss 0.0481
2023-08-04 13:31:42.638597: val_loss 0.035
2023-08-04 13:31:42.673699: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0603, 0.0, 0.0, 0.0, 0.0]
2023-08-04 13:31:42.690906: Epoch time: 87.68 s
2023-08-04 13:31:42.706916: Yayy! New best EMA pseudo Dice: 0.0003
2023-08-04 13:31:46.506501:
2023-08-04 13:31:46.524667: Epoch 2
2023-08-04 13:31:46.532792: Current learning rate: 0.00998
2023-08-04 13:33:14.203712: train_loss 0.0312
2023-08-04 13:33:14.317371: val_loss 0.0204
2023-08-04 13:33:14.334751: Pseudo dice [nan, nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0084, 0.0, 0.0992, 0.0003, 0.0, 0.0, 0.0]
2023-08-04 13:33:14.359694: Epoch time: 87.7 s
2023-08-04 13:33:14.376047: Yayy! New best EMA pseudo Dice: 0.0009
2023-08-04 13:33:18.812734:
2023-08-04 13:33:18.831534: Epoch 3
2023-08-04 13:33:18.841416: Current learning rate: 0.00997
2023-08-04 13:34:49.909967: train_loss 0.0225
2023-08-04 13:34:50.001015: val_loss 0.0122
2023-08-04 13:34:50.067200: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0391, 0.0, 0.0227, 0.0193, 0.1111, 0.0, 0.0]
2023-08-04 13:34:50.076057: Epoch time: 91.1 s
2023-08-04 13:34:50.103311: Yayy! New best EMA pseudo Dice: 0.0018
2023-08-04 13:34:53.981759:
2023-08-04 13:34:54.005108: Epoch 4
2023-08-04 13:34:54.025370: Current learning rate: 0.00996
2023-08-04 13:36:18.416863: train_loss 0.0137
2023-08-04 13:36:18.559083: val_loss 0.0065
2023-08-04 13:36:18.572116: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1368, 0.0, 0.1498, 0.0, 0.0]
2023-08-04 13:36:18.589587: Epoch time: 84.44 s
2023-08-04 13:36:18.598529: Yayy! New best EMA pseudo Dice: 0.0031
2023-08-04 13:36:22.174985:
2023-08-04 13:36:22.198013: Epoch 5
2023-08-04 13:36:22.206127: Current learning rate: 0.00995
2023-08-04 13:37:49.409249: train_loss 0.0064
2023-08-04 13:37:49.512477: val_loss -0.0033
2023-08-04 13:37:49.527299: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1247, 0.0, 0.0, 0.0, 0.2593, 0.1335]
2023-08-04 13:37:49.554250: Epoch time: 87.24 s
2023-08-04 13:37:49.564446: Yayy! New best EMA pseudo Dice: 0.0055
2023-08-04 13:37:52.940592:
2023-08-04 13:37:52.959001: Epoch 6
2023-08-04 13:37:52.967198: Current learning rate: 0.00995
2023-08-04 13:39:19.627892: train_loss -0.0041
2023-08-04 13:39:19.742339: val_loss -0.0134
2023-08-04 13:39:19.760233: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0072, 0.0, 0.0, 0.0002, 0.3249, 0.0, 0.1936]
2023-08-04 13:39:19.773959: Epoch time: 86.69 s
2023-08-04 13:39:19.798073: Yayy! New best EMA pseudo Dice: 0.0078
2023-08-04 13:39:23.230338:
2023-08-04 13:39:23.269459: Epoch 7
2023-08-04 13:39:23.277954: Current learning rate: 0.00994
2023-08-04 13:40:53.870357: train_loss -0.0171
2023-08-04 13:40:53.989840: val_loss -0.0252
2023-08-04 13:40:54.008450: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0125, 0.0024, 0.0155, 0.0, 0.3599, 0.0455, 0.18]
2023-08-04 13:40:54.032427: Epoch time: 90.64 s
2023-08-04 13:40:54.055148: Yayy! New best EMA pseudo Dice: 0.0102
2023-08-04 13:40:57.568206:
2023-08-04 13:40:57.590469: Epoch 8
2023-08-04 13:40:57.598779: Current learning rate: 0.00993
2023-08-04 13:42:24.810461: train_loss -0.0314
2023-08-04 13:42:24.915588: val_loss -0.039
2023-08-04 13:42:24.937008: Pseudo dice [nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.093, 0.011, 0.3005, 0.2939, 0.4449, 0.0002, 0.1842]
2023-08-04 13:42:24.950041: Epoch time: 87.24 s
2023-08-04 13:42:24.966740: Yayy! New best EMA pseudo Dice: 0.0162
2023-08-04 13:42:28.827302:
2023-08-04 13:42:28.846198: Epoch 9
2023-08-04 13:42:28.854600: Current learning rate: 0.00992
^CTraceback (most recent call last):
File "/anaconda/envs/nnunet2_py39/bin/nnUNetv2_train", line 8, in
Do you have any further advice? We would really like to have this running in the cloud, especially since we don't have beefy enough GPUs on site.
Thanks Best Rui
Hi Rui,
yeah, training times become a pain when setting a low nnUNet_n_proc_DA.
It is kind of weird that the error appears only after a few epochs. It seems to be triggered by Azure for some reason, so if it is not RAM, VRAM or CPU, I am a bit out of ideas here. The only other thing I can think of right now is some kind of timeout. Does it crash consistently after 9 epochs?
Do other algorithms work fine on Azure?
Best, Yannick
Hey Yannick, no, it is not consistent: it has crashed at epoch 1 as well as at epoch 9, etc. Do you have any recommendation for deploying nnUNet in cloud computing, i.e., specs for the compute instance to use?
Thanks for all Best Rui
Hi Rui,
no problem at all, and sorry that I couldn't help more. I will try to find out more about your problem. Regarding specs, anything with at least 10GB VRAM, preferably 6/12 cores/threads, and 32GB or better 64GB of RAM should be more than enough. So nearly everything that is available for cloud computing ;). That is also why I am kind of confused about your problems with Azure. If you just want to try out and test the nnUNet pipeline, you can have a look at the nnUNet workshop, which details how to set up nnUNet in Google Colab. Just take the parts you need and use your own data. With the free option you can run trainings for 12 hours, if I am correct, so probably not enough to train a complete model but good enough for some initial tests without wasting money on compute instances where trainings just fail.
Best, Yannick
Hey Rui,
something came up in a discussion: are you running nnUNet in a docker container or similar on Azure? With docker you have to set --ipc=host or --shm-size=8GB in your docker run command (8GB just as an example), otherwise you run into issues with transferring data between the main training process and the processes used for dataloading and augmentation.
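To see how much shared memory a container or VM actually provides, one can inspect the /dev/shm mount from Python. This is only a sketch assuming a Linux system where shared memory is mounted at /dev/shm; `shared_memory_gib` is a hypothetical helper name, not an nnU-Net function.

```python
import os
import shutil


def shared_memory_gib(path="/dev/shm"):
    """Return (total, free) in GiB for the shared-memory mount, or None if absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.free / gib


if __name__ == "__main__":
    info = shared_memory_gib()
    if info is None:
        print("/dev/shm not found on this system")
    else:
        print(f"/dev/shm: {info[0]:.2f} GiB total, {info[1]:.2f} GiB free")
```

Inside a docker container started without --ipc=host or --shm-size, the reported total is typically very small (Docker's default is 64 MB), which is the situation described above.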
Best, Yannick
Hi Yannick,
Do I set this when I run the training? Meaning
nnUNetv2_train ... --ipc=host
Thanks
Best Rui
Hey Rui,
@ykirchhoff is talking about docker containers. If you use nnU-Net inside a docker container you need to give it sufficient shared memory in order for it to run properly. This is because the communication of tensors between python processes requires that. Exceeding the available shared memory will cause processes to be killed, which manifests itself in the same symptoms you also have here.
I don't know how azure is handling all of that though, so I cannot say whether this is the same problem or something different.
One thing we did in nnU-Net was changing the start method of workers from fork to spawn. This could potentially cause problems for you. You can search and replace all multiprocessing.get_context("spawn").Pool code with multiprocessing.get_context("fork").Pool and try again. Please also set OMP_NUM_THREADS=1 in your environment.
If you figure out what the problem is, please share it with us so that we can help others (#1343) in the future.
Best,
Fabian
PS: Maybe the following helps you out as well: The error you have means that one of the background workers is no longer alive. Since it didn't print you a proper error message (just a KeyboardInterrupt) my hypothesis is that the OS killed the worker for some reason. This can be because of a variety of things (exceeding shared memory is one), and maybe investigating that can help you get to the correct answer
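One way to investigate the "killed by the OS" hypothesis is to look at a worker's exit code: for Python's multiprocessing, a negative exitcode of -N means the process was terminated by signal N. The sketch below reproduces this in isolation on Linux. It is not nnU-Net code; `_idle_worker`, `describe_exit` and `demo_killed_worker` are hypothetical names used for illustration.

```python
import multiprocessing
import os
import signal
import time


def _idle_worker():
    """Stand-in for a background worker that is busy for a while."""
    time.sleep(60)


def describe_exit(exitcode):
    """Translate a multiprocessing exitcode into a readable message."""
    if exitcode is None:
        return "still running"
    if exitcode < 0:
        return f"killed by signal {signal.Signals(-exitcode).name}"
    return f"exited with code {exitcode}"


def demo_killed_worker():
    """Start a worker, kill it the way an OOM killer would, return its exitcode."""
    ctx = multiprocessing.get_context("fork")
    proc = ctx.Process(target=_idle_worker)
    proc.start()
    os.kill(proc.pid, signal.SIGKILL)  # simulate the OS killing the worker
    proc.join()
    return proc.exitcode
```

A worker killed with SIGKILL leaves no traceback of its own, which matches the symptom of seeing only a KeyboardInterrupt in the main process.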
Hey
Thanks for the suggestions. I am running it on Azure in a conda environment, which should be similar to the environment in Google Cloud. I have not installed it via docker. I doubt there are RAM issues, as I have 220 GB of RAM available and I have been monitoring it closely. What I notice while monitoring is that the CPU cores assigned to nnUNet stop working and become idle, and the GPU also stops.
Also, to test the change from multiprocessing.get_context("spawn").Pool to multiprocessing.get_context("fork").Pool, which script do I need to edit?
Thanks, I will keep you posted.
Best Rui
You need to replace all occurrences of multiprocessing.get_context("spawn").Pool with multiprocessing.get_context("fork").Pool. Many files are affected. Best to do this via a proper IDE like PyCharm (Ctrl+Shift+R).
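Without an IDE, the same search and replace can be scripted. The following is a minimal sketch, assuming the string appears in the source exactly as quoted in this thread; `replace_in_tree` is a hypothetical helper, and the directory you pass in (for example a local checkout of the nnunetv2 package) is up to you. It defaults to a dry run so you can review the affected files first.

```python
from pathlib import Path

OLD = 'multiprocessing.get_context("spawn").Pool'
NEW = 'multiprocessing.get_context("fork").Pool'


def replace_in_tree(root, old=OLD, new=NEW, dry_run=True):
    """List (path, count) for .py files containing `old`; rewrite them unless dry_run."""
    changed = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        count = text.count(old)
        if count:
            changed.append((path, count))
            if not dry_run:
                path.write_text(text.replace(old, new), encoding="utf-8")
    return changed
```

Running it once with the default `dry_run=True` prints nothing but returns the list of matches; calling it again with `dry_run=False` applies the change in place.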
Hi Fabian and Yannick...
Something happened and nnUNet worked just fine!
I was going through the scripts and found configuration.py under utilities. I changed default_num_processes = 8 to default_num_processes = 6 (since I have 6 vCPUs in the current Azure compute instance) and it ran without a problem.
Also, by using nnUNet_keep_files_open=True nnUNet_compile=True nnUNetv2_train ... I could make it run about 3x faster.
Hi Rui,
that is quite unexpected. During training, default_num_processes should, iirc, only be used for unpacking the data and for exporting the segmentations in the final validation run at the end of the training. So there is no obvious reason to me why it should change anything during training itself. And with 6 vCPUs, default_num_processes=8 should still be no problem at all. But that might be an issue to investigate further.
But I am glad it works now! Have fun playing around with nnUNet and let us know if you manage to break it again 😜
As for nnUNet_keep_files_open=True and nnUNet_compile=True, those are expected to significantly improve training performance. nnUNet_compile just enables torch.compile, and nnUNet_keep_files_open reduces the number of memory accesses, which for some datasets becomes a major bottleneck.
Best, Yannick
Thanks for all so far - i will keep you posted :) as i am also going to test different compute instances in the cloud.
Best Rui
By applying this, I could run the training, but there is still a big but ... at the end of the training, during the immediate fold prediction, there is a huge spike in RAM usage and then I again get the same "workers have died" issue... Do you have any suggestions?
Thanks a lot for your help so far.
Maybe better to summarize ...
I replaced all multiprocessing.get_context("spawn").Pool with multiprocessing.get_context("fork").Pool and now training completes without workers dying. I also set OMP_NUM_THREADS=1 and nnUNet_n_proc_DA=4 (out of 6 cores) or nnUNet_n_proc_DA=18 (on a 24-core system).
The final stage of training, the prediction of the fold, is accompanied by a huge spike in RAM usage until the process dies...
The last output nnUNet gives is:
2023-08-21 14:22:19.858534: Using splits from existing split file: ...
2023-08-21 14:22:19.908329: The split file contains 5 splits.
2023-08-21 14:22:19.916997: Desired fold for training: 0
2023-08-21 14:22:19.925106: This split has 20 training and 5 validation cases.
2023-08-21 14:22:20.080873: predicting spine1Kreduced_008
2023-08-21 14:22:32.057138: predicting spine1Kreduced_013
2023-08-21 14:22:38.197241: predicting spine1Kreduced_016
2023-08-21 14:22:45.114269: predicting spine1Kreduced_019
2023-08-21 14:22:50.689933: predicting spine1Kreduced_021
And it gets stuck consuming RAM until it bogs down .... Suggestions?
Hey Rui,
glad to hear that at least the training itself is now working. Your training crashes during the final validation that nnUNet runs. Do you get any of the final predictions in the validation folder?
The problem might be with the segmentation export done here; there can be spikes in RAM usage, especially if your images are large. You could try reducing default_num_processes here further. This will not influence anything else like dataloading; it is only used for unpacking the data and exporting the segmentations.
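A rough back-of-the-envelope estimate shows why the export step can spike in RAM and why fewer export processes helps. The sketch below assumes (this is not confirmed in the thread) that each case being exported holds a full float32 probability volume with one channel per class at the original image size. The shape 541 x 512 x 512 is the median shape from the plans printed earlier, and 25 classes means the 24 foreground labels plus background. `probability_volume_bytes` and `peak_export_gib` are hypothetical helper names.

```python
def probability_volume_bytes(shape, num_classes, bytes_per_value=4):
    """Bytes needed for one per-class probability volume (float32 by default)."""
    voxels = 1
    for dim in shape:
        voxels *= dim
    return voxels * num_classes * bytes_per_value


def peak_export_gib(shape, num_classes, num_processes, bytes_per_value=4):
    """Upper-bound estimate in GiB if every export process holds one such volume."""
    per_case = probability_volume_bytes(shape, num_classes, bytes_per_value)
    return num_processes * per_case / 1024 ** 3


if __name__ == "__main__":
    shape = (541, 512, 512)
    for n in (8, 3, 1):
        print(n, "processes:", round(peak_export_gib(shape, 25, n), 1), "GiB")
```

Under these assumptions a single case needs about 13.2 GiB, so the total scales quickly with the number of parallel export processes, before counting any intermediate copies made during resampling.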
Best, Yannick
Hi Yannick, once more thanks for the reply.
Of the 5 files to be predicted, I find predictions in the validation folder for 13, 16, 19 and 21, but not for 8, so perhaps something is wrong there. I checked the file and it is not corrupted. All files are about the same size: between 150 and 250 MB.
Do you recommend changing default_num_processes only for inference/validation? Change it from 3 to 2, for example?
Thanks
Best wishes Rui
Hi Rui,
the issue then probably is case 8. What are the shapes and spacings of the 5 cases? I would assume that case 8 is exceptionally large. You can change default_num_processes in configuration.py as you did before. Just decrease it until it works or you reach 1 (hopefully it works before that). It won't have any other effects during training. What is your current value for default_num_processes?
Best, Yannick
Before I test that, is there a way to prevent the validation step during training?
I had set nnUNet_n_proc_DA=15 since I have 24 vCPU cores, 220GB of RAM and an A100 with 80GB of VRAM.
Regarding the image dimensions, I print here the headers of image 8 (not predicted by nnUNet) and image 16... they are about the same size though...
Image 8 header is:
<class 'nibabel.nifti1.Nifti1Header'> object, endian='<'
sizeof_hdr : 348
data_type : b''
db_name : b''
extents : 0
session_error : 0
regular : b'r'
dim_info : 0
dim : [ 3 512 512 541 1 1 1 1]
intent_p1 : 0.0
intent_p2 : 0.0
intent_p3 : 0.0
intent_code : none
datatype : float32
bitpix : 32
slice_start : 0
pixdim : [1. 0.84765625 0.84765625 1. 0. 0.
0. 0. ]
vox_offset : 0.0
scl_slope : nan
scl_inter : nan
slice_end : 0
slice_code : unknown
xyzt_units : 10
cal_max : 0.0
cal_min : 0.0
slice_duration : 0.0
toffset : 0.0
glmax : 0
glmin : 0
descrip : b'5.0.10'
aux_file : b'80ml Imeron 400 Ven s G'
qform_code : scanner
sform_code : scanner
quatern_b : 0.0
quatern_c : 0.0
quatern_d : 0.0
qoffset_x : -228.57617
qoffset_y : -70.57617
qoffset_z : -596.6
srow_x : [ 0.84765625 0. 0. -228.57617 ]
srow_y : [ 0. 0.84765625 0. -70.57617 ]
srow_z : [ 0. 0. 1. -596.6]
intent_name : b''
magic : b'n+1'
Image 16 header is:
<class 'nibabel.nifti1.Nifti1Header'> object, endian='<'
sizeof_hdr : 348
data_type : b''
db_name : b''
extents : 0
session_error : 0
regular : b'r'
dim_info : 0
dim : [ 3 512 512 547 1 1 1 1]
intent_p1 : 0.0
intent_p2 : 0.0
intent_p3 : 0.0
intent_code : none
datatype : float32
bitpix : 32
slice_start : 0
pixdim : [1. 0.7519531 0.7519531 0.799988 0. 0. 0.
0. ]
vox_offset : 0.0
scl_slope : nan
scl_inter : nan
slice_end : 0
slice_code : unknown
xyzt_units : 10
cal_max : 0.0
cal_min : 0.0
slice_duration : 0.0
toffset : 0.0
glmax : 0
glmin : 0
descrip : b'5.0.10'
aux_file : b'1 mm pv RoutineOPTAbdom'
qform_code : scanner
sform_code : scanner
quatern_b : 0.0
quatern_c : 0.0
quatern_d : 0.0
qoffset_x : -181.74805
qoffset_y : -331.74805
qoffset_z : -542.8
srow_x : [ 0.7519531 0. 0. -181.74805 ]
srow_y : [ 0. 0.7519531 0. -331.74805 ]
srow_z : [ 0. 0. 0.799988 -542.8 ]
intent_name : b''
magic : b'n+1'
Hey,
there is no setting to disable the validation step at each epoch; you would have to write your own trainer class and change it yourself in the run_training method. By default, the validation at the end cannot be skipped either. You would again have to modify it yourself here.
I meant default_num_processes, not nnUNet_n_proc_DA, as this should not be related to data augmentation but to the processing of the files for prediction/segmentation export.
The files really look similar from the metadata. I am not quite sure what is going on there. Is there a chance that you could send me the dataset and trained model so I can try it out on my machine?
Best, Yannick
Hi,
I could "suppress" the final validation step and could run both the training and the prediction without issues on Azure. Of course, throughout the code I replaced multiprocessing.get_context("spawn").Pool with multiprocessing.get_context("fork").Pool. This worked well for a test dataset of N=25... But I again have the issue of workers going offline with N=200... Any further suggestions?
Best Rui
Hi Rui,
just so I understand you correctly, the issue appears when you run nnUNetv2_predict ... on the test set with N=200 but works on N=25? There is probably some problem with predicted files waiting to be saved and filling up your RAM. I will take a more detailed look into that part of the code, hopefully tomorrow, and will come back to you after that.
Best, Yannick
Dear Yannick, thanks for the reply.
I apologise for the confusion, but I am referring again to nnUNetv2_train ..., in this case with a much larger training set (200 images instead of 25). Once more, workers stop and training halts.
Thanks Best Rui
As an example, just during the day I could run nnUNet_keep_files_open=True nnUNet_compile=True nnUNetv2_train 251 2d all -device cuda and nnUNet_keep_files_open=True nnUNet_compile=True nnUNetv2_train 251 3d_fullres all -device cuda. Training ran fine for the 500 epochs I had set up. Dataset ID 251 is composed of the first 100 files (about 2.5 GB) from TotalSegmentator and I am training for 24 classes (all vertebrae).
Nonetheless, I have now tried to run the same for the first 250 cases (about 5 GB) of the TotalSegmentator dataset and it failed to start.
`This is the configuration used by this training:
Configuration name: 2d
{'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 46, 'patch_size': [256, 256], 'median_image_size_in_voxels': [244.5, 253.0], 'spacing': [1.5, 1.5], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True}
These are the global plan.json settings:
{'dataset_name': 'Dataset252_totalsegm250', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.5, 1.5, 1.5], 'original_median_shape_after_transp': [244, 244, 253], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 3378.0, 'mean': 339.5414093623482, 'median': 267.0, 'min': -1168.0, 'percentile_00_5': -75.0, 'percentile_99_5': 1382.0, 'std': 265.89214249231145}}}
2023-08-30 11:30:30.297750: unpacking dataset...
2023-08-30 11:32:45.645903: unpacking done...
2023-08-30 11:32:45.712827: do_dummy_2d_data_aug: False
2023-08-30 11:32:46.089685: Unable to plot network architecture: nnUNet_compile is enabled!
2023-08-30 11:32:46.259701:
2023-08-30 11:32:46.267236: Epoch 0
2023-08-30 11:32:46.276101: Current learning rate: 0.001
using pin_memory on device 0`
I am using the same compute instance, i.e., the same resources, for both trainings: 24 vCPUs, 220 GB RAM and an A100 80GB (only about 10 GB is used).
Any ideas why it is failing to train on the larger dataset? The error message is the same as before: workers died.
Hi Rui,
ah now I get it, so basically you again have the same problem as in the beginning? If I understand it correctly, your training gets stuck at this stage but doesn't directly crash? I sometimes have issues with A100 GPUs, where it gets stuck randomly at the start and I couldn't find the reason yet. But that was never with the default nnUNet configuration. It might be that there is a similar issue here but I am not sure. You could maybe give the V100 a try again, it might just magically work 😬 I have to admit I am a bit out of ideas but I will let you know if I find something which might help you.
Best, Yannick
Hi Yannick,
Once more, thanks for your reply, and again, I apologize for not being clear at first.
The problem with the V100 compute instance is that it has a very low number of CPU cores, only 6 (weird Azure/Microsoft configurations). It was bogging down each time as well, but I can give it another try of course.
I really do not understand why it fails once the number of files and the size of the dataset grow. Could it be because of keeping the files open?
Thanks for all.
Best Rui
I was again testing the A100 and it ran for exactly 1 epoch. I copy the log here. Also, once I noticed that the task was dead, I did a keyboard interrupt.
2023-09-04 08:15:32.138661: Compiling network...
This is the configuration used by this training: Configuration name: 2d {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 46, 'patch_size': [256, 256], 'median_image_size_in_voxels': [244.5, 253.0], 'spacing': [1.5, 1.5], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True}
These are the global plan.json settings: {'dataset_name': 'Dataset252_totalsegm250', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.5, 1.5, 1.5], 'original_median_shape_after_transp': [244, 244, 253], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 3378.0, 'mean': 339.5414093623482, 'median': 267.0, 'min': -1168.0, 'percentile_00_5': -75.0, 'percentile_99_5': 1382.0, 'std': 265.89214249231145}}}
2023-09-04 08:15:32.292588: unpacking dataset...
2023-09-04 08:15:38.750538: unpacking done...
2023-09-04 08:15:38.862252: do_dummy_2d_data_aug: False
2023-09-04 08:15:39.385357: Unable to plot network architecture: nnUNet_compile is enabled!
2023-09-04 08:15:39.520333:
2023-09-04 08:15:39.528903: Epoch 0
2023-09-04 08:15:39.539532: Current learning rate: 0.001
using pin_memory on device 0
using pin_memory on device 0
2023-09-04 08:19:03.365998: train_loss 0.6896
2023-09-04 08:19:03.508641: val_loss 0.0616
2023-09-04 08:19:03.550241: Pseudo dice [0.0, 0.0, 0.0, 0.0002, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0001, 0.0, 0.0, 0.0, 0.0]
2023-09-04 08:19:03.597502: Epoch time: 203.85 s
2023-09-04 08:19:03.630496: Yayy! New best EMA pseudo Dice: 0.0
2023-09-04 08:19:07.486230:
2023-09-04 08:19:07.537513: Epoch 1
2023-09-04 08:19:07.557031: Current learning rate: 0.001
^CProcess ForkProcess-38:
Process ForkProcess-31:
Process ForkProcess-21:
Process ForkProcess-27:
Process ForkProcess-34:
Process ForkProcess-37:
Process ForkProcess-35:
Process ForkProcess-28:
Process ForkProcess-23:
Process ForkProcess-36:
Process ForkProcess-40:
Process ForkProcess-32:
Process ForkProcess-22:
Process ForkProcess-41:
Process ForkProcess-19:
Process ForkProcess-42:
Process ForkProcess-26:
Process ForkProcess-39:
Process ForkProcess-25:
Process ForkProcess-33:
Process ForkProcess-30:
Process ForkProcess-29:
Process ForkProcess-24:
Process ForkProcess-20:
Traceback (most recent call last):
File "/anaconda/envs/nnunet_linux_rs7noVal/bin/nnUNetv2_train", line 8, in
Hi Rui,
no problem :) That really seems like a weird configuration. Although I think we might have something similar for some of the V100s in our cluster.
It might actually be a problem with nnUNet_keep_files_open. I first just thought about RAM, which shouldn't be an issue as it just keeps the memmapped files open and you have a lot of RAM. But you typically also have a limit on the number of open file descriptors (so basically files) per process. On my machine it is 1024; you can check that in a terminal with ulimit -n. This might become a problem when the dataset gets larger. Additionally, nnUNet_keep_files_open probably doesn't help too much, as nnUNet needs to load the data into RAM anyway, since it (at least if I remember correctly) only saves the reference as a memmapped file.
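The same limit that ulimit -n reports can be read from Python with the standard resource module (Unix only). The sketch below also compares it against a rough count of files that might be kept open. The figure of two open files per case (image and segmentation) and the headroom of 64 descriptors are assumptions for illustration, not something confirmed for nnU-Net, and `open_file_limits` and `may_exceed_fd_limit` are hypothetical helper names.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def may_exceed_fd_limit(num_cases, files_per_case=2, headroom=64, soft_limit=None):
    """Rough check: could keeping every case's files open hit the soft limit?"""
    if soft_limit is None:
        soft_limit = open_file_limits()[0]
    if soft_limit == resource.RLIM_INFINITY:
        return False
    return num_cases * files_per_case + headroom > soft_limit


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft limit:", soft, "hard limit:", hard)
    print("250 cases may exceed the limit:", may_exceed_fd_limit(250))
```

Note that each data augmentation worker is a separate process with its own descriptor table, so the per-process count depends on how the files are opened in each worker.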
Best, Yannick
Hi Rui,
just checking if you could solve your issue now?
Best, Yannick
Hey Yannick, thanks for checking in.
Things are working better since version 2.2. Thanks.
Dear all,
I have been testing the nnUNet pipeline on Azure. I am performing preliminary tests with N=25 3D images, in a compute instance with 200GB of RAM and an Nvidia V100 with 16GB of VRAM. Nonetheless, during the training step, it runs for a few epochs and stops... but does not provide any error message... It does not seem to be a VRAM issue, as usage stays steady at about 50-60%, nor does it seem to be a RAM issue. Any ideas to troubleshoot this?
I am getting the error: RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
I tried setting OMP_NUM_THREADS=1 but I still keep getting the error. Ideas?
Thanks. Best Rui