My problem looks like this one: training is stuck at epoch 0, and each epoch takes hours to train. https://github.com/MIC-DKFZ/nnUNet/issues/443#issuecomment-815500244 As he mentioned, RAM should be one of the main bottlenecks. Every time I start a training, there are some Python processes, which all look the same, occupying all the CPUs. The real training process only gets very few resources. According to nvidia-smi, the real process should be 50325. Do you know why that happens? Or is the program supposed to behave like that?
More CPUs seems like the obvious answer? Or does your colleague use the exact same GPU-CPU config?
Thank you very much for your reply. CPU may be the problem, but as I wrote above, the CPU is occupied by some strange processes every time I start a training. How could that happen? I don't use the same CPU-GPU configuration as my colleague.
I don't think there's a 1-to-1 mapping between the PID shown by nvidia-smi and the process that should use the most CPU. It looks like nvidia-smi shows the process that uses the most GPU memory among those processes, or something like that? At least, that's the case for you and also for my nvidia-smi output. And I see at least 12 processes spawned by nnU-Net, and 10 of them use almost 99.7% or 100% CPU! So your CPUs really are doing a lot.
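If you want to check this yourself, here is a minimal sketch (assuming the `psutil` package is installed; `TRAIN_PID` is a placeholder for the PID that nvidia-smi shows for your run). It walks the process tree of the training process, so you can see that the "strange" high-CPU processes are just its augmentation workers:

```python
import time
import psutil

TRAIN_PID = 225465  # placeholder: the PID nvidia-smi reports for your training

parent = psutil.Process(TRAIN_PID)
procs = [parent] + parent.children(recursive=True)  # workers are children
for p in procs:
    p.cpu_percent(interval=None)   # prime the per-process CPU counters
time.sleep(1.0)                    # measure usage over one second
for p in procs:
    print(p.pid, f"{p.cpu_percent(interval=None):6.1f}% cpu", p.name())
```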
For me, I observed this when I used a stronger GPU. Then the CPUs were the bottleneck, causing my GPU to be basically idle, and it really slowed down my training. However, your training times are extremely long; that should not be the case. Did you try the dummy dataloader experiment? (https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/expected_epoch_times.md)
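If you don't want to modify nnU-Net itself, you can reproduce the idea behind that experiment with a rough stand-alone sketch like the one below: time pure GPU compute on random patches of the same size as in your plans. The tiny Conv3d net is a stand-in, not nnU-Net's actual trainer. If this runs fast while your real epochs crawl, the CPU-side data pipeline is the bottleneck:

```python
import time
import torch

device = torch.device("cuda")
net = torch.nn.Conv3d(1, 32, 3, padding=1).to(device)  # stand-in model
opt = torch.optim.SGD(net.parameters(), lr=0.01)

# batch_size=2 and patch_size 96x256x96, matching the plans in this thread
x = torch.randn(2, 1, 96, 256, 96, device=device)

torch.cuda.synchronize()
start = time.time()
for _ in range(50):
    opt.zero_grad()
    loss = net(x).mean()   # dummy loss, just to drive forward + backward
    loss.backward()
    opt.step()
torch.cuda.synchronize()
print(f"{(time.time() - start) / 50:.3f} s per dummy batch")
```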
Also, for me this was REALLY important: Install PyTorch as described on their website (conda/pip). Please install the latest version and (IMPORTANT!) choose the highest CUDA version compatible with your drivers for maximum performance. DO NOT JUST PIP INSTALL NNUNET WITHOUT PROPERLY INSTALLING PYTORCH FIRST
It made a huuuuuuge difference which I did not expect.
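A quick way to verify the install actually picked up the GPU (standard PyTorch calls, nothing nnU-Net specific):

```python
import torch

print(torch.__version__)            # e.g. a +cuXXX build, not a CPU-only one
print(torch.cuda.is_available())    # must be True, else training runs on CPU
print(torch.version.cuda)           # CUDA version PyTorch was built against
print(torch.cuda.get_device_name(0))
```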
It's strange: with the target spacing array([0.88085902, 0.60000002, 0.88085902]), the data augmentation generator delivers the training data very slowly, while after I change the target spacing to array([0.97656202, 0.625, 0.97656202]), the data augmentation generator works well.
A smaller spacing results in larger data, which takes longer to train. But could that cause such a huge difference in epoch time? A rough back-of-the-envelope check is sketched below.
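This sketch just applies new_shape = old_shape * old_spacing / new_spacing with the numbers from the plans in this thread; it is not nnU-Net's actual resampling code:

```python
import numpy as np

median_shape = np.array([512, 1684, 512])   # stage-1 median size at spacing_a
spacing_a = np.array([0.88085902, 0.60000002, 0.88085902])
spacing_b = np.array([0.97656202, 0.625, 0.97656202])

voxels_a = np.prod(median_shape)                        # data already at spacing_a
voxels_b = np.prod(median_shape * spacing_a / spacing_b)  # resampled to spacing_b
print(f"coarser spacing shrinks volumes to {voxels_b / voxels_a:.1%}")
# ~78% of the voxels: noticeable, but on its own it should not explain
# hour-long batches; the per-patch resampling/augmentation cost matters more.
```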
Hi,
The epoch time is extremely long in my case. Each batch takes about half an hour to train, and there are 250 batches per epoch. Meanwhile, each epoch takes only about 800 s for my colleague, and we are using the same dataset. The GPU utilization is also extremely low during training, always around 3-4%, as shown below.

Wed Jun 22 14:03:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   35C    P8    37W / 350W |   3858MiB / 24259MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1782      G   /usr/lib/xorg/Xorg                191MiB |
|    0   N/A  N/A      1923      G   /usr/bin/gnome-shell               51MiB |
|    0   N/A  N/A      2631      G   ...812947992597288929,131072       81MiB |
|    0   N/A  N/A     29026      G   /proc/self/exe                     26MiB |
|    0   N/A  N/A    225465      C   .../envs/testSong/bin/python     3502MiB |
+-----------------------------------------------------------------------------+

The CPU is heavily used:

top - 14:06:37 up  3:22,  1 user,  load average: 10.21, 10.36, 8.48
Tasks: 405 total,  12 running, 393 sleeping,   0 stopped,   0 zombie
%Cpu(s): 58.8 us,  5.0 sy,  0.0 ni, 36.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64035.6 total,    522.7 free,  16163.3 used,  47349.6 buff/cache
MiB Swap:   2048.0 total,   1384.7 free,    663.2 used.  46165.2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
225650 hurwa 20 0 17.1g 3.9g 165208 R 100.3 6.3 19:49.94 python
225643 hurwa 20 0 17.1g 3.9g 160776 R 100.0 6.3 19:53.89 python
225645 hurwa 20 0 17.1g 3.9g 165208 R 100.0 6.3 19:53.95 python
225646 hurwa 20 0 17.1g 3.9g 159492 R 100.0 6.3 19:55.39 python
225651 hurwa 20 0 17.1g 3.9g 159556 R 100.0 6.3 19:55.35 python
225652 hurwa 20 0 17.1g 3.9g 163248 R 100.0 6.2 19:49.93 python
225654 hurwa 20 0 17.1g 3.9g 160776 R 100.0 6.3 19:55.04 python
225655 hurwa 20 0 17.1g 3.9g 165136 R 100.0 6.3 19:48.64 python
225641 hurwa 20 0 17.1g 3.9g 165784 R 99.7 6.3 19:50.15 python
225644 hurwa 20 0 17.1g 3.9g 164532 R 99.7 6.3 19:47.17 python
Here is also the log of the training:

Please cite the following paper when using nnUNet:
Fabian Isensee, Paul F. Jäger, Simon A. A. Kohl, Jens Petersen, Klaus H. Maier-Hein "Automated Design of Deep Learning Methods for Biomedical Image Segmentation" arXiv preprint arXiv:1904.08128 (2020).
If you have questions or suggestions, feel free to open an issue at https://github.com/MIC-DKFZ/nnUNet
Fold : 3
###############################################
I am running the following nnUNet: 3d_fullres
My trainer class is: <class 'nnunet.training.network_training.nnUNetTrainerV2.nnUNetTrainerV2'>
For that I will be using the following configuration:
num_classes: 2
modalities: {0: 'CT'}
use_mask_for_norm OrderedDict([(0, False)])
keep_only_largest_region None
min_region_size_per_class None
min_size_per_class None
normalization_schemes OrderedDict([(0, 'CT')])
stages...
stage: 0 {'batch_size': 2, 'num_pool_per_axis': [4, 6, 4], 'patch_size': array([ 96, 256, 96]), 'median_patient_size_in_voxels': array([142, 467, 142]), 'current_spacing': array([3.17953509, 2.16575081, 3.17953509]), 'original_spacing': array([0.88085902, 0.60000002, 0.88085902]), 'do_dummy_2D_data_aug': False, 'pool_op_kernel_sizes': [[2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [1, 2, 1], [1, 2, 1]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]}
stage: 1 {'batch_size': 2, 'num_pool_per_axis': [4, 6, 4], 'patch_size': array([ 96, 256, 96]), 'median_patient_size_in_voxels': array([ 512, 1684, 512]), 'current_spacing': array([0.88085902, 0.60000002, 0.88085902]), 'original_spacing': array([0.88085902, 0.60000002, 0.88085902]), 'do_dummy_2D_data_aug': False, 'pool_op_kernel_sizes': [[2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [1, 2, 1], [1, 2, 1]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]}
I am using stage 1 from these plans
I am using batch dice + CE loss
I am using data from this folder: /home/hurwa/nnUnetBase/preprocessed/Task207_Femur_Tibia/nnUNetData_plans_v2.1
###############################################
loading dataset
loading all case properties
unpacking dataset
done
2022-06-22 13:46:32.017083: lr: 0.01
using pin_memory on device 0
using pin_memory on device 0
2022-06-22 13:46:45.825761: Unable to plot network architecture:
2022-06-22 13:46:45.825931: No module named 'hiddenlayer'
2022-06-22 13:46:45.825973: printing the network instead:
2022-06-22 13:46:45.826014: Generic_UNet(...)
[full architecture printout condensed for readability: the standard nnU-Net Generic_UNet for this plan, with 7 conv_blocks_context stages of Conv3d/InstanceNorm3d/LeakyReLU blocks (32 to 64 to 128 to 256 to 320 channels), 6 mirrored conv_blocks_localization stages, 6 ConvTranspose3d upsampling layers (tu), and 6 deep-supervision seg_outputs heads (Conv3d, 3 output channels each)]
2022-06-22 13:46:45.829525:
2022-06-22 13:46:45.829684: epoch: 0 cur time 0, cur batch 1/50
Looking forward to your advice.