MIC-DKFZ / nnUNet

Apache License 2.0

Much slower training times after upgrading from version 1.5 to 1.6 #443

Closed AdamPBerkley closed 2 years ago

AdamPBerkley commented 3 years ago

Hello Professor Isensee,

I'm trying to use your nnU-Net model with fewer modalities than the four provided in the BraTS dataset, and then evaluate how these trained models perform on hospital data from outside BraTS.

I'm running into trouble after upgrading from version 1.5 to 1.6. Previously, each epoch trained in approximately 10 minutes. However, after upgrading to the newer version that no longer uses apex, training has become far slower: epochs now take somewhere around 10 hours. I initially assumed the newer setup wasn't detecting my GPU, but that may not be the case, as nvidia-smi shows the GPU is in use. I'm unsure how to proceed with debugging this issue.

EDIT: I may be wrong about it running on the GPU. When testing locally I used the nvidia-smi command and saw about 95% GPU usage, but the extended run times are on my academic institution's computing cluster (where my setup should be mostly identical, except that the hardware is vastly better). I double-checked the run times locally, and each epoch still completes in 10 minutes after the upgrade. I'm a little confused, because the output and training logs look mostly similar, and I've used PyTorch in my virtual environment on the cluster to detect the GPUs and verified it can use them. Any advice on debugging this and getting the GPUs to work would be greatly appreciated. Thanks for your time!
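For reference, a minimal sketch of the kind of GPU-visibility check described above (the device indices and printed messages are illustrative, not nnU-Net output):

```python
import torch

# Confirm PyTorch was built with CUDA support and a GPU is visible
# from inside the same environment/job that launches training.
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # List every visible device; on a cluster this list is often
    # restricted by CUDA_VISIBLE_DEVICES set in the job script.
    for i in range(torch.cuda.device_count()):
        print(f"Device {i}: {torch.cuda.get_device_name(i)}")
else:
    print("No CUDA device visible from this process.")
```

Running this inside the cluster job itself (not just in an interactive shell) is what rules out environment differences between the login node and the compute node.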

FabianIsensee commented 3 years ago

Hi, honestly I don't know how I can help you. All I can say is that one epoch should take about 2-3 minutes, not 10 and certainly not 10 hours! It could be that you have a CPU bottleneck and that this is why it takes 10 minutes. But 10 hours is very wrong and sounds like training is running on the CPU. Best, Fabian
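One rough way to separate "running on the CPU" from "CPU-side bottleneck" is to time a small synthetic forward/backward pass on both devices. This is only a sketch; the layer sizes and tensor shapes below are illustrative and not nnU-Net's actual architecture:

```python
import time
import torch
import torch.nn as nn

def time_forward_backward(device, iters=10):
    """Time a few training-style iterations of a small 3D conv block."""
    model = nn.Sequential(
        nn.Conv3d(4, 32, 3, padding=1),
        nn.InstanceNorm3d(32),
        nn.LeakyReLU(inplace=True),
        nn.Conv3d(32, 4, 3, padding=1),
    ).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    x = torch.randn(2, 4, 64, 64, 64, device=device)

    # Warm-up so one-time CUDA initialisation does not skew the measurement.
    for _ in range(3):
        opt.zero_grad()
        model(x).mean().backward()
        opt.step()
    if device.type == "cuda":
        torch.cuda.synchronize()

    start = time.time()
    for _ in range(iters):
        opt.zero_grad()
        model(x).mean().backward()
        opt.step()
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / iters

print("CPU  s/iter:", time_forward_backward(torch.device("cpu")))
if torch.cuda.is_available():
    print("CUDA s/iter:", time_forward_backward(torch.device("cuda")))
```

If the CUDA timing is not substantially faster than the CPU timing, something in the installation (driver, CUDA/PyTorch mismatch) is likely wrong; if it is much faster but real epochs are still slow, the bottleneck is probably outside the GPU (data loading, CPU, RAM).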

Barnonewdm commented 3 years ago

I'm running into the same situation during training: it is too slow to finish even one epoch. With the previous version I was getting roughly 10 min/epoch. I am still debugging the cause.

Barnonewdm commented 3 years ago

> I'm running into the same situation during training: it is too slow to finish even one epoch. With the previous version I was getting roughly 10 min/epoch. I am still debugging the cause.

This is a follow-up. I have also tried running the training in a container based on the basic PyTorch docker image with nnU-Net installed. The speed is still very low; one epoch does not finish within my patience (around 10 hours).

FabianIsensee commented 3 years ago

Are you certain that you are actually training on GPU?

Barnonewdm commented 3 years ago

> Are you certain that you are actually training on GPU?

100%. The GPU is actually in use, as shown in the screenshot below. I am still trying to track down the bottleneck; any suggestion is welcome. Screenshot from 2021-04-08 14-19-18

FabianIsensee commented 3 years ago

It looks to me like your system is not configured correctly. The GPU is not actually doing much work, as you can see from the low power consumption (during normal training the power consumption of the GPU will be close to its maximum). I don't know what could be causing this, though. Maybe install an older version of PyTorch (or one built against an older CUDA version)? I don't know if the TitanX is still supported. Best, Fabian
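If a single screenshot is hard to interpret, utilization and power draw can also be sampled over time while training runs. A small polling sketch, assuming nvidia-smi is on PATH (the query field names are standard nvidia-smi fields, the 2 s interval and 30 samples are arbitrary):

```python
import subprocess
import time

# Poll GPU utilization and power draw every 2 s while training runs in
# another terminal; sustained low numbers point to a data-loading or
# CPU/RAM bottleneck rather than a slow GPU.
QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,power.draw,memory.used",
         "--format=csv,noheader"]

for _ in range(30):
    print(subprocess.check_output(QUERY, text=True).strip())
    time.sleep(2)
```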

Barnonewdm commented 3 years ago

Something must be wrong. The log file is attached below. training_log_2021_4_8_10_42_53.txt

I have also tried an older version of CUDA; it gets stuck in the same way.

FabianIsensee commented 3 years ago

It would be more important to have the stdout and stderr from the training, because they are not fully written to the log file.

Barnonewdm commented 3 years ago

> It would be more important to have the stdout and stderr from the training, because they are not fully written to the log file.

As far as I can tell, stdout and stderr are now included in the attached file. training_log_2021_4_8_10_42_53.txt

FabianIsensee commented 3 years ago

This is just the nnU-Net log file. Please send what your terminal is displaying. Best, Fabian

Barnonewdm commented 3 years ago

> This is just the nnU-Net log file. Please send what your terminal is displaying. Best, Fabian

It is exactly the combined terminal output and log information.

After further checking, RAM turned out to be one of the main bottlenecks. After killing all other memory-hungry processes, each epoch now finishes within 10 minutes. Our workstation has only 16 GB of RAM.
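For anyone hitting the same wall: checking free RAM and swap before launching training is cheap. A sketch using psutil (the 8 GiB warning threshold is arbitrary, not an official nnU-Net requirement):

```python
import os
import psutil

mem = psutil.virtual_memory()
swap = psutil.swap_memory()

print(f"Total RAM : {mem.total / 2**30:.1f} GiB")
print(f"Available : {mem.available / 2**30:.1f} GiB")
print(f"Swap used : {swap.used / 2**30:.1f} GiB")
print(f"CPU cores : {os.cpu_count()}")

# If most of the RAM is already taken by other processes, the
# multiprocess data augmentation starts swapping and epoch times
# explode, which matches the behaviour described above.
if mem.available < 8 * 2**30:
    print("Warning: little free RAM - close memory-hungry processes first.")
```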

FabianIsensee commented 3 years ago

Alright. Glad to hear it works now. The TitanX is unfortunately not very fast because it doesn't support tensor cores. You can find reference numbers for epoch times in https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/expected_epoch_times.md
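As a side note, whether a card has tensor cores can be read off its compute capability (tensor cores were introduced with Volta, compute capability 7.0). A quick check, assuming PyTorch is installed and device 0 is the training GPU:

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    # Tensor cores (which accelerate mixed-precision training) first
    # appeared on Volta-class GPUs, i.e. compute capability >= 7.0.
    has_tensor_cores = (major, minor) >= (7, 0)
    print(f"{name}: compute capability {major}.{minor}, "
          f"tensor cores: {'yes' if has_tensor_cores else 'no'}")
```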