MIC-DKFZ / nnUNet


Unexpected Convergence Dynamics in nnUNet with Increasing Epochs #2474

Open ndebs opened 2 months ago

ndebs commented 2 months ago

Hello,

My team and I have encountered an unusual issue regarding the convergence dynamics of nnUNet. Initially, we trained nnUNet for 2000 epochs and observed the following learning curve:

[image: progress_2000epochs — learning curve of the 2000-epoch run]

Given the trend of the metrics (in green), we concluded that the network had not fully converged, so we retrained the model from scratch (using the same dataset and train/val splits), this time specifying 3000 epochs. The resulting curve was as follows:

[image: progress_3000epochs — learning curve of the 3000-epoch run]

Once again, based on the metric curve, we decided to extend the training duration. We then retrained the network (again on the same data) for 5000 epochs from scratch and obtained this progress curve:

[image: progress_5000epochs — learning curve of the 5000-epoch run]

It appears that even after 5000 epochs, training has not fully converged. Moreover, the larger the specified epoch count, the longer the network takes to reach the same performance: the overall metric reached 0.85 at around epoch 2000 in the first experiment, around epoch 3000 in the second, and around epoch 5000 in the third.
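
For context, our understanding is that nnU-Net anneals the learning rate with a polynomial schedule whose horizon is the total number of epochs, which would explain why the curves stretch. A minimal sketch of that rule (assuming the default exponent of 0.9 used by nnU-Net's `PolyLRScheduler`):

```python
# Sketch of nnU-Net's polynomial LR decay (assuming the default exponent of 0.9).
# Because the schedule is relative to max_epochs, a longer run keeps the LR high for longer.
def poly_lr(epoch: int, max_epochs: int, initial_lr: float = 1e-2, exponent: float = 0.9) -> float:
    return initial_lr * (1 - epoch / max_epochs) ** exponent

# At epoch 1999 the LR is nearly annealed in a 2000-epoch run (~1e-5),
# but still above 6e-3 in a 5000-epoch run:
for max_epochs in (2000, 3000, 5000):
    print(f"max_epochs={max_epochs}: lr at epoch 1999 = {poly_lr(1999, max_epochs):.2e}")
```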

Have you encountered this kind of behavior before? Do you have any recommendations? We are considering lowering the initial learning rate (e.g., 10^-3 instead of 10^-2).

Many thanks in advance for your help!

Noëlie

25benjaminli commented 2 months ago

@ndebs I think this is likely due to nnU-Net's learning rate scheduler: the default polynomial decay is scheduled over the total number of epochs, so with a larger epoch count the LR stays high for longer and the whole curve shifts to the right. Lowering the initial learning rate, as you suggested, is certainly worth trying to see if it converges better; alternatively, you can override the default scheduler so the LR decays faster.
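
If you want to try either route, a custom trainer covering both is straightforward. Here is a minimal, untested sketch assuming nnU-Net v2's `nnUNetTrainer` API; the class name `MyTrainer_lowLR` is just a placeholder, and the `__init__` signature may differ slightly between versions:

```python
# Minimal sketch of a custom trainer (assuming the nnU-Net v2 trainer API).
# Place it where nnU-Net discovers trainers, then select it with:
#   nnUNetv2_train DATASET CONFIG FOLD -tr MyTrainer_lowLR
import torch

from nnunetv2.training.lr_scheduler.polylr import PolyLRScheduler
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer


class MyTrainer_lowLR(nnUNetTrainer):  # hypothetical name, for illustration
    def __init__(self, plans, configuration, fold, dataset_json,
                 unpack_dataset=True, device=torch.device('cuda')):
        super().__init__(plans, configuration, fold, dataset_json, unpack_dataset, device)
        self.initial_lr = 1e-3  # option 1: lower initial LR (default is 1e-2)

    def configure_optimizers(self):
        # option 2: decay the LR faster by raising the exponent of the poly schedule
        optimizer = torch.optim.SGD(self.network.parameters(), self.initial_lr,
                                    weight_decay=self.weight_decay,
                                    momentum=0.99, nesterov=True)
        lr_scheduler = PolyLRScheduler(optimizer, self.initial_lr, self.num_epochs,
                                       exponent=2.0)
        return optimizer, lr_scheduler
```

Either change on its own may be enough; I would try the faster decay first, since it keeps the early high-LR phase that nnU-Net relies on while pulling the convergence point forward.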