MIC-DKFZ / nnUNet


Unexpected Convergence Dynamics in nnUNet with Increasing Epochs #2474

Open ndebs opened 2 months ago

ndebs commented 2 months ago

Hello,

My team and I have encountered an unusual issue regarding the convergence dynamics of nnUNet. Initially, we trained nnUNet for 2000 epochs and observed the following learning curve:

[image: progress_2000epochs — learning curve of the 2000-epoch run]

Given the trend of the metrics (in green), we concluded that the network had not fully converged, so we retrained the model from scratch (using the same dataset and train/val splits), this time specifying 3000 epochs. The resulting curve was as follows:

[image: progress_3000epochs — learning curve of the 3000-epoch run]

Once again, based on the metric curve, we decided to extend the training duration. We then retrained the network (again on the same data) for 5000 epochs from scratch and obtained this progress curve:

[image: progress_5000epochs — learning curve of the 5000-epoch run]

It appears that even after 5000 epochs, training has not fully converged. Moreover, the larger the specified epoch count, the longer the network takes to reach the same performance: the overall metric reached 0.85 at around epoch 2000 in the first experiment, around epoch 3000 in the second, and around epoch 5000 in the third.
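
For context, our understanding is that nnU-Net anneals the learning rate with a polynomial schedule whose horizon is the total number of epochs, which would explain why the curves stretch. A minimal sketch of that rule (assuming the default exponent of 0.9 used by nnU-Net's `PolyLRScheduler`):

```python
# Sketch of nnU-Net's polynomial LR decay (assuming the default exponent of 0.9).
# Because the schedule is relative to max_epochs, a longer run keeps the LR high for longer.
def poly_lr(epoch: int, max_epochs: int, initial_lr: float = 1e-2, exponent: float = 0.9) -> float:
    return initial_lr * (1 - epoch / max_epochs) ** exponent

# At epoch 1999 the LR is nearly annealed in a 2000-epoch run (~1e-5),
# but still above 6e-3 in a 5000-epoch run:
for max_epochs in (2000, 3000, 5000):
    print(f"max_epochs={max_epochs}: lr at epoch 1999 = {poly_lr(1999, max_epochs):.2e}")
```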

Have you encountered this kind of behavior before? Do you have any recommendations? We are considering lowering the initial learning rate (e.g., 10^-3 instead of 10^-2).

Many thanks in advance for your help!

Noëlie

25benjaminli commented 2 months ago

@ndebs I think this is likely due to nnU-Net's learning rate scheduler: the default polynomial decay is scheduled over the total number of epochs, so with a larger epoch count the LR stays high for longer and the whole curve shifts to the right. Lowering the initial learning rate, as you suggested, is certainly worth trying to see if it converges better; alternatively, you can override the default scheduler so the LR decays faster.
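
If you want to try either route, a custom trainer covering both is straightforward. Here is a minimal, untested sketch assuming nnU-Net v2's `nnUNetTrainer` API; the class name `MyTrainer_lowLR` is just a placeholder, and the `__init__` signature may differ slightly between versions:

```python
# Minimal sketch of a custom trainer (assuming the nnU-Net v2 trainer API).
# Place it where nnU-Net discovers trainers, then select it with:
#   nnUNetv2_train DATASET CONFIG FOLD -tr MyTrainer_lowLR
import torch

from nnunetv2.training.lr_scheduler.polylr import PolyLRScheduler
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer


class MyTrainer_lowLR(nnUNetTrainer):  # hypothetical name, for illustration
    def __init__(self, plans, configuration, fold, dataset_json,
                 unpack_dataset=True, device=torch.device('cuda')):
        super().__init__(plans, configuration, fold, dataset_json, unpack_dataset, device)
        self.initial_lr = 1e-3  # option 1: lower initial LR (default is 1e-2)

    def configure_optimizers(self):
        # option 2: decay the LR faster by raising the exponent of the poly schedule
        optimizer = torch.optim.SGD(self.network.parameters(), self.initial_lr,
                                    weight_decay=self.weight_decay,
                                    momentum=0.99, nesterov=True)
        lr_scheduler = PolyLRScheduler(optimizer, self.initial_lr, self.num_epochs,
                                       exponent=2.0)
        return optimizer, lr_scheduler
```

Either change on its own may be enough; I would try the faster decay first, since it keeps the early high-LR phase that nnU-Net relies on while pulling the convergence point forward.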