ndebs opened this issue 2 months ago
Hello,
My team and I have encountered an unusual issue regarding the convergence dynamics of nnUNet. Initially, we trained nnUNet for 2000 epochs and observed the following learning curve:
Given the trend of the metrics (in green), we concluded that the network had not fully converged, so we retrained the model from scratch (using the same dataset and train/val splits), this time specifying 3000 epochs. The resulting curve was as follows:
Once again, based on the metric curve, we decided to extend the training duration. We then retrained the network (again on the same data) for 5000 epochs from scratch and obtained this progress curve:
It appears that even after 5000 epochs, the training has not fully converged. Moreover, the larger the specified epoch budget, the longer the network takes to converge: the overall metric reached 0.85 after 2000 epochs in the first experiment, after 3000 epochs in the second, and after 5000 epochs in the third.
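For context, nnU-Net's default trainer decays the learning rate with a polynomial schedule that depends on the fraction epoch / num_epochs, so the whole schedule stretches with the epoch budget and the LR at any given epoch is higher in a longer run. A quick back-of-the-envelope check (a sketch, assuming the default poly schedule with exponent 0.9 and initial LR 1e-2; values rounded) illustrates this:

```python
# Default nnU-Net poly schedule: lr(e) = initial_lr * (1 - e / max_epochs) ** 0.9
initial_lr, exponent = 1e-2, 0.9

def poly_lr(epoch: int, max_epochs: int) -> float:
    return initial_lr * (1 - epoch / max_epochs) ** exponent

for max_epochs in (2000, 3000, 5000):
    print(f"LR at epoch 1500 of {max_epochs}: {poly_lr(1500, max_epochs):.5f}")
# LR at epoch 1500 of 2000: 0.00287
# LR at epoch 1500 of 3000: 0.00536
# LR at epoch 1500 of 5000: 0.00725
```

At epoch 1500, the 5000-epoch run is still training with roughly 2.5x the learning rate of the 2000-epoch run, which would produce exactly the delayed-convergence pattern described above.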
Have you encountered this kind of behavior before? Do you have any recommendations? We are considering lowering the initial learning rate (e.g., 10^-3 instead of 10^-2).
Many thanks in advance for your help!
Noëlie

@ndebs As you said, I think this is likely due to the nnUNet learning rate scheduler. Reducing the learning rate is certainly an option to see if it converges better, or you can override the default learning rate scheduler provided by nnUNet so that the LR decays faster.
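A minimal sketch of the override route, assuming nnU-Net v2 (module paths and the `PolyLRScheduler` signature are taken from `nnunetv2` and may differ in other versions; the class name `nnUNetTrainer_FastDecay`, the `exponent=2.0`, and the `1e-3` LR are illustrative choices, not recommendations):

```python
import torch

from nnunetv2.training.lr_scheduler.polylr import PolyLRScheduler
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer


class nnUNetTrainer_FastDecay(nnUNetTrainer):
    """Stock nnUNetTrainer with a lower starting LR and a steeper poly decay."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.initial_lr = 1e-3  # default is 1e-2

    def configure_optimizers(self):
        # Same SGD setup as the stock trainer...
        optimizer = torch.optim.SGD(self.network.parameters(), self.initial_lr,
                                    weight_decay=self.weight_decay,
                                    momentum=0.99, nesterov=True)
        # ...but with exponent > 0.9 the polynomial decay shrinks the LR much
        # earlier in training instead of staying near initial_lr for most of the run
        lr_scheduler = PolyLRScheduler(optimizer, self.initial_lr, self.num_epochs,
                                       exponent=2.0)
        return optimizer, lr_scheduler
```

Placed somewhere importable under `nnunetv2/training/nnUNetTrainer/` (e.g. the `variants/` subfolder), it can then be selected with the `-tr` flag: `nnUNetv2_train DATASET 3d_fullres FOLD -tr nnUNetTrainer_FastDecay`.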