frankkramer-lab / aucmedi

a framework for Automated Classification of Medical Images
https://frankkramer-lab.github.io/aucmedi/
GNU General Public License v3.0

AutoML indicates training for 10 Epochs but then trains for 500 when reaching 10 #211

Closed adhusch closed 1 year ago

adhusch commented 1 year ago

Hi,

I am doing some experiments with the AutoML CLI, which is really nice :) The training data are 3D NIfTI volumes.

I've just observed that it initially indicates it will train for 10 epochs (which seems appropriate for the simple test case, arguably even too much), but when it reaches the 10th epoch, it continues training and goes for 500 epochs:

Epoch 1/10
126/126 [==============================] - ETA: 0s - loss: 0.3482 - auc: 0.8843 - f1_score: 0.7215     
Epoch 1: val_loss improved from inf to 0.05907, saving model to /home/users/ahusch/aucmedi.data/output/train_aucmedi_automl_mriclass3176948/model.best_loss.hdf5
126/126 [==============================] - 2469s 19s/step - loss: 0.3482 - auc: 0.8843 - f1_score: 0.7215 - val_loss: 0.0591 - val_auc: 0.9924 - val_f1_score: 0.9474 - lr: 1.0000e-04
.
.
.
Epoch 10/10
126/126 [==============================] - ETA: 0s - loss: 0.0257 - auc: 0.9981 - f1_score: 0.9772     
Epoch 10: val_loss did not improve from 0.01386
126/126 [==============================] - 2187s 17s/step - loss: 0.0257 - auc: 0.9981 - f1_score: 0.9772 - val_loss: 0.0184 - val_auc: 0.9988 - val_f1_score: 0.9888 - lr: 1.0000e-04
Epoch 11/500
126/126 [==============================] - ETA: 0s - loss: 0.1045 - auc: 0.9906 - f1_score: 0.9397 
.
.
.

The 500 is correct according to the defaults and the .json; however, the initially displayed 10 is confusing.

muellerdo commented 1 year ago

Hello @adhusch,

thanks for the kind words, always happy to hear that AUCMEDI is useful for other researchers! :)

I understand the confusion. Sadly, to my knowledge, this cannot be changed with the TensorFlow backend.

The reason for this is that AUCMEDI AutoML runs two training processes for transfer learning: 1) a ‘shallow-tuning’ phase and 2) a ‘fine-tuning’ phase.

If I may quote from my dissertation:

For the shallow-tuning phase, the neural network model starts an initial training process based on weights from a pre-fitted model. For this initial training process, all layers except for the classifier are frozen, a high learning rate is selected (for example 1E-04), and the model is fitted for a small number of epochs (commonly 5-15 epochs) [11, 207, 209]. The concept of shallow-tuning is that the model classifier can adapt the fixed architecture weights to the task. After this initial adaptation phase, the architecture weights are unfrozen and the second training process is started with regular hyperparameters but a smaller learning rate than for the shallow-tuning phase (for example 1E-05) [11, 207, 209]. In this phase, the complete neural network model fine-tunes all weights for the task to obtain optimal performance.
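
For illustration, a minimal sketch of such a two-phase schedule in plain Keras follows. The 2D ResNet50, learning rates, and epoch counts are stand-ins chosen here for brevity, not AUCMEDI's actual internals, and the fit() calls are commented out because train_ds / val_ds are placeholder datasets:

import tensorflow as tf

# Phase 1 (shallow-tuning): freeze the pre-fitted backbone so that only
# the newly attached classifier head is trained.
base = tf.keras.applications.ResNet50(include_top=False, pooling="avg",
                                      input_shape=(224, 224, 3))
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation="softmax"),  # task-specific classifier
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),  # high learning rate
              loss="categorical_crossentropy")
# model.fit(train_ds, validation_data=val_ds, epochs=10)

# Phase 2 (fine-tuning): unfreeze all weights and recompile with a smaller
# learning rate; recompiling is required for the trainable change to apply.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy")
# model.fit(train_ds, validation_data=val_ds, epochs=500, initial_epoch=10)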

To achieve this in TensorFlow, it is sadly required to run two separate training processes, which results in the 0-10 and 10-500 epoch displays. Don't worry, the training will probably not last 500 epochs; it will automatically stop once no further validation loss improvement is observed (in order to avoid overfitting).
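
The epoch display itself is plain Keras behaviour and can be reproduced with a toy example: each fit() call restarts its own counter, and initial_epoch only shifts it. The synthetic data, tiny model, and patience value below are illustrative, not AUCMEDI's configuration:

import numpy as np
import tensorflow as tf

# Tiny synthetic setup just to demonstrate the epoch counters.
x = np.random.rand(32, 8).astype("float32")
y = np.random.randint(0, 2, size=(32, 1)).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

stopper = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=15,
                                           restore_best_weights=True)

# First fit(): Keras prints "Epoch 1/10" .. "Epoch 10/10".
model.fit(x, y, validation_split=0.25, epochs=10)

# Second fit(): initial_epoch shifts the counter, so Keras prints
# "Epoch 11/500" even though this is a fresh training process; the
# EarlyStopping callback ends it long before epoch 500.
model.fit(x, y, validation_split=0.25, epochs=500,
          initial_epoch=10, callbacks=[stopper])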

For simple & quick training runs, you can, however, decrease the number of epochs as an argument. For example: aucmedi training --epochs 25

Hope that I was able to provide some insights.

Best Regards, Dominik

adhusch commented 1 year ago

Hi @muellerdo ,

thank you very much for the detailed explanations, that's very much appreciated. :)

I know the concept of freezing / head-only training for transfer learning, but I hadn't understood from the docs that the AutoML CLI applies this. It's a pity that this isn't otherwise possible in TF; in fast.ai this is handled very nicely with freeze/unfreeze. Anyway.

Might it be reasonable to have two epoch parameters then? One for the classifier-head training and one for the full training of the whole network, including the initial layers. That way, the user could even choose to skip the classifier-only phase by setting "epochs_classifier_head=0" (or, the other way around, skip the training of the full network via "epochs_full_net=0"; for example, for my real-life toy problem, the "semi-random" encoding from the frozen pre-trained network is already good enough to reach AUC ≥ .99 by training only the classifier head for a few epochs). A rough sketch of what I mean follows below.
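
Hypothetical sketch of the suggested interface; the parameter names epochs_classifier_head and epochs_full_net are my suggestion, not an existing AUCMEDI option, and model/base/train_ds/val_ds are assumed Keras objects:

import tensorflow as tf

def two_phase_fit(model, base, train_ds, val_ds,
                  epochs_classifier_head=10, epochs_full_net=500):
    # A value of 0 skips the corresponding phase entirely.
    if epochs_classifier_head > 0:
        base.trainable = False  # train only the classifier head
        model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                      loss="categorical_crossentropy")
        model.fit(train_ds, validation_data=val_ds,
                  epochs=epochs_classifier_head)
    if epochs_full_net > 0:
        base.trainable = True  # fine-tune the whole network
        model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                      loss="categorical_crossentropy")
        model.fit(train_ds, validation_data=val_ds,
                  epochs=epochs_classifier_head + epochs_full_net,
                  initial_epoch=epochs_classifier_head)
    return model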

Issue could be closed then. Thanks again!

P.S.: I might have a few more small questions; what's your preferred workflow for that? Opening issues, or just dropping a mail, for example?