`save_model_frequency` not applied when not divisible by `eval_frequency`

thvasilo commented 1 month ago

In order to be able to save topk models during training we use their performance on the validation set to rank them.

Right now we rely on evaluation metrics calculated during "regular" evaluation runs that run every eval_frequency to do the ranking.

We implicitly assume that save_model_frequency should be divisible by eval_frequency to do so.

However, when that is not true, the user's choice of save_model_frequency is not respected, which could even lead to no models being saved (?).

We should change the behavior to always perform an eval at save_model_frequency steps iff topk_models_to_save is set by the user. If topk_models_to_save is not set, we should just save the model without eval.

I also observed that no evaluation is run in one case, (save_model_frequency 10k, eval_frequency 4k). We'll need to investigate what happens there. I'll run some quick confirmation experiments for this.

zhjwy9343 commented 4 weeks ago

The current logics of these 3 arguments are:

===== Command Line Interface =======

if set topk_models_to_save, we will check if save_model_frequency is divisible by eval_frequency. if not divisible, break and raise an error.
if not set topk_models_to_save, models will be saved in two cases:
- after each epoch, automatically saving;
- in the common multiple iteration of save_model_frequency and eval_frequency, e.g., 20k, 40k, 60k, ..., for 10k save_model_frequency and 4k eval_frequency. This is because current model saving depends on if model evaluation is done in the same iteration.
model evaluation is independent from model saving, i.e., when validation dataloader is not none, and not no_validation, always do validation according to eval_frequency.
the best model path is based on both evaluation and save model operations.

===== API =======

by default, topk_models_to_save==1, which means only save the best one model.
after each epoch, automatically save model or do evaluation if validation dataloader is not none, and not no_validation.
only in the common multiple iteration of save_model_frequency and eval_frequency, e.g., 20k, 40k, 60k, ..., for 10k save_model_frequency and 4k eval_frequency. This is because current model saving depends on if model evaluation is done in the same iteration.

zhjwy9343 commented 4 weeks ago

Based on the current logic mentioned above, the design decision would be:

Within an epoch, should the model saving be independent from model evaluation? Or,
Should force doing evaluation (validation dataloader is not none, and not no_validation) whenever save model?

I will vote for the 2nd option from a user perspective.

awslabs / graphstorm

`save_model_frequency` not applied when not divisible by `eval_frequency` #865