ecmwf / anemoi-training

Apache License 2.0

Rollout Scheduling #145

Open HCookie opened 2 weeks ago

HCookie commented 2 weeks ago

Our current rollout implementation is tightly focused on sequential, per-epoch increments. It would be good to generalise this and provide schedulers to control rollout.

Work was done in aifs-mono to enable this. I think it can be generalised here to give it wider applicability.

Features

Below is a list of features and requirements as I see them

Improvements

Set up the config at the beginning of training with rollout increment be

Questions

What other features may be needed?

mchantry commented 2 weeks ago

What does static mean? Constant at, say, 2? That is already supported. And what does dynamic selection mean?

HCookie commented 2 weeks ago

@mchantry Updated the description

mc4117 commented 2 weeks ago

I like the idea of dynamic selection of increments. I was also wondering whether this could be done by steps as well as by epochs. For example: at step 1000, use rollout 2; at step 10000, use rollout 10. I think this would also avoid the problem of wanting to change the rollout within an epoch, since you could define the schedule by steps instead.
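A minimal sketch of what a step-based schedule like this could look like. The class name `StepRolloutSchedule`, its `milestones` argument, and the default rollout of 1 before the first milestone are assumptions for illustration only, not existing anemoi-training API:

```python
from bisect import bisect_right


class StepRolloutSchedule:
    """Hypothetical helper: maps a global training step to a rollout length."""

    def __init__(self, milestones: dict[int, int], default: int = 1) -> None:
        # milestones maps a step to the rollout used from that step onward.
        self._steps = sorted(milestones)
        self._rollouts = [milestones[s] for s in self._steps]
        # Assumed rollout before the first milestone is reached.
        self._default = default

    def rollout_at(self, step: int) -> int:
        # Count the milestones that are <= step; the last of them applies.
        idx = bisect_right(self._steps, step)
        return self._default if idx == 0 else self._rollouts[idx - 1]


# The example from the comment: rollout 2 from step 1000, rollout 10 from step 10000.
schedule = StepRolloutSchedule({1000: 2, 10000: 10})
```

Because the lookup depends only on the global step, the rollout can change in the middle of an epoch without any special handling.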

jakob-schloer commented 2 weeks ago

I agree with @mc4117. Some models perform better when trained for longer at a rollout of 2 and for only a few iterations at longer rollouts. I wonder, however, whether that could be solved by limiting the number of batches per epoch and providing a list of rollout lengths, e.g. [2,2,2,2,2,2,2,2,3,4,5,6,...].
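As a rough sketch of the list-based alternative, assuming 0-based epochs and assuming the last value is held once the list runs out (the function name `rollout_for_epoch` is hypothetical, and the list is the one above without its elided tail):

```python
def rollout_for_epoch(epoch: int, rollouts: list[int]) -> int:
    """Return the rollout length for a 0-based epoch index."""
    if not rollouts:
        raise ValueError("rollouts must contain at least one value")
    # Assumption: past the end of the list, keep using the final rollout.
    return rollouts[min(epoch, len(rollouts) - 1)]


# Eight short epochs at rollout 2, then one epoch each at 3, 4, 5 and 6.
ROLLOUTS = [2] * 8 + [3, 4, 5, 6]
```

If training is driven by PyTorch Lightning, the `limit_train_batches` argument of `Trainer` is one way to cap the number of batches per epoch so that each list entry covers a short, fixed amount of training.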

anaprietonem commented 1 week ago

I like @mc4117's suggestion of supporting rollout by steps. I think it would make things easier if, in the future, we want to automate training so that the 6-hour stage and the rollout stage run one after the other.

HCookie commented 1 week ago

Moving to a discussion (to try it out) https://github.com/ecmwf/anemoi-training/discussions/148