amazon-science / chronos-forecasting

Chronos: Pretrained (Language) Models for Probabilistic Time Series Forecasting
https://arxiv.org/abs/2403.07815
Apache License 2.0

Training and fine tuning protocols #89

Closed: tanz63 closed this issue 1 day ago

tanz63 commented 4 weeks ago

In the Chronos paper, training is run for a fixed number of steps ("The models were optimized for 200K steps using the AdamW optimizer with a weight decay of 0.01. The learning rate was annealed linearly from its initial value of 0.001 to 0 over the training steps"). What is the logic behind this configuration? Since there is no downstream-task fine-tuning as with the LLM counterparts, how is overfitting avoided? Are there heuristics involved, e.g. the step count corresponding to roughly 1 or 2 epochs? Also, fine-tuning is implemented in a dataset-agnostic fashion with an initial learning rate of 0.001, annealed linearly to 0 over 1000 steps. What are the insights behind that choice?
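For concreteness, the quoted setup maps onto a small PyTorch sketch like the one below. The model, batch, and loss here are placeholders (the real training uses T5-style models on tokenized time series); only the optimizer and the linear schedule reflect the quoted description.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Placeholder model: the actual Chronos models are T5-based; a tiny linear
# layer stands in so the sketch is self-contained and runnable.
model = torch.nn.Linear(512, 512)

total_steps = 200_000  # fixed step budget quoted from the paper

# AdamW with initial learning rate 1e-3 and weight decay 0.01, as quoted above.
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Linear annealing: the multiplier goes from 1.0 at step 0 to 0.0 at total_steps.
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps))

for step in range(total_steps):
    x = torch.randn(8, 512)          # dummy batch in place of the real data loader
    loss = model(x).pow(2).mean()    # dummy loss in place of the Chronos cross-entropy
    loss.backward()
    optimizer.step()                 # AdamW update
    scheduler.step()                 # decay the learning rate linearly toward 0
    optimizer.zero_grad()
```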

abdulfatir commented 4 weeks ago

@tanz63 these configurations are more or less the defaults from torch and transformers. The 200K steps were set based on visual inspection of the loss curve (although we did see improved performance with longer training, as shown in the hyperparameter analysis).

Regarding fine-tuning, it's a dataset-agnostic proof of concept. We did not deliberate much over these settings. One could potentially obtain significantly better fine-tuning performance by carefully validating the hyperparameters.
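For what it's worth, that dataset-agnostic schedule corresponds roughly to the following `transformers` `TrainingArguments` sketch. The output directory, batch size, and weight decay are illustrative assumptions, not values confirmed in this thread; only the step count, initial learning rate, and linear decay come from the discussion above.

```python
from transformers import TrainingArguments

finetune_args = TrainingArguments(
    output_dir="./chronos-finetuned",   # placeholder path
    max_steps=1000,                     # dataset-agnostic step budget
    learning_rate=1e-3,                 # initial learning rate
    lr_scheduler_type="linear",         # annealed linearly to 0 over max_steps
    warmup_steps=0,
    weight_decay=0.01,                  # assumed carried over from pretraining
    per_device_train_batch_size=32,     # illustrative; not specified here
    optim="adamw_torch",
)
```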