Closed tanz63 closed 1 day ago
@tanz63 These configurations are more or less the defaults from torch and transformers. The 200K-step budget was set based on visual inspection of the loss curve (although we did see improved performance with longer training, as shown in the hyperparameter analysis).
Regarding fine-tuning: it is a dataset-agnostic proof of concept, and we did not tune that setup extensively. One could likely obtain significantly better fine-tuning performance by carefully validating the hyperparameters.
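For concreteness, the schedule quoted below (learning rate annealed linearly from 0.001 to 0 over the step budget) can be sketched as a plain function; this is a minimal illustration using the numbers quoted in this thread, not code taken from the Chronos repository. In practice the same behavior is what `transformers.get_linear_schedule_with_warmup` with zero warmup steps, or `torch.optim.lr_scheduler.LambdaLR`, would give you:

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 1e-3) -> float:
    """Learning rate annealed linearly from base_lr at step 0 to 0 at total_steps."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

# Pretraining: 200K steps, lr 0.001 -> 0
print(linear_lr(0, 200_000))        # 0.001 (initial lr)
print(linear_lr(100_000, 200_000))  # 0.0005 (halfway through training)

# Dataset-agnostic fine-tuning: same anneal over 1000 steps
print(linear_lr(1_000, 1_000))      # 0.0 (fully annealed)
```

The fixed step budget replaces epoch-based stopping: the lr reaching 0 at the final step effectively freezes the model there, regardless of how many epochs over the training corpus that budget corresponds to.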
In the Chronos paper, training runs for a fixed number of steps ("The models were optimized for 200K steps using the AdamW optimizer with a weight decay of 0.01. The learning rate was annealed linearly from its initial value of 0.001 to 0 over the training steps"). What is the logic behind this configuration? Since there is no downstream-task fine-tuning like its LLM counterparts, how is it supposed to avoid overfitting? Is there a heuristic at work, e.g. the step count corresponding to 1 or 2 epochs? Also, fine-tuning is implemented in a dataset-agnostic fashion with an initial learning rate of 0.001, annealed linearly to 0 over 1000 steps. What are the insights behind that?