amazon-science / chronos-forecasting

Chronos: Pretrained (Language) Models for Probabilistic Time Series Forecasting
https://arxiv.org/abs/2403.07815
Apache License 2.0

Is [holding out the last H observations of each time series as a test set] implemented in the fine-tuning code? #198

Open Xiao-congxi opened 1 week ago

Xiao-congxi commented 1 week ago

Hi Chronos team, I noticed in the paper that you used the last H observations of each time series as a held-out test set (all models are judged by the accuracy of their forecasts on this held-out set, which no model had access to during training).

However, I haven't found the implementation of this part in train.py. Could you point me to the part of the code that implements it?

lostella commented 6 days ago

@Xiao-congxi this is indeed nowhere in the code. The split happened prior to applying TSMixup (see Algorithm 1 in Appendix A, https://arxiv.org/pdf/2403.07815): the last H observations of each series of each dataset (note that H is dataset dependent) were sliced out before calling the TSMixup routine. The mixed data was stored (you can find it at https://huggingface.co/datasets/autogluon/chronos_datasets/tree/main/training_corpus) and used directly with train.py, which therefore requires no further slicing of the data for training the model.
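
Schematically, the slicing step looked something like this. This is a sketch, not the actual preprocessing code (which is not in this repo); the helper name and the assumed data layout (a dict of dataset name to a list of 1-D arrays) are illustrative:

```python
import numpy as np

def slice_out_test_windows(datasets: dict[str, list[np.ndarray]],
                           horizons: dict[str, int]):
    """Remove the last H observations from every series of every dataset.

    `datasets` maps a dataset name to its list of univariate series;
    `horizons` maps the same name to that dataset's horizon H.
    Returns (train, test) dicts with the same structure.
    """
    train, test = {}, {}
    for name, series_list in datasets.items():
        H = horizons[name]  # H is dataset dependent
        train[name] = [s[:-H] for s in series_list]
        test[name] = [s[-H:] for s in series_list]
    return train, test

# Only the `train` portion is then passed to the TSMixup routine
# (Algorithm 1, Appendix A); `test` is never seen during training.
```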

Xiao-congxi commented 6 days ago

Thank you for your prompt response! This approach is indeed convenient. With that in mind, I have another question. In your paper, when conducting the in-domain evaluation on a specific dataset in Benchmark I (for example, electricity_hourly), since it has already been mixed into the pre-training data, does that mean there is no need for additional fine-tuning, and you can make predictions directly?

lostella commented 6 days ago

For the in-domain evaluation (datasets in Benchmark I) we did not do any fine-tuning; we predicted directly, as you say.

One could do additional fine-tuning, and potentially see improvements, but we did not run that experiment. We only tested fine-tuning on Benchmark II datasets (see Figure 6).
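
In case it helps, zero-shot prediction with a pretrained checkpoint looks roughly like this (a sketch using the chronos package; the checkpoint choice, horizon, and dummy context below are placeholders):

```python
import torch
from chronos import ChronosPipeline

# Load a pretrained Chronos checkpoint; no fine-tuning involved.
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Stand-in for one series (e.g. from electricity_hourly) with its
# last H observations already held out; replace with real data.
context = torch.randn(512)

# Sample forecasts for the held-out horizon directly.
forecast = pipeline.predict(context, prediction_length=24, num_samples=20)
print(forecast.shape)  # (1, 20, 24): series, samples, horizon
```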

Xiao-congxi commented 6 days ago

I understand. Thanks again for your help!