This is the official implementation of FITS. Please run the scripts in scripts\FITS for results. Scripts without _best are for the ablation study and the parameter grid search. Scripts with _best are for multiple runs with the optimal parameters.
See updates here: Update
Wanna see something beyond FITS? Check:
"Beyond Trend and Periodicity: Guiding Time Series Forecasting with Textual Cues" Paper Code Dataset
We've identified a significant bug in our code, originally found in Informer (AAAI 2021 Best Paper), thanks to Luke Nicholas Darlow from the University of Edinburgh. This issue has implications for a broad spectrum of research on time series forecasting, including but not limited to:
Efforts are underway to correct this bug, and we will update our arXiv submission and this repository with the revised results. A bug fix will also be released to help the community address this issue in their own work.
The bug stems from an incorrect implementation in the data loader. Specifically, the test dataloader uses drop_last=True, which may exclude a significant portion of the test data, particularly with large batch sizes, leading to unfair model comparisons.
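As a rough illustration of why large batch sizes make this worse, the sketch below counts how many samples are skipped when the final incomplete batch is dropped. The helper name `dropped_by_drop_last` and the test-set size of 1000 are hypothetical and are not taken from this repository.

```python
def dropped_by_drop_last(num_samples: int, batch_size: int) -> int:
    """Samples skipped when the final incomplete batch is discarded."""
    # With drop_last=True only full batches are kept, so the remainder is lost.
    return num_samples % batch_size


num_test_samples = 1000  # hypothetical test-set size, for illustration only

for batch_size in (32, 256, 512):
    dropped = dropped_by_drop_last(num_test_samples, batch_size)
    share = dropped / num_test_samples
    print(f"batch_size={batch_size}: {dropped} samples never evaluated ({share:.1%})")
```

Because the number of skipped samples depends on the batch size, two models tested with different batch sizes are effectively scored on different test sets.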
To fix this issue in codebases that use the LTSF-Linear architecture:
In data_factory.py within the data_provider folder (usually on line 19), change:
if flag == 'test':
    shuffle_flag = False
    drop_last = True
    batch_size = args.batch_size
    freq = args.freq
To:
if flag == 'test':
    shuffle_flag = False
    drop_last = False  # was True
    batch_size = args.batch_size
    freq = args.freq
In your experiment script (e.g., ./exp/exp_main.py), modify the following (around line 290):
From:
preds = np.array(preds)
trues = np.array(trues)
inputx = np.array(inputx)  # this line may be absent in some codebases; that is fine
To:
preds = np.concatenate(preds, axis=0)
trues = np.concatenate(trues, axis=0)
inputx = np.concatenate(inputx, axis=0)  # only needed if the corresponding line exists in your codebase
If you skip this step, testing will raise an error because the last batch has a different size along dimension 0 (batch_size), so the per-batch arrays cannot be stacked into one array. This may be why so many codebases drop the last batch. Concatenating the batches along axis 0 (batch_size) solves the problem.
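The effect of the change can be checked with a small NumPy sketch. The shapes below (batch sizes 4, 4 and 1, prediction length 3, 2 channels) are hypothetical and only stand in for the per-batch outputs collected during testing.

```python
import numpy as np

pred_len, channels = 3, 2  # hypothetical output shape per sample

# Per-batch predictions where the last batch is smaller (drop_last=False).
batch_sizes = [4, 4, 1]
preds = [np.zeros((b, pred_len, channels)) for b in batch_sizes]

# Concatenating along axis 0 works even though the last batch is smaller.
merged = np.concatenate(preds, axis=0)
print(merged.shape)  # (9, 3, 2)

# When all batches have the same size, concatenation gives the same result
# as stacking with np.array and then flattening the batch dimension.
full = [np.arange(b * pred_len * channels, dtype=float).reshape(b, pred_len, channels)
        for b in (4, 4)]
via_array = np.array(full).reshape(-1, pred_len, channels)
via_concat = np.concatenate(full, axis=0)
same = np.array_equal(via_array, via_concat)
print(same)  # True
```

Calling np.array on the ragged list instead either fails or produces an object array, depending on the NumPy version, which is the dimension error described above.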
Run the officially provided scripts!
The best result is in bold and the second best is in italic. The results are reported in terms of MSE. An earlier version of this section listed preliminary results while we were rerunning the parameter search, ablation study and multiple runs. The following are our final results, which we have reported in the ICLR final version.
Model | ETTh1-96 | ETTh1-192 | ETTh1-336 | ETTh1-720 | ETTh2-96 | ETTh2-192 | ETTh2-336 | ETTh2-720 | ETTm1-96 | ETTm1-192 | ETTm1-336 | ETTm1-720 | ETTm2-96 | ETTm2-192 | ETTm2-336 | ETTm2-720 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PatchTST | 0.385 | 0.413 | 0.440 | 0.456 | 0.274 | 0.338 | 0.367 | 0.391 | 0.292 | 0.330 | 0.365 | 0.419 | 0.163 | 0.219 | 0.276 | 0.368 |
DLinear | 0.384 | 0.443 | 0.446 | 0.504 | 0.282 | 0.350 | 0.414 | 0.588 | 0.301 | 0.335 | 0.371 | 0.426 | 0.171 | 0.237 | 0.294 | 0.426 |
FEDformer | 0.375 | 0.427 | 0.459 | 0.484 | 0.340 | 0.433 | 0.508 | 0.480 | 0.362 | 0.393 | 0.442 | 0.483 | 0.189 | 0.256 | 0.326 | 0.437 |
TimesNet | 0.384 | 0.436 | 0.491 | 0.521 | 0.340 | 0.402 | 0.452 | 0.462 | 0.338 | 0.374 | 0.410 | 0.478 | 0.187 | 0.249 | 0.321 | 0.408 |
FITS | 0.372 | 0.404 | 0.427 | 0.424 | 0.271 | 0.331 | 0.354 | 0.377 | 0.303 | 0.337 | 0.366 | 0.415 | 0.162 | 0.216 | 0.268 | 0.348 |
IMP | 0.003 | 0.009 | 0.013 | 0.032 | 0.003 | 0.007 | 0.013 | 0.014 | -0.011 | -0.007 | -0.001 | 0.004 | 0.001 | 0.003 | 0.008 | 0.020 |
Model | Weather-96 | Weather-192 | Weather-336 | Weather-720 | Electricity-96 | Electricity-192 | Electricity-336 | Electricity-720 | Traffic-96 | Traffic-192 | Traffic-336 | Traffic-720 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PatchTST | 0.151 | 0.195 | 0.249 | 0.321 | 0.129 | 0.149 | 0.166 | 0.210 | 0.366 | 0.388 | 0.398 | 0.457 |
DLinear | 0.174 | 0.217 | 0.262 | 0.332 | 0.140 | 0.153 | 0.169 | 0.204 | 0.413 | 0.423 | 0.437 | 0.466 |
FEDformer | 0.246 | 0.292 | 0.378 | 0.447 | 0.188 | 0.197 | 0.212 | 0.244 | 0.573 | 0.611 | 0.621 | 0.630 |
TimesNet | 0.172 | 0.219 | 0.280 | 0.365 | 0.168 | 0.184 | 0.198 | 0.220 | 0.593 | 0.617 | 0.629 | 0.640 |
FITS | 0.143 | 0.186 | 0.236 | 0.307 | 0.134 | 0.149 | 0.165 | 0.203 | 0.385 | 0.397 | 0.410 | 0.448 |
IMP | 0.008 | 0.009 | 0.013 | 0.014 | -0.005 | 0.000 | 0.001 | 0.001 | -0.019 | -0.009 | -0.012 | 0.009 |
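The IMP row is not defined in the text. Its values appear consistent with the best baseline MSE minus the FITS MSE, so a positive value means FITS is better. The sketch below checks this reading on three columns copied from the table above; the formula itself is an assumption inferred from the numbers.

```python
# MSE values copied from three columns of the second results table.
baselines = {
    "PatchTST": {"Weather-96": 0.151, "Electricity-96": 0.129, "Traffic-720": 0.457},
    "DLinear": {"Weather-96": 0.174, "Electricity-96": 0.140, "Traffic-720": 0.466},
    "FEDformer": {"Weather-96": 0.246, "Electricity-96": 0.188, "Traffic-720": 0.630},
    "TimesNet": {"Weather-96": 0.172, "Electricity-96": 0.168, "Traffic-720": 0.640},
}
fits = {"Weather-96": 0.143, "Electricity-96": 0.134, "Traffic-720": 0.448}


def improvement(column: str) -> float:
    """Assumed definition of IMP: best (lowest) baseline MSE minus FITS MSE."""
    best_baseline = min(scores[column] for scores in baselines.values())
    return round(best_baseline - fits[column], 3)


imp = {column: improvement(column) for column in fits}
print(imp)
```

Under this reading, negative entries such as Electricity-96 mark settings where the best baseline has a lower MSE than FITS.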
The discovered bug predominantly impacts results on smaller datasets like ETTh1 and ETTh2. Interestingly, on other datasets certain models, such as PatchTST on ETTm1, perform better after the fix. FITS still maintains performance comparable to the state of the art.
We have uploaded the training logs for community review. Additionally, we've provided logs for other baseline models. It's important to note that these logs were generated using their respective official codebases, not the versions in this repository.
We have updated the training scripts of FITS.
For fairness, we have conducted baseline runs using freshly cloned codebases with the original hyperparameters. (Note: Avoid using versions from this repository.) TimesNet, which is unaffected by this issue, was not re-run and is mentioned here only for reference.
We encourage the community to apply the provided bug fix and re-conduct their experiments.
(A minor note: The only change we made in hyperparameters was reducing the learning rate for DLinear on ETTh2 from 0.05 to 0.005, resulting in improved outcomes.)
(A word of caution: Training PatchTST, particularly on datasets like traffic and electricity, can be extremely time-consuming.)
(We failed to reproduce the FiLM result, since it takes over 40GB of GPU memory and over 2 hours per epoch on an A800. Further, the provided scripts seem to have flaws: the 'modes1' parameter is set to 1032 for ETTh1 instead of the '32' used elsewhere, and train_epoch is 1 for ETTh2, which may degrade performance. We therefore exclude FiLM from the following analysis, since we cannot ensure a fair comparison.)
In previous anomaly detection works, the anomaly threshold is calculated on the test_set; see the affected code in Anomaly Transformer. This setting may violate the assumption that the test_set is unavailable before the model is deployed. It may cause information leakage and cherry-picked results on the test_set.
As claimed in the paper, FITS directly uses the validation set for threshold selection, as indicated in the code.
However, we still compare FITS with the results reported in the original papers, which may be affected by this information leakage. We encourage the community to reevaluate the affected methods for further reference. XD
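A minimal sketch of leakage-free threshold selection is shown below. It assumes the threshold is a percentile of anomaly scores, which is an illustrative choice and not necessarily the exact rule used by FITS or by the affected methods. The function name `select_threshold`, the `anomaly_ratio` value and the synthetic scores are hypothetical.

```python
import numpy as np


def select_threshold(scores: np.ndarray, anomaly_ratio: float) -> float:
    """Score value above which the top `anomaly_ratio` percent of scores lie."""
    return float(np.percentile(scores, 100.0 - anomaly_ratio))


rng = np.random.default_rng(0)
# Synthetic anomaly scores standing in for reconstruction errors.
val_scores = rng.normal(size=1000) ** 2
test_scores = rng.normal(size=500) ** 2

anomaly_ratio = 1.0  # hypothetical: flag roughly the top 1% of scores

# The threshold is chosen from validation scores only, so the test set
# stays unseen until the final evaluation.
threshold = select_threshold(val_scores, anomaly_ratio)
test_flags = test_scores > threshold

print(threshold, int(test_flags.sum()))
```

Computing the threshold from test_scores instead would tune the decision rule on the very data used for evaluation, which is the leakage described above.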
We thank Luke Darlow from the University of Edinburgh, who found the bug.