cchallu / n-hits


different settings (nhits vs. autoformer) #5

Closed ResearcherLifeng closed 2 years ago

ResearcherLifeng commented 2 years ago

Hi! Thank you for sharing your source code.

I have some questions about the settings of NHITS and Autoformer.

I think there might be some unfair comparisons in your Table 2, because you compare against Autoformer's reported results while using different settings for the NHITS model.

Q1: For the length of the history window, you use 5*args.horizon for NHITS, but for Autoformer you use a shorter length (1*args.horizon). Here args.horizon=96.

When using a history length of 5*96, your reported result for ECL-96 is 0.147 (I can reproduce this by re-running your released code). Autoformer's reported result is 0.201 (using only a 96-length window).

I tried some experiments and got the following results:

Using the same setting for NHITS (a 96-length window), the ECL-96 result is MSE 0.1902 / MAE 0.2739.

It seems the length of the history window is an important hyperparameter; a sketch of how it changes the training windows follows below.

By the way, using a 5*96-length window for the N-BEATS model, I get a much better ECL-96 result: MSE 0.1340 / MAE 0.2311.
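
To make the comparison concrete, here is a minimal sketch (plain NumPy, not the repo's dataloader; `make_windows` and `input_multiplier` are names I made up) of how the history length changes the (input, target) pairs a model sees:

```python
import numpy as np

def make_windows(series, horizon, input_multiplier):
    """Slice a 1-D series into (history, target) pairs.

    input_multiplier controls the history length as a multiple of the
    forecast horizon, e.g. 1 -> 96-step inputs, 5 -> 480-step inputs.
    """
    input_size = input_multiplier * horizon
    xs, ys = [], []
    for t in range(input_size, len(series) - horizon + 1):
        xs.append(series[t - input_size:t])   # history window
        ys.append(series[t:t + horizon])      # forecast target
    return np.stack(xs), np.stack(ys)

series = np.sin(np.linspace(0, 100, 5000))    # toy signal standing in for ECL
x_short, y_short = make_windows(series, horizon=96, input_multiplier=1)
x_long,  y_long  = make_windows(series, horizon=96, input_multiplier=5)
print(x_short.shape, x_long.shape)            # (..., 96) vs (..., 480)
```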

Q2: For the train/val/test split, you use masks (train_mask_df, valid_mask_df, test_mask_df) to indicate the train/valid/test portions. However, in Autoformer's setting (see https://github.com/thuml/Autoformer) the borders are:

border1s = [0, num_train - self.seq_len, len(df_raw) - num_test - self.seq_len]
border2s = [num_train, num_train + num_vali, len(df_raw)]

Here, it seems you did not use the overlapping part, as in [num_train - self.seq_len, num_train + num_vali]. A small counting sketch follows below.
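
To illustrate what I mean about the borders, here is a small counting sketch (my own code, not from either repo; the split sizes are made up) showing how the border definition determines the number of test windows:

```python
def autoformer_borders(n_rows, num_train, num_vali, num_test, seq_len):
    # Border indices as quoted above: the validation and test segments start
    # seq_len rows early, so the first window of each segment can take its
    # full-length input from the previous segment.
    border1s = [0, num_train - seq_len, n_rows - num_test - seq_len]
    border2s = [num_train, num_train + num_vali, n_rows]
    return border1s, border2s

def n_eval_windows(border1, border2, seq_len, pred_len):
    # Number of sliding (input, target) windows that fit inside one segment.
    return (border2 - border1) - seq_len - pred_len + 1

# Hypothetical sizes just to illustrate the counting; not the real ECL split.
b1, b2 = autoformer_borders(n_rows=10000, num_train=7000, num_vali=1000,
                            num_test=2000, seq_len=96, pred_len=96)
print(n_eval_windows(b1[2], b2[2], seq_len=96, pred_len=96))  # 1905 test windows
```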

So my question here is whether the same number of test samples is used for evaluation. If not, I think it might be unfair to directly compare against Autoformer's results in your Table 2.

cchallu commented 2 years ago

Hi @ResearcherLifeng, thanks for your interest in our paper. Regarding your questions:

Q1. For the Transformer-based baselines we report the results from the Autoformer paper; these results can be validated with our code, since we use their best reported hyperparameters. As shown in Table 4 of the Autoformer paper, the best input size they found is 96, and a longer history window actually underperformed. This is a drawback of Autoformer, since it cannot exploit the additional information in a longer window to produce more accurate forecasts.

Q2. Our pipeline operates differently but produces the same train/validation/test splits. We don't need to define separate borders for inputs and outputs; our dataloader automatically includes the overlapping part as input. Yes, the same number of test samples is used.
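
As a rough sketch of the idea (not the repo's actual dataloader, and with the same made-up split sizes as above), a mask-based split yields the same test windows: the target must fall inside the test mask, while the input window is free to reach back into earlier data, which is exactly the overlap the borders encode.

```python
import numpy as np

def windows_from_mask(series, test_mask, input_size, horizon):
    # Keep a window only if its target lies entirely in the test region;
    # the history part may overlap the preceding (validation) data.
    xs, ys = [], []
    for t in range(input_size, len(series) - horizon + 1):
        if test_mask[t:t + horizon].all():          # target fully in test set
            xs.append(series[t - input_size:t])     # input may reach back
            ys.append(series[t:t + horizon])
    return np.stack(xs), np.stack(ys)

n_rows, num_test, horizon, input_size = 10000, 2000, 96, 96
series = np.arange(n_rows, dtype=float)
test_mask = np.zeros(n_rows, dtype=bool)
test_mask[-num_test:] = True
x, y = windows_from_mask(series, test_mask, input_size, horizon)
print(len(x))   # 1905 windows, matching the border-based count above
```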

ResearcherLifeng commented 2 years ago

Thank you for the quick reply! I have no additional questions now.