Closed Guan-t7 closed 2 years ago
Hi @Guan-t7, thanks for the interest in our paper! Regarding your comments:
We follow N-BEATS' training strategy, which samples random windows at each step and doesn't necessarily cover the entire training set. We have used this strategy extensively, and covering the entire training set doesn't improve performance in most cases.
Traffic has more than 800 time series; are you feeding the model the history of all of them? If so, the input vector is larger than 70k entries, and a feed-forward network is not the best architecture for learning interactions between that many inputs.
Yes, we realized there was a mistake with the scheduling, in particular on the datasets with more time series, such as ECL and Traffic. We plan to fix this soon.
Yes, 1000 iterations is conservative for a dataset like Traffic. However, we wanted to keep the search space constant across the 6 datasets of the paper. Performance can be further improved by tuning hyperparameters for each dataset separately.
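For reference, the random-window sampling strategy discussed above can be sketched roughly as follows. This is a minimal sketch for a single univariate series, with hypothetical `input_size`/`output_size` parameter names; it is not the repo's actual implementation:

```python
import numpy as np

def sample_windows(series, n_windows, input_size, output_size, rng):
    """Randomly sample (input, target) windows from one univariate series.

    N-BEATS-style sampler sketch: window start points are drawn uniformly at
    random, so a short training run need not cover the whole training span.
    """
    total = input_size + output_size
    starts = rng.integers(0, len(series) - total + 1, size=n_windows)
    inputs = np.stack([series[s : s + input_size] for s in starts])
    targets = np.stack([series[s + input_size : s + total] for s in starts])
    return inputs, targets

rng = np.random.default_rng(0)
series = np.arange(1000, dtype=float)
x, y = sample_windows(series, n_windows=256, input_size=96, output_size=24, rng=rng)
# x has shape (256, 96); y has shape (256, 24)
```

With 256 windows per step and 1000 steps, coverage of long series is probabilistic rather than exhaustive, which is the trade-off being discussed.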
Cool. Thanks for your reply!
Update: Additional questions

1. Your data pipeline seems quite non-traditional to me. At each training step you randomly sample `256` windows from one time series as model input, and a training epoch finishes after sampling from each series once. I understand that it's a univariate model, but I don't see why you leave it to probability to cover the entire training span. I tried an ablation feeding the data in multivariate fashion, i.e. inputting a history of all variables, rolling windows along the time dimension, and learning `(N, S) -> (N, T)` where `N == num_series`. The result was bad on the `traffic` dataset. Could you help explain?
2. The paper says that the lr is *halved three times across the training procedure*. However, you mis-configured your pl_module. The default lr-schedule interval is `epoch` (ref. https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#configure-optimizers), which means that you actually kept training with the initial lr until the end.
3. You chose 1000 training steps, which is conservative considering your data feeding. For example, each time series is covered at most twice on the `traffic` dataset. Training for more steps slightly improved over your reported results on the `traffic` dataset (at least).

I hope these could help improve your model (of course the metric presented is already impressive enough :).
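The interval mismatch in the scheduler point can be illustrated with a small simulation (a sketch with hypothetical milestone values, not the repo's actual configuration): a `MultiStepLR`-style schedule whose milestones are expressed in optimizer steps never fires when the scheduler is stepped only once per epoch.

```python
def lr_after(n_scheduler_steps, milestones, base_lr=1e-3, gamma=0.5):
    """LR produced by a MultiStepLR-like schedule after a given number of
    scheduler.step() calls: halve once per milestone passed."""
    passed = sum(1 for m in milestones if n_scheduler_steps >= m)
    return base_lr * gamma ** passed

total_steps = 1000            # optimizer steps in the run
n_epochs = 10                 # hypothetical: 100 steps per epoch
milestones = [250, 500, 750]  # hypothetical halving points, in *steps*

# Scheduler stepped once per optimizer step: all three halvings happen.
lr_per_step = lr_after(total_steps, milestones)   # 1e-3 * 0.5**3

# Scheduler stepped once per epoch (PyTorch Lightning's default interval):
# only 10 step() calls, so no milestone is reached and lr stays at 1e-3.
lr_per_epoch = lr_after(n_epochs, milestones)
```

In Lightning the fix is to return the scheduler from `configure_optimizers` as a dict with `"interval": "step"`, e.g. `{"optimizer": opt, "lr_scheduler": {"scheduler": sched, "interval": "step"}}`, so `step()` is called once per optimizer step.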
===============================================
Thank you for this amazing work. I found these typo and doc issues:

1. https://github.com/cchallu/n-hits/blob/4e929ed31e1d3ff5169b4aa0d3762a0040abb8db/src/models/nhits/nhits.py#L398-L405: `n_time_in` is actually the final lookback period.
2. https://github.com/cchallu/n-hits/blob/4e929ed31e1d3ff5169b4aa0d3762a0040abb8db/src/models/nhits/nhits.py#L248-L250: `n_layers` in `nhits_multivariate.py` should be `[ 3*[2] ]` rather than 9, since its elements are indexed across the 3 stacks.
3. `loss_hypar` should be an `int` like 7 or 24, judging from its context.
4. There is bypassed logic for exogenous variables in the nhits model. I wonder if it can be put to work now?
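A quick sketch of the `n_layers` point (with hypothetical names; the real config lives in `nhits_multivariate.py`): the hyperparameter is indexed per stack, so a bare scalar like 9 breaks, while a nested list of per-stack layer counts works.

```python
# Hypothetical illustration of a per-stack layer-count hyperparameter.
n_stacks = 3

# Suggested value: a list whose single element holds one count per stack.
n_layers = [3 * [2]]        # -> [[2, 2, 2]]
per_stack = n_layers[0]

# Each stack looks up its own layer count; a scalar 9 would fail here.
layer_counts = [per_stack[stack_id] for stack_id in range(n_stacks)]
```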