QData / spacetimeformer

Multivariate Time Series Forecasting with efficient Transformers. Code for the paper "Long-Range Transformers for Dynamic Spatiotemporal Forecasting."
https://arxiv.org/abs/2109.12218
MIT License

Average of different prediction horizons as a metric? #66

Open · santoshatchi opened this issue 1 year ago

santoshatchi commented 1 year ago

Hello Authors,

Could you please clarify the use of the average over different prediction horizons as a benchmarking metric? Why was it used, and how is its validity justified?

I am working on a similar project and trying to report values at different horizons. My model does not reach values close to those reported by SOTA (top 5) models like yours. Could you please share the intuition for reporting the average rather than the individual horizons?

Thanks, Santosh

jakegrigsby commented 1 year ago

IIRC that's a convention inherited from Informer and the follow-up works that have come out since this repo's initial release and before its more recent versions. The accuracy at individual timesteps into the future can be arbitrary and hard to interpret. 1-step predictions are too easy, but distant predictions can be very difficult given a fixed-length context window that may be too short. In highly periodic domains some distant horizons can also be easy (such as 24 hours ahead in a dataset with clear daily periodicity, like weather forecasting). So reporting a metric for every horizon takes a lot of explaining, requires large tables, and can be misleading. Averaging gives a better sense of the model's performance over the entire duration we care about.
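To make the difference concrete, here's a rough numpy sketch of the two reporting styles (the shapes and values are made up for illustration, not pulled from this repo's evaluation code):

```python
import numpy as np

# preds / targets: (n_samples, horizon, n_series) forecasts and ground truth.
# Shapes and values here are illustrative only.
rng = np.random.default_rng(0)
preds = rng.normal(size=(64, 24, 7))
targets = rng.normal(size=(64, 24, 7))

# One error per step into the future -> a 24-entry table to report and explain.
per_horizon_mse = ((preds - targets) ** 2).mean(axis=(0, 2))  # shape (24,)

# The convention discussed above: a single number averaged over the whole horizon.
avg_mse = per_horizon_mse.mean()

print(np.round(per_horizon_mse, 3))
print(f"horizon-averaged MSE: {avg_mse:.3f}")
```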

At a few points during this project I hacked together logging of the accuracy at each individual timestep as a sanity check. In my experience you can expect roughly linearly increasing error as you predict further into the future.

As far as replicating the results on these datasets in your own project, double-check that you aren't counting missing datapoints in the metrics. This can make a huge difference and is something a lot of the literature (and early versions of this codebase) gets wrong.
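A minimal sketch of what that masking can look like, assuming missing targets are marked with NaN (the actual handling in this codebase may differ):

```python
import numpy as np

def masked_mae(preds: np.ndarray, targets: np.ndarray) -> float:
    """MAE computed only where the target is actually observed (non-NaN)."""
    mask = ~np.isnan(targets)
    return float(np.abs(preds[mask] - targets[mask]).mean())

def naive_mae(preds: np.ndarray, targets: np.ndarray, fill: float = 0.0) -> float:
    """The common mistake: missing targets filled with a constant and counted as real errors."""
    filled = np.where(np.isnan(targets), fill, targets)
    return float(np.abs(preds - filled).mean())

# Toy example: a target series with gaps. The naive version penalizes the model
# for "missing" values that were never real observations.
targets = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
preds = np.array([1.1, 2.0, 2.9, 4.0, 5.2])
print(masked_mae(preds, targets))  # ~0.13, only the 3 observed points
print(naive_mae(preds, targets))   # much larger, dominated by the filled zeros
```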

steve3nto commented 8 months ago

I agree with Jake: averaging over the whole prediction horizon makes sense so that single numbers can be compared as a metric. It is a pity, though, that different benchmarks use different metrics. For example, see here for PEMS-Bay: https://paperswithcode.com/sota/traffic-prediction-on-pems-bay

They report RMSE (I guess this is averaged over the whole horizon) and MAE @ 12 steps (a single prediction 12 steps into the future).
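For concreteness, the two conventions can be computed from the same prediction tensor roughly like this (plain numpy with made-up shapes, not the benchmark's official evaluation code):

```python
import numpy as np

# Illustrative only: preds / targets as (n_samples, 12, n_sensors) for a
# 12-step traffic forecast.
rng = np.random.default_rng(1)
preds = rng.normal(size=(128, 12, 325))
targets = rng.normal(size=(128, 12, 325))

# "Single number" style: RMSE averaged over the full 12-step horizon.
rmse_avg = np.sqrt(((preds - targets) ** 2).mean())

# "@ 12 steps" style: MAE of the final (12th) step only.
mae_at_step_12 = np.abs(preds[:, -1] - targets[:, -1]).mean()

print(f"RMSE (horizon-averaged): {rmse_avg:.3f}")
print(f"MAE @ 12 steps:          {mae_at_step_12:.3f}")
```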

It would be good to have more standardized metrics. In the paper there is no RMSE for PEMS-Bay; there is MAE, MSE, and MAPE, but unfortunately PapersWithCode does not report those.

This is not a question, just a comment, sorry for the spam! 😁

jakegrigsby commented 8 months ago

Yeah, the traffic datasets / literature are the main example where reporting multiple horizons is the default. The longest horizons are 12 timesteps, so this is feasible. Once you get longer than that, it stops making sense to report arbitrary intervals in tables, in my opinion. It would be interesting if the convention for reporting forecasting results were a plot of error over forecast duration for each dataset. That wasn't necessary at the time (2021), but I think this is probably what I would do if I were to redo this project today...
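Something along these lines, as a rough matplotlib sketch (the error values here are synthetic placeholders, not results from the paper):

```python
import numpy as np
import matplotlib.pyplot as plt

# per_horizon_mae would come from a per-timestep evaluation like the snippets
# above; the values below are synthetic, just to show the plot itself.
horizon = np.arange(1, 25)
per_horizon_mae = 0.30 + 0.02 * horizon + 0.05 * np.random.default_rng(2).random(24)

plt.figure(figsize=(5, 3))
plt.plot(horizon, per_horizon_mae, marker="o")
plt.xlabel("forecast step (timesteps ahead)")
plt.ylabel("MAE")
plt.title("Error vs. forecast duration")
plt.tight_layout()
plt.savefig("error_vs_horizon.png")
```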