How to handle Missing values in TFT/NHITS

Nixtla / neuralforecast

Scalable and user-friendly neural forecasting algorithms.

https://nixtlaverse.nixtla.io/neuralforecast

Apache License 2.0


How to handle Missing values in TFT/NHITS #567

Open tinased95 opened 1 year ago

tinased95 commented 1 year ago

Hello, my time series data has some large gaps of missing values. Is it possible to simply remove the rows with missing values instead of cleaning the data with interpolation? Does the data need to be continuous, or is it okay for some rows to be absent? Thanks.

kdgutier commented 1 year ago

Hi @tinased95,

Regarding missing data:

Unfortunately, our methods assume that series are uniformly sampled, so they may have trouble with missing rows.

Some solutions:

1. Filter the data to include only the latest complete time interval available. Restricting the time frame is often beneficial to the forecasting model, because it lets the model focus on the most recent observations.
2. If you still want to use all the available information, you may be able to hack the data and treat unconnected time intervals as different series.
3. You can define the forecasting problem at a lower resolution by summing your dataset entries over a coarser time scale, for example transforming weekly data into monthly data. You can then use specialized techniques such as TopDown disaggregation to recover the finer granularity in a second stage.
4. Finally, you can wait for us to include our most recent SpectraNet research on the missing data topic in the library, or go directly to the dedicated SpectraNet repository.
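The lower-resolution option above can be sketched with pandas. This is only an illustration on made-up data: the `unique_id`/`ds`/`y` column names follow the convention used elsewhere in this thread, while the sample values and the `n_obs` column are assumptions added here. Counting observations per bucket helps spot months whose sum is incomplete because of missing weeks.

```python
import pandas as pd

# Hypothetical weekly series; two weeks (rows 4 and 5) are missing entirely.
weekly = pd.DataFrame({
    "unique_id": "series_1",
    "ds": pd.date_range("2023-01-01", periods=12, freq="W"),
    "y": range(1, 13),
}).drop(index=[4, 5])

# Sum the weekly entries into monthly buckets (labelled by month start)
# and record how many weekly observations went into each bucket.
monthly = (
    weekly.groupby(["unique_id", pd.Grouper(key="ds", freq="MS")])["y"]
    .agg(y="sum", n_obs="count")
    .reset_index()
)

print(monthly)
```

Buckets with a low `n_obs` are built from fewer weeks than their neighbours, so their sums are not directly comparable and may need rescaling or exclusion.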

Let us know.

tinased95 commented 1 year ago

Thank you so much for your helpful solutions. Could you please elaborate more on your second solution? How can I use the multiple unconnected series for the model?

kdgutier commented 1 year ago

The idea for unconnected time series would work as follows. Suppose you have a series $y[:t]=[y_{1},y_{2},...,y_{t}]$ made of two connected blocks, $y[:t_1]=[y_{1},y_{2},...,y_{t_1}]$ and $y[t_2:t]=[y_{t_2},y_{t_2+1},...,y_{t}]$, with missing values in between. You could hack the problem by treating these blocks as individual series in the panel and keeping only the predictions for $y[t_2:t]$. This might be a stretch of NeuralForecast's functionality; I would first try filtering or dig into SpectraNet.
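A minimal sketch of that splitting step, assuming a long-format dataframe with `unique_id`, `ds` and `y` columns. The helper `split_into_blocks` and the `_block` suffix are hypothetical names introduced here, not part of NeuralForecast.

```python
import pandas as pd

def split_into_blocks(df: pd.DataFrame, step: pd.Timedelta) -> pd.DataFrame:
    """Give every contiguous run of timestamps its own unique_id (hypothetical helper)."""
    df = df.sort_values(["unique_id", "ds"]).copy()
    # Time elapsed since the previous row of the same series (NaT for the first row).
    gap = df.groupby("unique_id")["ds"].diff()
    # A new block starts at the first row and wherever the gap is not exactly one step.
    new_block = (gap != step).astype(int)
    block_id = new_block.groupby(df["unique_id"]).cumsum() - 1
    df["unique_id"] = df["unique_id"] + "_block" + block_id.astype(str)
    return df

# Made-up daily series with a gap between 2023-01-05 and 2023-01-10.
dates = pd.date_range("2023-01-01", "2023-01-05").append(
    pd.date_range("2023-01-10", "2023-01-13")
)
df = pd.DataFrame({"unique_id": "A", "ds": dates, "y": range(len(dates))})

blocks = split_into_blocks(df, pd.Timedelta(days=1))
print(blocks.groupby("unique_id").size())
```

Each block then appears as a separate series in the panel, and only the forecasts for the last block of each original series would be kept. Blocks shorter than the model's input window plus horizon are probably too short to be useful and may need to be dropped.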

tinased95 commented 1 year ago

Thanks. I'm not sure how to pass these individual series to the model's fit() or predict() functions. Should they be treated as different targets, i.e. multivariate forecasting? In that case, though, the 'ds' time column would not be aligned between the series.

cchallu commented 1 year ago

Hi @tinased95. We added the possibility of specifying an available_mask in the input dataframe (as a new column, 1:available, 0:missing). You can impute missing values (so models can use them as inputs). The mask will weight the training loss to prevent learning imputed values.
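As a rough illustration of preparing such a dataframe, the sketch below restores the missing rows of a single series, flags them in an `available_mask` column, and imputes `y`. The data is made up, and linear interpolation is just one possible imputation choice assumed here; the comment above does not prescribe a method.

```python
import pandas as pd

# Made-up daily series with 2023-01-03 and 2023-01-04 missing.
df = pd.DataFrame({
    "unique_id": "A",
    "ds": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05", "2023-01-06"]),
    "y": [1.0, 2.0, 5.0, 6.0],
})

# Rebuild the regular daily grid so the series is uniformly sampled again.
full_index = pd.date_range(df["ds"].min(), df["ds"].max(), freq="D")
filled = df.set_index("ds").reindex(full_index).rename_axis("ds").reset_index()
filled["unique_id"] = "A"

# 1: value was observed, 0: value was missing and is imputed below.
filled["available_mask"] = filled["y"].notna().astype(int)

# Impute the gaps (linear interpolation is an assumption of this sketch).
filled["y"] = filled["y"].interpolate()

print(filled)
```

With several series, the same steps would be applied per `unique_id` before concatenating the results.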

tinased95 commented 1 year ago

Hi @cchallu, that's great. Thank you so much for letting me know.

omcandido commented 1 month ago

Hi, I was also wondering how missing rows are handled in window-sampling models.

I thought the idea was that each sample would contain n_past + n_horizon rows, and that values (past and future) from rows where available_mask==0 would not contribute to the loss. Looking at how _compute_weights() is used by the losses, this appears to be implemented simply by giving the affected target(s) a weight of 0.
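To make the weighting concrete, here is a generic NumPy sketch of a mask-weighted MAE. It is not the library's `_compute_weights()` implementation; the function name `masked_mae` and the normalisation by the number of available points are assumptions made for illustration.

```python
import numpy as np

def masked_mae(y: np.ndarray, y_hat: np.ndarray, mask: np.ndarray) -> float:
    """MAE in which targets with mask == 0 get zero weight (illustrative only)."""
    weights = mask.astype(float)
    total = weights.sum()
    if total == 0:
        # No available targets in this window: nothing to learn from.
        return 0.0
    return float((np.abs(y - y_hat) * weights).sum() / total)

y = np.array([1.0, 2.0, 0.0, 4.0])      # third value is an imputed placeholder
y_hat = np.array([1.5, 2.0, 9.0, 3.0])
mask = np.array([1, 1, 0, 1])           # 1: available, 0: missing

loss = masked_mae(y, y_hat, mask)
print(loss)
```

The large error on the masked position is ignored, so it contributes no gradient through the loss, although the placeholder value can still reach the model as an input.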

Is this common practice? I could not find any mention of similar approaches in any paper (maybe you can point me to a source where the sampling mechanism and loss calculation are explained in detail?)
Is anything done to minimise the update of weights associated with past targets/covariates from missing rows (e.g., a missing past data point is set to 0, but its value propagates to all labels, including those with a mask value of 1)?

Thanks