Nixtla / neuralforecast

Scalable and user friendly neural 🧠 forecasting algorithms.
https://nixtlaverse.nixtla.io/neuralforecast
Apache License 2.0

Intermittent data shortens predict_insample() date range #718

Open tg2k opened 1 year ago

tg2k commented 1 year ago

What happened + What you expected to happen

Intermittent data seems to cause issues with predict_insample(). If some intervals are missing all of their data, the problem becomes clear: the range of dates returned is shortened according to how densely populated the date range is.

A cursory reading of https://nixtla.github.io/neuralforecast/examples/intermittentdata.html suggests that this scenario should work, though other pages, such as https://nixtla.github.io/neuralforecast/examples/getting_started_complete.html#evaluate-the-models-performance, say otherwise. It may be helpful for the former page to note that the data must still be contiguous overall, even if some of it is sparse.
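As a quick sanity check, something like this can flag whether a series is balanced before fitting (a rough sketch; missing_dates is just an illustrative helper, not part of neuralforecast, and it assumes a single series at a month-end monthly frequency):

import pandas as pd

def missing_dates(df, freq='M'):
    """Return the expected timestamps absent from df['ds'] at the given frequency."""
    expected = pd.date_range(df['ds'].min(), df['ds'].max(), freq=freq)
    return expected.difference(pd.DatetimeIndex(df['ds'].unique()))

# An empty result means the series is balanced; anything else will
# shorten the predict_insample() date range as described below.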

The rest applies only if this is something to address in neuralforecast.

I believe the problems start with how self.last_dates is set:

            self.dataset, self.uids, self.last_dates, self.ds = self._prepare_fit(
                df=df, static_df=static_df, sort_df=sort_df
            )

There is no corresponding self.first_dates, which could otherwise be used to interpolate dates at the expected frequency (self.freq). When predict_insample() is called, it calls _insample_dates() with a len_series value, which _cv_dates() then uses to produce a date range via pd.date_range() anchored at the end date, with the number of periods based on the size of the data set rather than on its actual date span.
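To illustrate with the AirPassengers numbers from the script below (a simplified sketch of the effect, not the library's actual code):

import pandas as pd

last_date = pd.Timestamp('1960-12-31')  # what self.last_dates holds for the series

# A balanced monthly AirPassengers series has 144 rows; after dropping
# 2 of every 3 dates (as the reproduction script does), only 48 remain.
full_range = pd.date_range(end=last_date, periods=144, freq='M')
short_range = pd.date_range(end=last_date, periods=48, freq='M')

print(full_range[0])   # 1949-01-31, the true start of the series
print(short_range[0])  # 1957-01-31, a start date 8 years too late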

The _prepare_fit() also has a comment:

# TODO: uids, last_dates and ds should be properties of the dataset class. See github issue.

From the git blame, perhaps it's referring to https://github.com/Nixtla/neuralforecast/pull/348 / https://github.com/Nixtla/neuralforecast/pull/354?

One workaround is to convert the ds column from dates to ints. I've included code below that demonstrates the issue, with marked-off code blocks that both trigger it and work around it.

Versions / Dependencies

1.6.1

Reproduction script

# Code from https://nixtla.github.io/neuralforecast/examples/predictinsample.html

import pandas as pd

def reduce_long_dataframe(df, numerator, denominator):
    unique_ds_values = df['ds'].unique()
    remove_ds_values = []

    # Iterate through unique_ds_values, selecting every 'numerator' ds values out of 'denominator' to remove
    for i in range(0, len(unique_ds_values), denominator):
        remove_ds_values += list(unique_ds_values[i:i+numerator])

    # Keep only rows that don't have ds values in remove_ds_values
    reduced_df = df[~df['ds'].isin(remove_ds_values)]

    reduced_df = reduced_df.reset_index(drop=True)
    return reduced_df

from neuralforecast.utils import AirPassengersDF
Y_df = AirPassengersDF # Defined in neuralforecast.utils

Y_df['unique_id'] = Y_df['unique_id'].astype(str)
print(Y_df.head())

# comment out to keep the full set of dates (which will work as expected)
print(f"Datestamp range of Y_df before reduce_long_dataframe(): {Y_df['ds'].min()} to {Y_df['ds'].max()}")
Y_df = reduce_long_dataframe(Y_df, 2, 3)
print(f"Datestamp range of Y_df after reduce_long_dataframe(): {Y_df['ds'].min()} to {Y_df['ds'].max()}")

# workaround: replace Y_df['ds'] with a range of integers starting at 0
def replace_ds_with_contiguous_integers(df):
    unique_dates = sorted(df['ds'].unique())
    date_to_int_mapping = {date: idx for idx, date in enumerate(unique_dates)}
    df['ds'] = df['ds'].map(date_to_int_mapping)
    return df

# uncomment to work around intermittent handling issues with non-contiguous ds values
# Y_df = replace_ds_with_contiguous_integers(Y_df)  
# print(f"ds range of Y_df after replace_ds_with_contiguous_integers(): {Y_df['ds'].min()} to {Y_df['ds'].max()}")

from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS

horizon = 12

# Try different hyperparameters to improve accuracy.
models = [NHITS(h=horizon,                      # Forecast horizon
                input_size=2 * horizon,         # Length of input sequence
                max_steps=10,                   # Number of steps to train (reduced for quick debugging)
                n_freq_downsample=[2, 1, 1],    # Downsampling factors for each stack output
                mlp_units=3 * [[1024, 1024]])   # Number of units in each block
          ]
nf = NeuralForecast(models=models, freq='M')
nf.fit(df=Y_df, val_size=horizon)
print(f"Datestamp range of Y_df after nf.fit(): {Y_df['ds'].min()} to {Y_df['ds'].max()}")

Y_hat_insample = nf.predict_insample(step_size=horizon)

print(Y_hat_insample.head())
print(f"Datestamp range after predict_insample: {Y_hat_insample['ds'].min()} to {Y_hat_insample['ds'].max()}")
# print top 10 rows of Y_hat_insample for which y is not NaN/null
print(Y_hat_insample[Y_hat_insample['y'].notnull()].head(10))

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(Y_hat_insample['ds'], Y_hat_insample['y'], label='True', linestyle='--', marker='o')
plt.plot(Y_hat_insample['ds'], Y_hat_insample['NHITS'], label='Forecast')
plt.axvline(Y_hat_insample['ds'].iloc[-12], color='black', linestyle='--', label='Train-Test Split')
plt.xlabel('Timestamp [t]')
plt.ylabel('Monthly Passengers')
plt.grid()
plt.legend()
plt.show(block=True)

Issue Severity

Low: It annoys or frustrates me.

cchallu commented 1 year ago

Hi @tg2k! The input dataframe is expected to be balanced: it must have a complete set of observations (rows) between the first and last dates of each time series at the given frequency. Even with this workaround, the model will not be accurate, as it expects a complete input with information for all dates.

The best solution is to balance your data beforehand by completing the missing rows. You can impute the missing values with an imputation method appropriate for your task. Alternatively, you can fill them with 0s and add an available_mask column to your dataframe, containing 1 for available rows and 0 for rows with missing data. Models will use the available_mask column to avoid using the missing rows (filled with 0s for the input and masked completely in the training loss). A rough sketch of this approach is below.
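This is only an illustration, not a neuralforecast API: balance_series is a hand-rolled helper that assumes a long-format dataframe with unique_id, ds, and y columns at a month-end monthly frequency.

import pandas as pd

def balance_series(df, freq='M'):
    """Reindex each series to a complete date range, filling gaps with
    y=0 and flagging them with available_mask=0."""
    out = []
    for uid, g in df.groupby('unique_id'):
        full = pd.date_range(g['ds'].min(), g['ds'].max(), freq=freq)
        g = g.set_index('ds').reindex(full)
        g.index.name = 'ds'
        g['unique_id'] = uid
        g['available_mask'] = g['y'].notna().astype(int)  # 1 = observed, 0 = filled
        g['y'] = g['y'].fillna(0)
        out.append(g.reset_index())
    return pd.concat(out, ignore_index=True)

# Then fit as usual; masked rows are excluded from the training loss:
# nf.fit(df=balance_series(Y_df), val_size=horizon)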

Let me know if this helps!

eye4got commented 2 months ago

Following the second, very helpful recommendation from @cchallu fixed the exception below, and I really appreciate their help. But I think the exception could be communicated more clearly, and perhaps some kind of check and warning could be fed back to the user during training? I found it frustrating that the error was so vague and occurred well after I had fit the model.

[screenshot of the exception traceback]

yenhochen commented 1 month ago

I am getting the same bug.