Nixtla / mlforecast

Scalable machine 🤖 learning for time series forecasting.
https://nixtlaverse.nixtla.io/mlforecast
Apache License 2.0

In predict Function, Found missing inputs in X_df. It should have one row per id and date for the complete forecasting horizon #242

Closed SyedKumailHussainNaqvi closed 11 months ago

SyedKumailHussainNaqvi commented 12 months ago

I am training an XGBoost model on my dataset for time series prediction. After fitting the model on the training set, the predict function fails on the validation set. Kindly guide me on how I can solve this.

Dataset info:

Active_Power | Current_Phase_Average | Weather_Temperature_Celsius | Weather_Relative_Humidity | Global_Horizontal_Radiation | Diffuse_Horizontal_Radiation | Wind_Speed | Wind_Direction
102.175270 | 142.188522 | 24.477514 | 23.652782 | 498.144226 | 46.486958 | 3.221892 | 205.768753
105.421097 | 147.543472 | 24.961168 | 23.067873 | 514.654907 | 45.322678 | 3.747602 | 122.779907
108.409271 | 152.497131 | 25.137936 | 22.755598 | 536.330322 | 49.347523 | 3.553297 | 157.523910
111.148140 | 157.164398 | 25.441204 | 22.623100 | 548.361572 | 46.074684 | 2.969267 | 109.330276
113.915314 | 161.903259 | 25.804924 | 22.194019 | 553.215027 | 45.705791 | 3.401344 | 142.917725
0.000000 | 6.104828 | 0.000000 | 4.234369 | 0.377419 | 0.425231 | 0.450622 | 0.000000
0.000000 | 6.103529 | 0.000000 | 4.234369 | 0.377419 | 0.425231 | 0.450622 | 0.000000
0.000000 | 6.100151 | 0.000000 | 4.234369 | 0.377419 | 0.425231 | 0.450622 | 0.000000
0.000000 | 6.101910 | 0.000000 | 4.234369 | 0.377419 | 0.425231 | 0.450622 | 0.000000
0.000000 | 6.096731 | 0.000000 | 4.234369 | 0.377419 | 0.425231 | 0.450622 | 0.000000

Code for training and prediction:

data = df2.reset_index()[['timestamp', 'Active_Power', 'Current_Phase_Average', 'Weather_Temperature_Celsius',
                          'Weather_Relative_Humidity', 'Global_Horizontal_Radiation', 'Diffuse_Horizontal_Radiation',
                          'Wind_Speed', 'Wind_Direction']]
data.index = pd.Index(np.repeat(0, data.shape[0]), name='unique_id')
data.reset_index(inplace=True)
df = data.sort_values(['unique_id', 'timestamp']).groupby('unique_id', as_index=False).apply(lambda x: x.fillna(method='ffill'))
train = df.loc[df['timestamp'] < '2016-06-01']
valid = df.loc[(df['timestamp'] >= '2016-06-01') & (df['timestamp'] < '2016-06-30')]
models = [XGBRegressor(random_state=0, n_estimators=100)]
model = MLForecast(models=models,
                   freq='5T',
                   lags=[12],
                   lag_transforms={
                       1: [(rolling_mean, 12), (rolling_max, 12), (rolling_min, 12)],
                   },
                   date_features=['dayofweek', 'month'],
                   num_threads=6)
model.fit(train, id_col='unique_id', time_col='timestamp', target_col='Active_Power', fitted=True, static_features=[])
p = model.predict(horizon=5, X_df=valid)

ValueError Traceback (most recent call last)

584     ts = self.ts

--> 586 forecasts = ts.predict(
    587     models=self.models_,
    588     horizon=h,
    589     dynamic_dfs=dynamic_dfs,
    590     before_predict_callback=before_predict_callback,
    591     after_predict_callback=after_predict_callback,
    592     X_df=X_df,
    593     ids=ids,
    594 )
    595 if level is not None:
    596     if self._cs_df is None:
    ...
    601     columns=[self.id_col, self.time_col, "_start", "_end"]
    602 )
    603 if getattr(self, "max_horizon", None) is None:

ValueError: Found missing inputs in X_df. It should have one row per id and date for the complete forecasting horizon

jmoralez commented 12 months ago

Hey @SyedKumailHussainNaqvi. Is the frequency of your series 5 minutes?

SyedKumailHussainNaqvi commented 12 months ago

@jmoralez Yes, the frequency of my data series is 5 minutes. Kindly check this:

timestamp Active_Power Current_Phase_Average Weather_Temperature_Celsius Weather_Relative_Humidity Global_Horizontal_Radiation Diffuse_Horizontal_Radiation Wind_Speed Wind_Direction
4/1/2016 8:55 102.1753 142.1885 24.47751 23.65278 498.1442 46.48696 3.221892 205.7688
4/1/2016 9:00 105.4211 147.5435 24.96117 23.06787 514.6549 45.32268 3.747602 122.7799
4/1/2016 9:05 108.4093 152.4971 25.13794 22.7556 536.3303 49.34752 3.553297 157.5239
4/1/2016 9:10 111.1481 157.1644 25.4412 22.6231 548.3616 46.07468 2.969267 109.3303
4/1/2016 9:15 113.9153 161.9033 25.80492 22.19402 553.215 45.70579 3.401344 142.9177
4/1/2016 9:20 116.4639 165.9414 25.93542 21.78914 568.1932 48.49394 3.898342 167.0238
4/1/2016 9:25 119.0665 170.5314 25.80478 21.97729 587.751 50.5708 3.913895 128.7222
4/1/2016 9:30 121.7267 175.1911 26.32808 21.42587 604.8317 49.5583 4.04404 90.76111
4/1/2016 9:35 124.4063 179.5768 26.7749 21.0956 621.0855 48.98398 2.631776 194.7342
4/1/2016 9:40 126.924 183.7551 27.03691 20.37064 637.8745 49.20296 3.320363 67.64616
4/1/2016 9:45 129.3882 187.7454 27.38554 19.96036 656.485 51.33382 3.324487 134.8194
4/1/2016 9:50 131.626 191.4583 27.71058 19.74454 671.711 50.71244 2.800914 144.7963
4/1/2016 9:55 133.8582 195.2888 28.35464 19.16202 685.3091 48.57825 2.220385 121.1364
4/1/2016 10:00 136.06 198.774 28.39381 18.71133 702.8945 51.30513 3.279328 132.584
4/1/2016 10:05 137.8313 201.848 28.58586 18.68137 716.3569 51.98991 2.730431 93.9258
4/1/2016 10:10 139.678 204.9987 29.00494 18.10191 730.3823 52.61768 2.785331 124.0541
4/1/2016 10:15 141.4433 207.9873 28.79097 18.06579 740.2993 54.5131 2.611753 118.2188
4/1/2016 10:20 142.9773 210.5444 28.58105 18.09788 755.9532 57.29346 2.954512 168.503
4/1/2016 10:25 144.6015 213.3357 29.21808 17.5524 767.1425 55.70409 2.355037 125.6492
4/1/2016 10:30 145.9677 215.7586 29.67789 16.83503 781.8422 57.83193 2.358226 142.6517
4/1/2016 10:35 147.2709 218.0466 29.55465 16.96823 796.9732 61.30234 3.018474 146.7362
4/1/2016 10:40 148.7956 220.4369 29.61863 16.85245 811.0737 62.71655 2.493661 133.6768
4/1/2016 10:45 149.5706 222.0094 29.78173 17.16988 821.0898 63.57846 1.97588 264.2258
4/1/2016 10:50 151.0038 224.2357 29.65329 17.13947 829.7411 59.55241 1.933785 218.3842
4/1/2016 10:55 152.1304 226.2266 29.97485 16.63404 838.7658 57.79825 2.962153 196.2876
4/1/2016 11:00 153.6104 228.596 29.96729 16.53452 849.1298 56.9832 2.532594 66.07544
4/1/2016 11:05 153.7445 229.2958 30.40237 16.02146 854.5871 53.07868 1.374275 226.616
4/1/2016 11:10 154.7077 231.1436 30.74542 15.66695 868.4718 55.00143 2.591314 231.6353
4/1/2016 11:15 155.9508 233.1486 30.60402 15.48819 884.5474 56.15897 2.011981 177.2232

jmoralez commented 12 months ago

That check verifies that X_df contains the expected ids and dates. You can replicate it with:

dates_validation = pd.DataFrame({
    model.ts.id_col: model.ts.uids,
    "_start": model.ts.last_dates + model.ts.freq,
    "_end": model.ts.last_dates + horizon * model.ts.freq,
})

Can you verify if those dates and ids match the ones you're providing through valid?
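
To see which rows are actually missing, the comparison can also be run with pandas alone. The following is a minimal sketch, not mlforecast's own check: the helper name missing_future_rows, its parameters, and the default column names are hypothetical, and it assumes a fixed-step frequency (5 minutes here) and that you supply the last training timestamp per id yourself.

```python
import pandas as pd


def missing_future_rows(last_dates, X_df, h, freq="5min", id_col="unique_id", time_col="ds"):
    """Return the (id, date) pairs expected for the next h steps that X_df lacks.

    Hypothetical helper for illustration; not part of mlforecast.
    last_dates maps each series id to its last training timestamp.
    """
    step = pd.Timedelta(freq)  # assumes a fixed-length step such as 5 minutes
    # One expected row per id and per step of the forecasting horizon.
    expected = pd.DataFrame(
        [(uid, last + k * step) for uid, last in last_dates.items() for k in range(1, h + 1)],
        columns=[id_col, time_col],
    )
    # Left join with an indicator column: "left_only" rows are absent from X_df.
    merged = expected.merge(
        X_df[[id_col, time_col]].drop_duplicates(),
        on=[id_col, time_col],
        how="left",
        indicator=True,
    )
    return merged.loc[merged["_merge"] == "left_only", [id_col, time_col]].reset_index(drop=True)


# Training ends at 18:30, but the validation frame only starts the next morning.
last_dates = {1.0: pd.Timestamp("2016-05-15 18:30:00")}
valid = pd.DataFrame({
    "unique_id": 1.0,
    "ds": pd.date_range("2016-05-16 06:55:00", periods=5, freq="5min"),
})
missing = missing_future_rows(last_dates, valid, h=5)
print(missing)
```

With these inputs every expected timestamp from 18:35 to 18:55 is reported as missing, because the validation frame contains none of them.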

SyedKumailHussainNaqvi commented 12 months ago

@jmoralez Thank you so much for your kind guidance. I ran the code above and its output is below:

unique_id | _start | _end
1.0 | 2016-05-15 18:35:00 | 2016-05-15 18:05:00

However, my data series starts at 06:55 and ends at 18:30. What should I do now?

jmoralez commented 12 months ago

Those dates are built from the last timestamp the model saw during training. Do you have missing timestamps?
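
One way to answer that question is to look for jumps larger than the expected step between consecutive rows of each series. This is a rough sketch with pandas only; the function name find_gaps and the column names are assumptions made for illustration, and the sample series is synthetic (two days restricted to 06:55 through 18:30, mirroring the window discussed in this thread).

```python
import pandas as pd


def find_gaps(df, freq="5min", id_col="unique_id", time_col="ds"):
    """List places where consecutive timestamps of a series are more than one step apart.

    Hypothetical helper for illustration; assumes a fixed-length frequency.
    """
    step = pd.Timedelta(freq)
    ordered = df.sort_values([id_col, time_col])
    # Previous timestamp within the same series (NaT for the first row of each id).
    previous = ordered.groupby(id_col)[time_col].shift()
    report = ordered.assign(previous=previous, gap=ordered[time_col] - previous)
    return report.loc[report["gap"] > step, [id_col, "previous", time_col, "gap"]]


# Synthetic series: two days, each covering only 06:55 to 18:30.
day1 = pd.date_range("2016-05-15 06:55", "2016-05-15 18:30", freq="5min")
day2 = pd.date_range("2016-05-16 06:55", "2016-05-16 18:30", freq="5min")
series = pd.DataFrame({"unique_id": 1.0, "ds": day1.append(day2)})

gaps = find_gaps(series)
print(gaps)
```

Any row in the result marks a stretch of timestamps that a regular 5-minute grid would contain but the data does not.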

SyedKumailHussainNaqvi commented 12 months ago

This is my model and fit code, kindly review it:

models = [XGBRegressor(random_state=0, n_estimators=100)]
model = MLForecast(models=models, freq='5T', num_threads=6)
model.fit(train, id_col='unique_id', time_col='ds', target_col='y', fitted=True, static_features=[])

And this is the training data series:

ds y Current_Phase_Average Weather_Temperature_Celsius Weather_Relative_Humidity Global_Horizontal_Radiation Diffuse_Horizontal_Radiation Wind_Speed Wind_Direction unique_id
2016-04-01 08:55:00 102.175270 142.188522 24.477514 23.652782 498.144226 46.486958 3.221892 205.768753 1.0
2016-04-01 09:00:00 105.421097 147.543472 24.961168 23.067873 514.654907 45.322678 3.747602 122.779907 1.0
2016-04-01 09:05:00 108.409271 152.497131 25.137936 22.755598 536.330322 49.347523 3.553297 157.523910 1.0
2016-04-01 09:10:00 111.148140 157.164398 25.441204 22.623100 548.361572 46.074684 2.969267 109.330276 1.0
2016-04-01 09:15:00 113.915314 161.903259 25.804924 22.194019 553.215027 45.705791 3.401344 142.917725 1.0

...
2016-05-15 18:10:00 | 0.000000 | 7.203154 | 23.014162 | 24.857141 | 6.644276 | 4.979178 | 1.898691 | 135.060287 | 1.0
2016-05-15 18:15:00 | 0.000000 | 6.224843 | 22.699066 | 25.454231 | 5.833419 | 4.189736 | 1.728728 | 121.308731 | 1.0
2016-05-15 18:20:00 | 0.000000 | 6.142416 | 22.338396 | 26.381437 | 4.928545 | 3.340665 | 1.659863 | 118.639999 | 1.0
2016-05-15 18:25:00 | 0.000000 | 6.142416 | 22.088547 | 26.944998 | 4.767952 | 3.199787 | 1.453017 | 120.521439 | 1.0
2016-05-15 18:30:00 | 0.000000 | 6.142416 | 21.587543 | 28.438608 | 5.341975 | 3.628179 | 1.357174 | 113.915932 | 1.0

jmoralez commented 12 months ago

So valid should start at "2016-05-15 18:35:00", which is what the check is verifying. Why does yours start at a different timestamp?

SyedKumailHussainNaqvi commented 12 months ago

This is a PV dataset, and I am doing ultra-short-term photovoltaic power forecasting one hour ahead. The frequency of the dataset is 5 minutes. Because the power output of the photovoltaic modules is significantly lower in the morning and evening (it is 0 or close to 0 most of the time), only the power between 6:55 and 18:30 is considered. That is why the valid dataset starts at "2016-05-16 06:55:00", the next day.

SyedKumailHussainNaqvi commented 12 months ago

I changed the train dataset split so that it now ends at "2016-05-15 18:00:00". The dates validation output is below:

unique_id | _start | _end
1.0 | 2016-05-15 18:05:00 | 2016-05-15 18:25:00

Now the predict function executes successfully, but it only predicts the next 5 values of the validation dataset, as shown below:

unique_id | ds | XGBRegressor | y
1.0 | 2016-05-15 18:05:00 | 0.353545 | 0.0
1.0 | 2016-05-15 18:10:00 | 0.159547 | 0.0
1.0 | 2016-05-15 18:15:00 | 0.407333 | 0.0
1.0 | 2016-05-15 18:20:00 | 0.192743 | 0.0
1.0 | 2016-05-15 18:25:00 | 0.492636 | 0.0

However, my validation dataset has 3646 rows × 10 columns. Kindly guide me on how I can predict the remaining values of the valid dataset.

jmoralez commented 12 months ago

The number of predictions is controlled by the h argument of MLForecast.predict. If your dates are successive you should be able to just use a bigger h, e.g. h=3646.

SyedKumailHussainNaqvi commented 12 months ago

When I set h=3646, the same ValueError occurs: "Found missing inputs in X_df. It should have one row per id and date for the complete forecasting horizon". Please guide me on how I can run predictions on the valid dataset.

jmoralez commented 12 months ago

Are you including the missing timestamps in the training set? If you're not, you can just use integer timestamps instead, e.g.

data['timestamp'] = data.sort_values(['unique_id', 'timestamp']).groupby('unique_id').cumcount()
model = MLForecast(models=models,
                   freq=1,  # this will advance each timestamp by 1 when predicting
                   lags=[12],
                   lag_transforms={
                       1: [(rolling_mean, 12), (rolling_max, 12), (rolling_min, 12)],
                   },
                   # date_features=['dayofweek', 'month'],  # you can't use date features anymore
                   num_threads=6)
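
As a rough illustration of the integer-timestamp idea on its own, the sketch below numbers the rows of each series with cumcount and keeps a lookup back to the original timestamps. It uses pandas only; the step column, the lookup frame, and the split date are assumptions made for this example, and the four timestamps are taken from the overnight boundary discussed in this thread.

```python
import pandas as pd

# Four rows straddling the overnight gap (18:30 to 06:55 the next day).
df = pd.DataFrame({
    "unique_id": [1.0] * 4,
    "timestamp": pd.to_datetime([
        "2016-05-15 18:25:00",
        "2016-05-15 18:30:00",
        "2016-05-16 06:55:00",
        "2016-05-16 07:00:00",
    ]),
})

# Number the rows of each series 0, 1, 2, ... in time order.
df = df.sort_values(["unique_id", "timestamp"]).reset_index(drop=True)
df["step"] = df.groupby("unique_id").cumcount()

# Keep a lookup so integer steps can be mapped back to real timestamps later.
lookup = df[["unique_id", "step", "timestamp"]].copy()

# Split after numbering, so the validation steps continue where training stops.
train = df[df["timestamp"] < "2016-05-16"]
valid = df[df["timestamp"] >= "2016-05-16"]
print(train["step"].tolist(), valid["step"].tolist())
```

Numbering before splitting is the design choice that matters here: the first validation step is exactly one more than the last training step, so there is no gap at the boundary even though the real timestamps jump overnight.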

SyedKumailHussainNaqvi commented 11 months ago

@jmoralez Thank you very much for your kind response and guidance, and sorry for the late reply.