In predict Function, Found missing inputs in X_df. It should have one row per id and date for the complete forecasting horizon

SyedKumailHussainNaqvi commented 12 months ago

I am training an XGBoost model on my dataset for time series prediction. After fitting the model on the training set, the predict function fails on the validation set. How can I solve this?

102.175270 | 142.188522 | 24.477514 | 23.652782 | 498.144226 | 46.486958 | 3.221892 | 205.768753
105.421097 | 147.543472 | 24.961168 | 23.067873 | 514.654907 | 45.322678 | 3.747602 | 122.779907
108.409271 | 152.497131 | 25.137936 | 22.755598 | 536.330322 | 49.347523 | 3.553297 | 157.523910
111.148140 | 157.164398 | 25.441204 | 22.623100 | 548.361572 | 46.074684 | 2.969267 | 109.330276
113.915314 | 161.903259 | 25.804924 | 22.194019 | 553.215027 | 45.705791 | 3.401344 | 142.917725
0.000000 | 6.104828 | 0.000000 | 4.234369 | 0.377419 | 0.425231 | 0.450622 | 0.000000
0.000000 | 6.103529 | 0.000000 | 4.234369 | 0.377419 | 0.425231 | 0.450622 | 0.000000
0.000000 | 6.100151 | 0.000000 | 4.234369 | 0.377419 | 0.425231 | 0.450622 | 0.000000
0.000000 | 6.101910 | 0.000000 | 4.234369 | 0.377419 | 0.425231 | 0.450622 | 0.000000
0.000000 | 6.096731 | 0.000000 | 4.234369 | 0.377419 | 0.425231 | 0.450622 | 0.000000

Code for training and prediction:

data = df2.reset_index()[['timestamp', 'Active_Power', 'Current_Phase_Average', 'Weather_Temperature_Celsius', 'Weather_Relative_Humidity', 'Global_Horizontal_Radiation', 'Diffuse_Horizontal_Radiation', 'Wind_Speed', 'Wind_Direction']]
data.index = pd.Index(np.repeat(0, data.shape[0]), name='unique_id')
data.reset_index(inplace=True)
df = data.sort_values(['unique_id', 'timestamp']).groupby('unique_id', as_index=False).apply(lambda x: x.fillna(method='ffill'))
train = df.loc[df['timestamp'] < '2016-06-01']
valid = df.loc[(df['timestamp'] >= '2016-06-01') & (df['timestamp'] < '2016-06-30')]
models = [XGBRegressor(random_state=0, n_estimators=100)]
model = MLForecast(
    models=models,
    freq='5T',
    lags=[12],
    lag_transforms={
        1: [(rolling_mean, 12), (rolling_max, 12), (rolling_min, 12)],
    },
    date_features=['dayofweek', 'month'],
    num_threads=6,
)
model.fit(train, id_col='unique_id', time_col='timestamp', target_col='Active_Power', fitted=True, static_features=[])
p = model.predict(horizon=5, X_df=valid)

ValueError Traceback (most recent call last)

    584 ts = self.ts
--> 586 forecasts = ts.predict(
    587     models=self.models_,
    588     horizon=h,
    589     dynamic_dfs=dynamic_dfs,
    590     before_predict_callback=before_predict_callback,
    591     after_predict_callback=after_predict_callback,
    592     X_df=X_df,
    593     ids=ids,
    594 )
    595 if level is not None:
    596     if self._cs_df is None:
    ...
    601     columns=[self.id_col, self.time_col, "_start", "_end"]
    602 )
    603 if getattr(self, "max_horizon", None) is None:

ValueError: Found missing inputs in X_df. It should have one row per id and date for the complete forecasting horizon

jmoralez commented 12 months ago

Hey @SyedKumailHussainNaqvi. Is the frequency of your series 5 minutes?

SyedKumailHussainNaqvi commented 12 months ago

@jmoralez Yes, the frequency of my series is 5 minutes. Please check this sample:

timestamp	Active_Power	Current_Phase_Average	Weather_Temperature_Celsius	Weather_Relative_Humidity	Global_Horizontal_Radiation	Diffuse_Horizontal_Radiation	Wind_Speed	Wind_Direction
4/1/2016 8:55	102.1753	142.1885	24.47751	23.65278	498.1442	46.48696	3.221892	205.7688
4/1/2016 9:00	105.4211	147.5435	24.96117	23.06787	514.6549	45.32268	3.747602	122.7799
4/1/2016 9:05	108.4093	152.4971	25.13794	22.7556	536.3303	49.34752	3.553297	157.5239
4/1/2016 9:10	111.1481	157.1644	25.4412	22.6231	548.3616	46.07468	2.969267	109.3303
4/1/2016 9:15	113.9153	161.9033	25.80492	22.19402	553.215	45.70579	3.401344	142.9177
4/1/2016 9:20	116.4639	165.9414	25.93542	21.78914	568.1932	48.49394	3.898342	167.0238
4/1/2016 9:25	119.0665	170.5314	25.80478	21.97729	587.751	50.5708	3.913895	128.7222
4/1/2016 9:30	121.7267	175.1911	26.32808	21.42587	604.8317	49.5583	4.04404	90.76111
4/1/2016 9:35	124.4063	179.5768	26.7749	21.0956	621.0855	48.98398	2.631776	194.7342
4/1/2016 9:40	126.924	183.7551	27.03691	20.37064	637.8745	49.20296	3.320363	67.64616
4/1/2016 9:45	129.3882	187.7454	27.38554	19.96036	656.485	51.33382	3.324487	134.8194
4/1/2016 9:50	131.626	191.4583	27.71058	19.74454	671.711	50.71244	2.800914	144.7963
4/1/2016 9:55	133.8582	195.2888	28.35464	19.16202	685.3091	48.57825	2.220385	121.1364
4/1/2016 10:00	136.06	198.774	28.39381	18.71133	702.8945	51.30513	3.279328	132.584
4/1/2016 10:05	137.8313	201.848	28.58586	18.68137	716.3569	51.98991	2.730431	93.9258
4/1/2016 10:10	139.678	204.9987	29.00494	18.10191	730.3823	52.61768	2.785331	124.0541
4/1/2016 10:15	141.4433	207.9873	28.79097	18.06579	740.2993	54.5131	2.611753	118.2188
4/1/2016 10:20	142.9773	210.5444	28.58105	18.09788	755.9532	57.29346	2.954512	168.503
4/1/2016 10:25	144.6015	213.3357	29.21808	17.5524	767.1425	55.70409	2.355037	125.6492
4/1/2016 10:30	145.9677	215.7586	29.67789	16.83503	781.8422	57.83193	2.358226	142.6517
4/1/2016 10:35	147.2709	218.0466	29.55465	16.96823	796.9732	61.30234	3.018474	146.7362
4/1/2016 10:40	148.7956	220.4369	29.61863	16.85245	811.0737	62.71655	2.493661	133.6768
4/1/2016 10:45	149.5706	222.0094	29.78173	17.16988	821.0898	63.57846	1.97588	264.2258
4/1/2016 10:50	151.0038	224.2357	29.65329	17.13947	829.7411	59.55241	1.933785	218.3842
4/1/2016 10:55	152.1304	226.2266	29.97485	16.63404	838.7658	57.79825	2.962153	196.2876
4/1/2016 11:00	153.6104	228.596	29.96729	16.53452	849.1298	56.9832	2.532594	66.07544
4/1/2016 11:05	153.7445	229.2958	30.40237	16.02146	854.5871	53.07868	1.374275	226.616
4/1/2016 11:10	154.7077	231.1436	30.74542	15.66695	868.4718	55.00143	2.591314	231.6353
4/1/2016 11:15	155.9508	233.1486	30.60402	15.48819	884.5474	56.15897	2.011981	177.2232

jmoralez commented 12 months ago

What that's doing is verifying that it gets the expected ids and dates in X_df. You can replicate the check by using:

dates_validation = pd.DataFrame({
    model.ts.id_col: model.ts.uids,
    "_start": model.ts.last_dates + model.ts.freq,
    "_end": model.ts.last_dates + horizon * model.ts.freq,
})

Can you verify if those dates and ids match the ones you're providing through valid?
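As a rough illustration of that comparison, here is a minimal pandas-only sketch. It assumes a single series at 5-minute frequency; the helper name `missing_future_rows` and the toy frames are hypothetical and not part of mlforecast, and the check inside the library may differ in detail. The frequency is written as "5min", which should be equivalent to the '5T' alias used above.

```python
import pandas as pd


def missing_future_rows(last_dates, x_df, horizon, freq,
                        id_col="unique_id", time_col="timestamp"):
    """Return the (id, timestamp) pairs expected for the horizon but absent from x_df.

    last_dates: Series indexed by id holding the last training timestamp per id.
    """
    offset = pd.tseries.frequencies.to_offset(freq)
    frames = []
    for uid, last in last_dates.items():
        # The forecast grid starts one step after the last training timestamp.
        dates = pd.date_range(start=last + offset, periods=horizon, freq=offset)
        frames.append(pd.DataFrame({id_col: uid, time_col: dates}))
    expected = pd.concat(frames, ignore_index=True)
    merged = expected.merge(
        x_df[[id_col, time_col]], on=[id_col, time_col], how="left", indicator=True
    )
    missing = merged.loc[merged["_merge"] == "left_only", [id_col, time_col]]
    return missing.reset_index(drop=True)


# Toy data (illustrative only): training ends in the evening,
# while the validation frame starts the next morning.
last_dates = pd.Series([pd.Timestamp("2016-05-15 18:30")], index=[0])
valid = pd.DataFrame({
    "unique_id": 0,
    "timestamp": pd.date_range("2016-05-16 06:55", periods=5, freq="5min"),
})
missing = missing_future_rows(last_dates, valid, horizon=5, freq="5min")
print(missing)
```

Any row printed here is a timestamp the forecast expects but the provided frame does not contain; an empty result means the frame covers the whole horizon.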

SyedKumailHussainNaqvi commented 12 months ago

@jmoralez Thank you so much for your kind guidance. I ran the above code and its output is below:

unique_id | _start | _end
1.0 | 2016-05-15 18:35:00 | 2016-05-15 18:05:00

But my data series starts at 06:55 and ends at 18:30 each day. What should I do now?

jmoralez commented 12 months ago

Those dates are built from the last timestamp the model saw during training. Do you have missing timestamps?
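One way to answer that question is to look for jumps larger than the expected step. The sketch below uses plain pandas on made-up timestamps; the helper name `find_gaps` is hypothetical, and the 06:55 to 18:30 daily window is only assumed for the toy data.

```python
import pandas as pd


def find_gaps(timestamps, freq="5min"):
    """Return one row per jump larger than freq: where it starts, where it ends,
    and how many timestamps are missing in between."""
    ts = pd.Series(pd.to_datetime(timestamps)).sort_values().reset_index(drop=True)
    step = pd.Timedelta(freq)
    deltas = ts.diff()
    jumps = deltas[deltas > step]
    return pd.DataFrame({
        "gap_start": ts.shift(1)[jumps.index].to_numpy(),
        "gap_end": ts[jumps.index].to_numpy(),
        "missing_steps": (jumps // step - 1).astype(int).to_numpy(),
    })


# Toy data (illustrative only): two days, each recorded from 06:55 to 18:30.
day1 = pd.date_range("2016-05-15 06:55", "2016-05-15 18:30", freq="5min")
day2 = pd.date_range("2016-05-16 06:55", "2016-05-16 18:30", freq="5min")
gaps = find_gaps(list(day1) + list(day2))
print(gaps)
```

If this returns any rows, the series is not a continuous 5-minute grid, so a forecast horizon built by repeatedly adding 5 minutes will include timestamps that the data never contains.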

SyedKumailHussainNaqvi commented 12 months ago

This is my model and fit code; please review it.

models = [XGBRegressor(random_state=0, n_estimators=100)]
model = MLForecast(models=models, freq='5T', num_threads=6)

model.fit(train, id_col='unique_id', time_col='ds', target_col='y',fitted=True, static_features=[])

And this is the training series:

ds	y	Current_Phase_Average	Weather_Temperature_Celsius	Weather_Relative_Humidity	Global_Horizontal_Radiation	Diffuse_Horizontal_Radiation	Wind_Speed	Wind_Direction	unique_id
2016-04-01 08:55:00	102.175270	142.188522	24.477514	23.652782	498.144226	46.486958	3.221892	205.768753	1.0
2016-04-01 09:00:00	105.421097	147.543472	24.961168	23.067873	514.654907	45.322678	3.747602	122.779907	1.0
2016-04-01 09:05:00	108.409271	152.497131	25.137936	22.755598	536.330322	49.347523	3.553297	157.523910	1.0
2016-04-01 09:10:00	111.148140	157.164398	25.441204	22.623100	548.361572	46.074684	2.969267	109.330276	1.0
2016-04-01 09:15:00	113.915314	161.903259	25.804924	22.194019	553.215027	45.705791	3.401344	142.917725	1.0

....... ....... ......
2016-05-15 18:10:00 | 0.000000 | 7.203154 | 23.014162 | 24.857141 | 6.644276 | 4.979178 | 1.898691 | 135.060287 | 1.0
2016-05-15 18:15:00 | 0.000000 | 6.224843 | 22.699066 | 25.454231 | 5.833419 | 4.189736 | 1.728728 | 121.308731 | 1.0
2016-05-15 18:20:00 | 0.000000 | 6.142416 | 22.338396 | 26.381437 | 4.928545 | 3.340665 | 1.659863 | 118.639999 | 1.0
2016-05-15 18:25:00 | 0.000000 | 6.142416 | 22.088547 | 26.944998 | 4.767952 | 3.199787 | 1.453017 | 120.521439 | 1.0
2016-05-15 18:30:00 | 0.000000 | 6.142416 | 21.587543 | 28.438608 | 5.341975 | 3.628179 | 1.357174 | 113.915932 | 1.0

jmoralez commented 12 months ago

So valid should start at "2016-05-15 18:35:00", which is what the check is verifying. Why does yours start at a different timestamp?

SyedKumailHussainNaqvi commented 12 months ago

This is a PV dataset. I am doing ultra-short-term photovoltaic power forecasting one hour ahead, and the frequency of the dataset is 5 minutes. Because the power output of the photovoltaic modules is significantly lower in the morning and evening (it is 0 or close to 0 most of the time), only the power between 6:55 and 18:30 is considered. That is why the validation set starts at "2016-05-16 06:55:00", the next day.

SyedKumailHussainNaqvi commented 12 months ago

I changed the train split so that it now ends at "2016-05-15 18:00:00". The dates validation output is:

unique_id | _start | _end
1.0 | 2016-05-15 18:05:00 | 2016-05-15 18:25:00

Now the predict function executes successfully, but it only predicts the next 5 values of the validation set:

unique_id | ds | XGBRegressor | y
1.0 | 2016-05-15 18:05:00 | 0.353545 | 0.0
1.0 | 2016-05-15 18:10:00 | 0.159547 | 0.0
1.0 | 2016-05-15 18:15:00 | 0.407333 | 0.0
1.0 | 2016-05-15 18:20:00 | 0.192743 | 0.0
1.0 | 2016-05-15 18:25:00 | 0.492636 | 0.0

My validation set has 3646 rows × 10 columns. How can I predict the remaining values of the validation set?

jmoralez commented 12 months ago

The number of predictions is controlled by the h argument of MLForecast.predict. If your dates are successive you should be able to just use a bigger h, e.g. h=3646.

SyedKumailHussainNaqvi commented 12 months ago

When I set h=3646, the same ValueError comes up: "Found missing inputs in X_df. It should have one row per id and date for the complete forecasting horizon". How can I run the model's predictions on the validation set?

jmoralez commented 12 months ago

Are you including the missing timestamps in the training set? If you're not, you can just use integer timestamps instead, e.g.

data['timestamp'] = data.sort_values(['unique_id', 'timestamp']).groupby('unique_id').cumcount()
model = MLForecast(
    models=models,
    freq=1,  # this will advance each timestamp by 1 when predicting
    lags=[12],
    lag_transforms={
        1: [(rolling_mean, 12), (rolling_max, 12), (rolling_min, 12)],
    },
    # date_features=['dayofweek', 'month'],  # you can't use date features anymore
    num_threads=6,
)
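If the real timestamps are still needed after switching to integer steps, one option is to keep a lookup table and join it back onto the forecasts. This is only a sketch on made-up data: the column name `step`, the `lookup` frame, and the placeholder prediction values are assumptions for illustration, not mlforecast output.

```python
import pandas as pd

# Toy data (illustrative only): the end of one day and the start of the next.
data = pd.DataFrame({
    "unique_id": 0,
    "timestamp": list(pd.date_range("2016-05-15 18:20", periods=3, freq="5min"))
    + list(pd.date_range("2016-05-16 06:55", periods=3, freq="5min")),
    "Active_Power": [0.0, 0.0, 0.0, 1.2, 2.5, 3.9],
})

# Number the rows of each series consecutively, ignoring the overnight gap.
data = data.sort_values(["unique_id", "timestamp"]).reset_index(drop=True)
data["step"] = data.groupby("unique_id").cumcount()

# Keep the step-to-timestamp mapping so results can be reported in real time.
lookup = data[["unique_id", "step", "timestamp"]]

# Placeholder forecasts on the integer axis (the values are made up).
preds = pd.DataFrame({"unique_id": 0, "step": [3, 4, 5], "XGBRegressor": [1.0, 2.0, 3.0]})
preds = preds.merge(lookup, on=["unique_id", "step"], how="left")
print(preds)
```

The join only recovers timestamps for steps that exist in the lookup, so it suits validation periods whose timestamps are already known rather than steps beyond the end of the data.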

SyedKumailHussainNaqvi commented 11 months ago

@jmoralez Thank you a lot for your kind response & guidance and so sorry for the late reply.

Nixtla / mlforecast

In predict Function, Found missing inputs in X_df. It should have one row per id and date for the complete forecasting horizon #242