Nixtla / mlforecast

Scalable machine 🤖 learning for time series forecasting.
https://nixtlaverse.nixtla.io/mlforecast
Apache License 2.0

Found missing inputs in X_df. It should have one row per id and time for the complete forecasting horizon. #336

Closed aaron9980 closed 2 months ago

aaron9980 commented 2 months ago

What happened + What you expected to happen

Hi, I'm trying to replicate the M5 forecast-eval notebook, but it raises the following error: ValueError: Found missing inputs in X_df. It should have one row per id and time for the complete forecasting horizon. You can get the expected structure by running MLForecast.make_future_dataframe(h) or get the missing combinations in your current X_df by running MLForecast.get_missing_future(h, X_df). I ran both functions and found that X_df looks correct, while get_missing_future() reports rows as missing even though nothing is missing, and the dates it reports are in the past. I did not change any of the code from the M5 forecast-eval notebook, so I'm confused about what went wrong.

Versions / Dependencies

mlforecast version: 0.11.2
coreforecast version: 0.0.3 (I had to install this to avoid an error when running pip install -qqq "mlforecast[lag_transforms]")
Python: 3.9.7
OS: Windows (running in Jupyter Notebook)

Reproduction script

%%time
fcst.fit(
    long,
    id_col='id',
    time_col='date',
    target_col='y',
    static_features=['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'],
)

%time preds = fcst.predict(28, X_df=X_df)

Issue Severity

High: It blocks me from completing my task.

jmoralez commented 2 months ago

Hey. I'm able to run the notebook end to end as-is. Are you also using kaggle? If you're not, where are you getting the data from?

"get_missing_future() gives the wrong missing data as there are no missing data and the missing data are past dates"

What that does is take the end dates for each id with something like: long.groupby('id')['date'].max() and apply the freq offset (from the constructor) to build the dates in the forecasting horizon, so if that's producing dates that are in the past it means that your long df doesn't have them.
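A minimal sketch of the logic described above, for anyone who wants to check their own data by hand. This is not the mlforecast implementation: `expected_future` is a hypothetical helper, and the toy `long` frame and the daily frequency are assumptions for illustration only.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def expected_future(long, h, freq, id_col="id", time_col="date"):
    """Hypothetical helper: build the (id, date) rows expected for a horizon h.

    Takes the last observed date per id and applies the frequency offset
    h times, mirroring the description in the comment above.
    """
    offset = to_offset(freq)
    last_dates = long.groupby(id_col, observed=True)[time_col].max()
    rows = []
    for uid, last in last_dates.items():
        for step in range(1, h + 1):
            rows.append({id_col: uid, time_col: last + step * offset})
    return pd.DataFrame(rows)


# Toy data (assumed): series "b" ends one day earlier than series "a".
long = pd.DataFrame(
    {
        "id": ["a", "a", "a", "b", "b"],
        "date": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02"]
        ),
        "y": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
future = expected_future(long, h=2, freq="D")
print(future)
```

If the expected dates for an id start earlier than you think they should, this sketch suggests checking the last date that id actually has in `long`.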

aaron9980 commented 2 months ago

Hi, yes, I am using Kaggle and the data from the M5 competition. I ran it end to end and it gave me the same error. The only changes I made were installing coreforecast version 0.0.3, because I kept getting the error (AttributeError: module 'coreforecast.lag_transforms' has no attribute 'BaseLagTransform'), and removing the input path from all the data-loading functions, since my files are in the same directory as the notebook. I appreciate the reply, and I'll look into what went wrong when creating the long df.

aaron9980 commented 2 months ago

Hi, according to your reply, when get_missing_future() returns a date in the past, it means that the long df does not have it. However, in my screenshots the long df contains the dates that get_missing_future() says are missing. Is there a reason for that? Furthermore, get_missing_future() returns 796030 rows of missing data, which seems like a lot considering that X_df contains 853720 rows, the same number that MLForecast.make_future_dataframe(h) generates. In fact, the dates and ids are the same; X_df just contains additional variables. [Screenshot 2024-04-18 205428]

aaron9980 commented 2 months ago

Update: I tried forecasting one of the 'valid' products and the forecast worked, so I think the issue isn't compatibility; the problem probably occurred when processing the data. I will try to find a fix.

jmoralez commented 2 months ago

We perform a join with the expected and the X_df. Is it possible that the ids in long and X_df have a different type?
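A quick way to test this hypothesis is to compare the dtypes of the id columns and then run the join by hand. The sketch below is an assumption-laden illustration, not the library's internal code: `find_missing` is a hypothetical helper, the toy frames are made up, and casting both id columns to `str` is just one way to take the dtype out of the picture.

```python
import pandas as pd

# Toy frames (assumed): ids are categorical on one side and plain objects on the other.
dates = pd.to_datetime(["2024-01-04", "2024-01-05", "2024-01-03", "2024-01-04"])
expected = pd.DataFrame({"id": pd.Categorical(["a", "a", "b", "b"]), "date": dates})
X_df = pd.DataFrame(
    {"id": ["a", "a", "b", "b"], "date": dates, "price": [1.0, 1.1, 2.0, 2.1]}
)

# Step 1: check whether the join keys share a dtype.
dtypes_match = expected["id"].dtype == X_df["id"].dtype
print("id dtypes match:", dtypes_match)


def find_missing(expected, X_df, id_col="id", time_col="date"):
    """Hypothetical helper: (id, date) rows in `expected` that are absent from `X_df`."""
    left = expected[[id_col, time_col]].copy()
    right = X_df[[id_col, time_col]].copy()
    # Align the key dtype on both sides before joining.
    left[id_col] = left[id_col].astype(str)
    right[id_col] = right[id_col].astype(str)
    merged = left.merge(right, on=[id_col, time_col], how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only", [id_col, time_col]]


# Step 2: with aligned dtypes, nothing should be reported as missing here.
missing = find_missing(expected, X_df)
print(missing)
```

If rows are still reported as missing after the dtypes are aligned, the mismatch is more likely in the dates themselves than in the ids.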

aaron9980 commented 2 months ago

Yep, they have different types. The id in long is a category, while the id in X_df is an object. The id in long has been a category since sales was melted. Should the id in X_df be converted to category? Update: I converted the id in X_df to category and tried to join it with the expected future dataframe, but the join was incomplete. I am trying to find out why joining on these keys does not work.

aaron9980 commented 2 months ago

When I looked at the expected output for one of the apparently missing products, HOBBIES_2_132_CA_1_evaluation, the expected-future function returned past dates, even though my long df does contain these dates. Is there a reason why this happens? I think it happens for most of the products being forecast. Sorry for the overwhelming number of questions; I appreciate your time. May I know which version you are using? Maybe it would work if I switched to your version.


jmoralez commented 2 months ago

Are you able to share the notebook (either through kaggle or here)? I'm not able to reproduce the problem.

aaron9980 commented 2 months ago

m5-mlforecast-eval.zip Here's the ZIP with the notebook. I'm currently on mlforecast version 0.11.2; however, I installed it from a local file because pip install could not find the older version.

jmoralez commented 2 months ago

I was able to reproduce the issue locally, and it seems to be due to my version of pandas: I was on 1.5.3, and upgrading to 2.2.2 fixed it. Can you try that? I'll still investigate the source of the problem in that version.

aaron9980 commented 2 months ago

Hi, your answer solved my problem. I realised I had a much older pandas, version 1.3.4. However, after I updated it I got too many errors from other packages due to dependencies, so I reinstalled Anaconda, which solved my issue. I'm still curious why the issue occurred, though. Anyway, thanks for your help, even though the issue was easy to fix.

jmoralez commented 2 months ago

Hey. This should be fixed by https://github.com/Nixtla/utilsforecast/pull/79, so you should be able to use pandas<2 with utilsforecast>=0.1.5.

I'm closing this, feel free to reopen if you encounter this issue again.
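For readers who want to check whether their environment matches the combination discussed in this thread, here is a small sketch. It only encodes what the comments above state (the problem appeared with pandas<2, and utilsforecast>=0.1.5 should include the fix); `parse_version` and `may_be_affected` are hypothetical helpers, not part of any Nixtla package.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(v):
    """Hypothetical helper: turn '1.5.3' (or '2.2.0rc1') into a tuple of ints."""
    parts = []
    for piece in v.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def may_be_affected(pandas_version, utilsforecast_version):
    """True when pandas<2 is combined with utilsforecast<0.1.5, per the thread."""
    old_pandas = parse_version(pandas_version) < (2,)
    old_utils = parse_version(utilsforecast_version) < (0, 1, 5)
    return old_pandas and old_utils


# Report on the current environment, if both packages are installed.
try:
    print(may_be_affected(version("pandas"), version("utilsforecast")))
except PackageNotFoundError as exc:
    print(f"Package not installed: {exc}")
```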

wmotkowska-inpost commented 1 month ago

Had the same error. It turned out that my weekly data was aggregated to Monday, while the model with frequency set to "W" aligns dates to Sunday, so the dates in the fitting input X and in the forecast horizon did not match. I changed my input dataframe to be aggregated to Sunday dates and everything worked.
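The mismatch described above can be seen directly in pandas, where the "W" alias is anchored on Sunday. The sketch below uses made-up Monday dates; whether switching the frequency to "W-MON" suits a given dataset, instead of re-aggregating to Sunday as done above, is an assumption to verify.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# Made-up weekly data stamped on Mondays.
monday_dates = pd.to_datetime(["2024-04-01", "2024-04-08", "2024-04-15"])
last_date = monday_dates.max()

# "W" is Sunday-anchored, so the next step from a Monday lands on a Sunday.
next_with_w = last_date + to_offset("W")
# "W-MON" is Monday-anchored, so the next step stays on Mondays.
next_with_w_mon = last_date + to_offset("W-MON")

print(next_with_w.date(), next_with_w.day_name())
print(next_with_w_mon.date(), next_with_w_mon.day_name())
```

With Monday-stamped data and "W", the expected future dates fall on Sundays, so a future frame built on Mondays would not line up with them.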