Open daniepi opened 7 months ago
Hi again,
I was digging into the code. I think the problem arises from `mdl_time_forecast`:
https://github.com/business-science/modeltime/blob/master/R/modeltime-forecast.R#L1034
The problem is that `mld$blueprint$recipe` is a trained recipe, as estimated on whatever the first series in the nested data happens to be:
https://github.com/business-science/modeltime/blob/master/R/modeltime-forecast.R#L927-L928
Hence, if the series do not all share the same time index, processing steps that remove features (like CORR and ZV) create a discrepancy between the data used to train the model for a given series and the data used to predict on it. This seems to cause problems for models like XGBoost, which expect a fixed set of features at predict time but receive a different one.
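To make the failure mode concrete, here is a minimal sketch (synthetic data, not from the issue) showing how `step_zv()` keeps different columns depending on which training set the recipe is prepped on:

```r
library(recipes)

# Two toy training sets: feature x2 is constant in df_a but varies in df_b.
df_a <- data.frame(y = 1:4, x1 = c(1, 2, 3, 4), x2 = c(5, 5, 5, 5))
df_b <- data.frame(y = 1:4, x1 = c(1, 2, 3, 4), x2 = c(5, 6, 7, 8))

rec <- recipe(y ~ ., data = df_a) %>% step_zv(all_predictors())

# Prepping on df_a drops x2 (zero variance there) ...
prep_a <- prep(rec, training = df_a)
# ... while prepping the same recipe spec on df_b keeps it.
prep_b <- prep(rec, training = df_b)

names(bake(prep_a, new_data = df_b))  # x2 is gone
names(bake(prep_b, new_data = df_b))  # x2 is present
```

A model fitted on the output of `prep_b` expects `x2`; feeding it data baked with `prep_a` produces exactly the feature-set mismatch described above.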
Ok, sorry, haven't had time to dig into it. But yeah, the logic there was that the recipe used on the first model can be used on the others. Might need to rethink that.
Hi @mdancho84, First and foremost, thanks for this amazing suite of `modeltime` packages. I am trying to model many individual time series using nested forecasting as mentioned here: https://business-science.github.io/modeltime/articles/nested-forecasting.html. I came across a peculiar problem when using a commonly defined recipe with date-based features on time series of differing lengths and not fully overlapping periods.
With a recipe like this:
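The original recipe code was not preserved here; based on the steps mentioned later in the thread (date-based features, ZV and CORR removal), it presumably looked something like this (a sketch — column names such as `value` and `date` are placeholders, not from the original post):

```r
library(recipes)
library(timetk)
library(modeltime)

# Sketch of the kind of recipe under discussion: date-based features plus
# filter steps that can drop different columns on different training sets.
rec <- recipe(value ~ date,
              data = extract_nested_train_split(nested_data_tbl)) %>%
  step_timeseries_signature(date) %>%                   # expand date into calendar features
  step_rm(date) %>%                                     # drop the raw date column
  step_zv(all_predictors()) %>%                         # remove zero-variance features
  step_corr(all_numeric_predictors(), threshold = 0.9)  # remove highly correlated features
```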
The training works well and models are fitted on all time series. I see from the recipes nested in the output of `modeltime_nested_fit` that not all series were fitted with the same features (I guess the ZV and CORR removal steps decided to drop different features for different series), which is OK and wanted. Unfortunately, models for some series are lacking `.calibration_data`, so I was trying to figure out why. What I found is that it works well for all series that end up with the same features as in the original recipe definition, while it fails to produce `.calibration_data` for all other series.

A simple example: I have 8 series. I build the recipe as stated above with `extract_nested_train_split(nested_data_tbl)`, which by default uses `.row_id = 1`, i.e. the first series. Let's say series 7 and 8 were trained with different feature sets (because their training periods were slightly different from those of series 1-6). Then the calculation of `.calibration_data`
would fail.

I can manually produce `new_data` using `prep` and `bake` with the recipe specifically extracted for series 7/8, and then `predict(model, new_data = ...)` works fine, e.g.:

Finally, when I create the initial recipe with `extract_nested_train_split(nested_data_tbl, .row_id = 7)`, then calibration fails for the first 6 series and works for series 7.

I don't know the implementation details well, but I think the problem is that when the prediction data for calibration is constructed, it bakes the recipe trained on the data supplied when the recipe was instantiated, not the actual training data of each individual time series. Hence it tries to predict with a model trained on one feature set using new data that has a different feature set.
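The manual workaround described above might look roughly like this (a sketch under assumptions — object names such as `nested_modeltime_tbl` and the model slot used are mine, not from the original code):

```r
library(modeltime)
library(workflows)
library(recipes)

# Pull the fitted workflow for series 7 and bake its OWN trained recipe,
# instead of the recipe that was prepped on series 1.
mtbl_7 <- extract_nested_modeltime_table(nested_modeltime_tbl, .row_id = 7)
wflw_7 <- mtbl_7$.model[[1]]          # fitted workflow for series 7
rec_7  <- extract_recipe(wflw_7)      # recipe prepped on series 7's training data
test_7 <- extract_nested_test_split(nested_data_tbl, .row_id = 7)

new_data <- bake(rec_7, new_data = test_7)
preds    <- predict(extract_fit_parsnip(wflw_7), new_data = new_data)
```

Because `rec_7` was trained on series 7's own data, the baked `new_data` has exactly the feature set the series-7 model expects, so the prediction succeeds.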
Is my understanding correct? Thanks for any feedback. :)