Do not re-calculate lags/transforms if the df already has them

huwf commented 8 months ago

Description

Hi, two things which would be useful for me if at all possible and don't seem to be in the code that I can tell.

When calling .fit or .preprocess, I'd like it so that if the column already exists for one of the lags/dates/lag_transforms it does not try to compute it. Then I can store those features at the same time as any others, and pass this data frame to .fit or .fit_models later.
More flexibility in naming generated or transformed columns as opposed to f"lag{lag}" or the output from_build_transform_name, maybe adding aCallable` somewhere perhaps? For example, the convention I have at my job is to call a column something like fieldname_lag_24, which is different to mlforecast. This is relatively easy for me to work around, but would be nice to have if it's easy to do

Use case

I'm working within an existing internal pipeline which has its own conventions.

We have different methods for collecting data from different sources, and this returns a dataframe and adds/cuts off data for lags and windows at a different place than where the model is trained. This makes pickling and using a preprocessed model difficult

jmoralez commented 8 months ago

Hey @huwf, thanks for using mlforecast.

About the first point, you specify them in the constructor so that they're computed when predicting, but you don't want to compute them in the feature engineering step, is that it?

For the second point, would providing a function with the same signature as _build_transform_name https://github.com/Nixtla/mlforecast/blob/af2173537420545570a94f95570de5c779be5614/mlforecast/core.py#L83 to the MLForecast constructor work for you?

huwf commented 8 months ago

Hi, thanks for the quick response

For the first, yes that's correct.

For the second, yes, that's exactly what I had in mind.

Possibly also the same for the straight lags too, instead of L190. Although I just thought, I guess I could do a noop function in lag_transforms to get the same effect here I guess rather than needing two naming functions?

jmoralez commented 8 months ago

First point makes sense, I think we can implement it.

For the second one as you can see a lag is actually just the mlforecast.core._identity function, so you could definitely do something like: lag_transforms={1: [_identity]} and if you've provided a function to rename them it'd work. I think we can implement that as well.

Nixtla / mlforecast

Do not re-calculate lags/transforms if the df already has them #260

Description

Use case