feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

Create MatchDtypes transformer or functionality #645

kylegilde closed this issue 10 months ago

kylegilde commented 1 year ago

Is your feature request related to a problem? Please describe. I have encountered the problem where dtypes in the serving API do not match the dtypes that were used in training. This causes gaps in the way that the features are numerically represented in training and serving, which causes differences in model performance.

Describe the solution you'd like

I propose creating a class called MatchDtypes. It could inherit from MatchVariables and, during fit, store the feature names and dtypes as a dictionary in an attribute such as feature_dtypes_in_. In its transform method, it would cast the dtypes: X.astype(self.feature_dtypes_in_)
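A minimal sketch of what such a transformer could look like. This is not feature-engine code; the class name and the feature_dtypes_in_ attribute are the ones proposed above, and inheriting directly from the sklearn base classes (rather than from MatchVariables) is an assumption made to keep the example self-contained:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MatchDtypes(BaseEstimator, TransformerMixin):
    """Sketch: remember the training dtypes and re-apply them at transform."""

    def fit(self, X: pd.DataFrame, y=None):
        # store the dtype of each column seen during training
        self.feature_dtypes_in_ = X.dtypes.to_dict()
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # cast incoming data back to the dtypes seen during training
        return X.astype(self.feature_dtypes_in_)


# example: serving data arrives with numbers serialised as strings
train = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
serve = pd.DataFrame({"a": ["3", "4"], "b": ["2.5", "3.5"]})
out = MatchDtypes().fit(train).transform(serve)
```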

Describe alternatives you've considered Or we could just add this functionality to MatchVariables itself. This probably makes the most sense. I think that "matching variables" should be conceptually defined as having the correct feature names and correct dtypes.

solegalli commented 1 year ago

Hey @kylegilde

Could you please paste an example that triggers the problem / error in question?

kylegilde commented 1 year ago

The goal of this feature is to ensure that the prediction and training dtypes match.

I have seen this problem when I have trained and saved a model pipeline and then had that model pipeline served in a production API. When that API calls the predict or transform method with some set of feature values, sometimes the dtypes of the API call don't match the dtypes used in training the model.

Some examples:

  1. The API call contains datetime values that are string dtypes, but the model pipeline is expecting the dtype to be a Pandas datetime dtype.
  2. Some downstream user calls the API with some numbers that are saved as strings, e.g. "1234", but the model pipeline is expecting the dtype to be a float or integer.

I have always had to write code that explicitly defines the correct dtypes and then calls the astype method to fix these dtype differences.
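The manual workaround described above looks roughly like this. The column names and dtypes are illustrative, not from any real pipeline:

```python
import pandas as pd

# dtypes captured manually at training time (illustrative values)
training_dtypes = {"price": "float64", "quantity": "int64"}

# serving payload arrives with the numbers serialised as strings,
# e.g. "1234.5" instead of 1234.5
serving = pd.DataFrame({"price": ["1234.5"], "quantity": ["3"]})

# explicitly cast back to the training dtypes
fixed = serving.astype(training_dtypes)
```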

Currently, I don't know of any scikit-learn compatible transformer that will memorize the training dtypes and update these dtypes at the time of prediction. Hence, I think that our users would benefit from having a class that ensures the same dtypes are used in training and prediction.

What do you think? Should MatchVariables also ensure that the dtypes match or should that be a new class?

kylegilde commented 1 year ago

Hi @solegalli, what do you think of adding matching dtypes to MatchVariables?

kylegilde commented 1 year ago

I would love to add this functionality to the class and the unit tests.

solegalli commented 1 year ago

Hi @kylegilde

Sorry for the delay. I was on holidays the last 10 days.

I think the easiest option is to extend MatchVariables. We could add a parameter, match_dtypes=False, that preserves the current behaviour; when set to True, it would do what you describe.
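The behaviour that match_dtypes=True is meant to add, combined with the column matching MatchVariables already performs, can be sketched in plain pandas. This is a hypothetical illustration of the agreed design, not the MatchVariables API:

```python
import pandas as pd


def match_variables_and_dtypes(train: pd.DataFrame):
    """Hypothetical sketch: remember column names and dtypes at fit time,
    then align and cast incoming frames at transform time."""
    columns = list(train.columns)
    dtypes = train.dtypes.to_dict()

    def transform(X: pd.DataFrame) -> pd.DataFrame:
        # keep only the training columns, in the training order
        X = X.reindex(columns=columns)
        # cast each column back to its training dtype
        return X.astype(dtypes)

    return transform


train = pd.DataFrame({"x": [1, 2], "y": [0.1, 0.2]})
transform = match_variables_and_dtypes(train)

# serving frame: extra column, shuffled order, string-encoded numbers
serve = pd.DataFrame({"y": ["0.3"], "x": ["5"], "extra": [9]})
out = transform(serve)
```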

Go for it.

Thank you!

kylegilde commented 1 year ago

sounds good, thanks!