extend MatchVariables functionality to match dtypes as well

kylegilde commented 1 year ago

The goal of this feature is to ensure that the prediction and training dtypes match.

I have seen this problem when I have trained and saved a model pipeline and then had that model pipeline served in a production API. When that API calls the predict or transform method with some set of feature values, sometimes the dtypes of API call don't match the dtypes used in training the model.

Some examples:

The API call contains datetime values that are string dtypes, but the model pipeline is expecting the dtype to be a Pandas datetime dtype.
Some downstream user calls the API with some numbers that are saved as strings, e.g. "1234", but the model pipeline is expecting the dtype to be a float or integer.

codecov[bot] commented 1 year ago

Codecov Report

Merging #669 (7756785) into main (bd4c8fc) will increase coverage by 0.03%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #669      +/-   ##
==========================================
+ Coverage   97.99%   98.03%   +0.03%     
==========================================
  Files         100      100              
  Lines        3849     3861      +12     
  Branches      752      758       +6     
==========================================
+ Hits         3772     3785      +13     
  Misses         28       28              
+ Partials       49       48       -1

Files Changed	Coverage Δ
feature_engine/preprocessing/match_columns.py	`100.00% <100.00%> (+2.08%)`	:arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

kylegilde commented 1 year ago

@solegalli , What do you think?

kylegilde commented 1 year ago

@solegalli, Any thoughts or suggestions?

solegalli commented 1 year ago

Hey @thibaultbl ! We are introducing some minor changes to this class, which, if I remember correctly, you created some time ago.

Looping you in. It would be great to have your feedback if you have a few minutes!

Thanks a lot!

kylegilde commented 1 year ago

Hey @kylegilde

The implementation looks good to me. Thank you for that.

And super apologies for the delayed response. I am going on holidays next Thursday, and I kind of needed to get a lot of work done before that date, so I was a bit pressed with things to get done.

Coming back to this PR; I think the main thing would be to expand the new test you added to cover more scenarios of recasting to make sure that it does not accidentally incur errors. I added a few scenarios in a separate comment. If you can think of any others, please add them.

If we don't have a test to check that the init raises an error if the new parameter is not a boolean, we need to add that one too.

In addition, it might be good to add a line to the docstrings under the init, saying that it has the option to match dtypes as well.

And last but not least, it would be great if you could add a code demo in this file: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/preprocessing/MatchVariables.rst so that users can see how to use this transformer.

Thanks a lot for the suggestion and the implementation. I think overall these are minor changes, and then we are good to go!

Well done!

I think that I added the necessary docstring in the new line 72 and the bool check in the new line 171. I can add the demo.

solegalli commented 1 year ago

Hey @kylegilde

Any chance we could have the tests and demo by Thursday morning?

I am releasing a new version that fixes some bugs and adds some new functionality. it would be great if we can make this one part of the release.

I know it's short notice and I took ages to review. But I thought I'd try :p

kylegilde commented 1 year ago

Hey @kylegilde

Any chance we could have the tests and demo by Thursday morning?

I am releasing a new version that fixes some bugs and adds some new functionality. it would be great if we can make this one part of the release.

I know it's short notice and I took ages to review. But I thought I'd try :p

It's unlikely at this point.

solegalli commented 1 year ago

No worries. We release it in the next round! Thank you for your reply!

kylegilde commented 1 year ago

@solegalli I think this is good to go. Let me know! Thanks

kylegilde commented 1 year ago

I accidentally closed it.

kylegilde commented 1 year ago

Hmmm, the failures are unrelated to my code.

GiorgioSgl commented 1 year ago

Hmmm, the failures are unrelated to my code.

I have same problem in my PR https://github.com/feature-engine/feature_engine/pull/679, it's there some problem with the rare label encoder.

I have tried it locally with python3.9 and I got no failed tests.

kylegilde commented 1 year ago

Posting the failed test here for convenience:



______________________ test_correctly_ignores_nan_in_fit _______________________

df_enc_big =    var_A  ... var_C
0      A  ...     A
1      A  ...     A
2      A  ...     A
3      A  ...     A
4      A  ...     ...    G  ...     G
36     G  ...     G
37     G  ...     G
38     G  ...     G
39     G  ...     G

[40 rows x 3 columns]

    def test_correctly_ignores_nan_in_fit(df_enc_big):

        df = df_enc_big.copy()
        df.loc[df["var_C"] == "G", "var_C"] = np.nan

        encoder = RareLabelEncoder(
            tol=0.06,
            n_categories=3,
            missing_values="ignore",
        )
        encoder.fit(df)

        # expected:
        frequenc_cat = {
            "var_A": ["B", "D", "A", "G", "C"],
            "var_B": ["A", "D", "B", "G", "C"],
            "var_C": ["C", "D", "B", "A"],
        }
>       assert encoder.encoder_dict_ == frequenc_cat
E       AssertionError: assert {'var_A': ['B...C', 'B', 'A']} == {'var_A': ['B...D', 'B', 'A']}
E         Omitting 2 identical items, use -vv to show
E         Differing items:
E         {'var_C': ['D', 'C', 'B', 'A']} != {'var_C': ['C', 'D', 'B', 'A']}
E         Use -v to get more diff

tests/test_encoding/test_rare_label_encoder.py:141: AssertionError
___________ test_correctly_ignores_nan_in_fit_when_var_is_numerical ____________

df_enc_big =    var_A  ... var_C
0      A  ...     A
1      A  ...     A
2      A  ...     A
3      A  ...     A
4      A  ...     ...    G  ...     G
36     G  ...     G
37     G  ...     G
38     G  ...     G
39     G  ...     G

[40 rows x 3 columns]

    def test_correctly_ignores_nan_in_fit_when_var_is_numerical(df_enc_big):

        df = df_enc_big.copy()
        df["var_C"] = [
            1,
            1,
            1,
            1,
            2,
            2,
            2,
            2,
            2,
            2,
            3,
            3,
            3,
            3,
            3,
            3,
            3,
            3,
            3,
            3,
            4,
            4,
            4,
            4,
            4,
            4,
            4,
            4,
            4,
            4,
            5,
            5,
            6,
            6,
            np.nan,
            np.nan,
            np.nan,
            np.nan,
            np.nan,
            np.nan,
        ]

        encoder = RareLabelEncoder(
            tol=0.06,
            n_categories=3,
            missing_values="ignore",
            ignore_format=True,
        )
        encoder.fit(df)

        # expected:
        frequenc_cat = {
            "var_A": ["B", "D", "A", "G", "C"],
            "var_B": ["A", "D", "B", "G", "C"],
            "var_C": [3, 4, 2, 1],
        }
>       assert encoder.encoder_dict_ == frequenc_cat
E       AssertionError: assert {'var_A': ['B....0, 2.0, 1.0]} == {'var_A': ['B... [3, 4, 2, 1]}
E         Omitting 2 identical items, use -vv to show
E         Differing items:
E         {'var_C': [4.0, 3.0, 2.0, 1.0]} != {'var_C': [3, 4, 2, 1]}
E         Use -v to get more diff

tests/test_encoding/test_rare_label_encoder.py:225: AssertionError

kylegilde commented 1 year ago

@solegalli,

I wonder if the test failure is related to the changes in https://github.com/feature-engine/feature_engine/pull/665

CCing @GiorgioSgl

kylegilde commented 1 year ago

@solegalli,

I think that test failure is related to the changes in https://github.com/feature-engine/feature_engine/pull/665. I don't see that this PR merge passed all of the circleci tests.

solegalli commented 1 year ago

@solegalli,

I think that test failure is related to the changes in #665. I don't see that this PR merge passed all of the circleci tests.

I'll have a look

kylegilde commented 11 months ago

Hmm, I reran some code where I thought I had seen some data changing, but I wasn't able to reproduce that phenomenon, so I think we are ok.

Hi @kylegilde

Thank you so much for adding the demo and all the tests! The demo is very well intertwined with the existing docs!

I've got some small requests for the tests, but the thing I'd like to discuss is the issue of data changing when changing the data type. What's happening exactly?

Thanks a lot for your time!

kylegilde commented 11 months ago

accidentally closed the PR

kylegilde commented 11 months ago

@solegalli ,

My new feature is ready to go.

I have implemented all of your suggestions and responded to your questions. The CI failures seem related to files I did not change.

solegalli commented 10 months ago

Hey @kylegilde

Thanks a lot for the patience till I fixed the failing tests.

I made a PR to your repo, with a rebase that should make the tests pass, plus some minor cosmetic changes to your files: https://github.com/kylegilde/feature_engine/pull/3/

If you merge over there, then it updates here and I can merge and close!

Thanks a lot for your time and patience!

kylegilde commented 10 months ago

all set!

solegalli commented 10 months ago

Hey @kylegilde thank you for the quick turnaround.

I had a look at the PR I made on your repo, and it is still open, which means that my tiny changes were not incorporated here. Could you please merge in your repo?

Here is the link: https://github.com/kylegilde/feature_engine/pull/3/

solegalli commented 10 months ago

Thank you @kylegilde !!!

feature-engine / feature_engine

extend MatchVariables functionality to match dtypes as well #669

Codecov Report