Closed kylegilde closed 10 months ago
Merging #669 (7756785) into main (bd4c8fc) will increase coverage by 0.03%. The diff coverage is 100.00%.
@@ Coverage Diff @@
## main #669 +/- ##
==========================================
+ Coverage 97.99% 98.03% +0.03%
==========================================
Files 100 100
Lines 3849 3861 +12
Branches 752 758 +6
==========================================
+ Hits 3772 3785 +13
Misses 28 28
+ Partials 49 48 -1
Files Changed | Coverage Δ |
---|---|
feature_engine/preprocessing/match_columns.py | 100.00% <100.00%> (+2.08%) :arrow_up: |
@solegalli, what do you think?
@solegalli, any thoughts or suggestions?
Hey @thibaultbl ! We are introducing some minor changes to this class, which, if I remember correctly, you created some time ago.
Looping you in. It would be great to have your feedback if you have a few minutes!
Thanks a lot!
Hey @kylegilde
The implementation looks good to me. Thank you for that.
And super apologies for the delayed response. I am going on holidays next Thursday, and I kind of needed to get a lot of work done before that date, so I was a bit pressed with things to get done.
Coming back to this PR: I think the main thing would be to expand the new test you added to cover more recasting scenarios, to make sure it does not accidentally introduce errors. I added a few scenarios in a separate comment. If you can think of any others, please add them.
If we don't have a test to check that the init raises an error if the new parameter is not a boolean, we need to add that one too.
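For illustration, the init check being asked for might look something like this minimal sketch (the parameter name `match_dtypes` is an assumption here; the PR may use a different name, and the real class lives in feature_engine):

```python
class MatchVariables:
    """Simplified stand-in for feature_engine's transformer, for illustration only."""

    def __init__(self, match_dtypes=False):
        # Reject anything that is not a strict boolean; isinstance with bool
        # also rejects 0/1 integers and strings like "True".
        if not isinstance(match_dtypes, bool):
            raise ValueError(
                f"match_dtypes takes only booleans True and False. "
                f"Got {match_dtypes} instead."
            )
        self.match_dtypes = match_dtypes
```

The corresponding test would then simply assert that a `ValueError` is raised when a non-boolean value is passed to the init.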
In addition, it might be good to add a line to the docstrings under the init, saying that it has the option to match dtypes as well.
And last but not least, it would be great if you could add a code demo in this file: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/preprocessing/MatchVariables.rst so that users can see how to use this transformer.
Thanks a lot for the suggestion and the implementation. I think overall these are minor changes, and then we are good to go!
Well done!
I think that I added the necessary docstring in the new line 72 and the bool check in the new line 171. I can add the demo.
Hey @kylegilde
Any chance we could have the tests and demo by Thursday morning?
I am releasing a new version that fixes some bugs and adds some new functionality. It would be great if we could make this one part of the release.
I know it's short notice and I took ages to review. But I thought I'd try :p
It's unlikely at this point.
No worries. We release it in the next round! Thank you for your reply!
@solegalli I think this is good to go. Let me know! Thanks
I accidentally closed it.
Hmmm, the failures are unrelated to my code.
I have the same problem in my PR https://github.com/feature-engine/feature_engine/pull/679; there seems to be some problem with the RareLabelEncoder.
I tried it locally with Python 3.9 and got no failed tests.
Posting the failed test here for convenience:
______________________ test_correctly_ignores_nan_in_fit _______________________
df_enc_big = var_A ... var_C
0 A ... A
1 A ... A
2 A ... A
3 A ... A
4 A ... ... G ... G
36 G ... G
37 G ... G
38 G ... G
39 G ... G
[40 rows x 3 columns]
def test_correctly_ignores_nan_in_fit(df_enc_big):
df = df_enc_big.copy()
df.loc[df["var_C"] == "G", "var_C"] = np.nan
encoder = RareLabelEncoder(
tol=0.06,
n_categories=3,
missing_values="ignore",
)
encoder.fit(df)
# expected:
frequenc_cat = {
"var_A": ["B", "D", "A", "G", "C"],
"var_B": ["A", "D", "B", "G", "C"],
"var_C": ["C", "D", "B", "A"],
}
> assert encoder.encoder_dict_ == frequenc_cat
E AssertionError: assert {'var_A': ['B...C', 'B', 'A']} == {'var_A': ['B...D', 'B', 'A']}
E Omitting 2 identical items, use -vv to show
E Differing items:
E {'var_C': ['D', 'C', 'B', 'A']} != {'var_C': ['C', 'D', 'B', 'A']}
E Use -v to get more diff
tests/test_encoding/test_rare_label_encoder.py:141: AssertionError
___________ test_correctly_ignores_nan_in_fit_when_var_is_numerical ____________
df_enc_big = var_A ... var_C
0 A ... A
1 A ... A
2 A ... A
3 A ... A
4 A ... ... G ... G
36 G ... G
37 G ... G
38 G ... G
39 G ... G
[40 rows x 3 columns]
def test_correctly_ignores_nan_in_fit_when_var_is_numerical(df_enc_big):
df = df_enc_big.copy()
df["var_C"] = (
[1] * 4 + [2] * 6 + [3] * 10 + [4] * 10 + [5] * 2 + [6] * 2 + [np.nan] * 6
)
encoder = RareLabelEncoder(
tol=0.06,
n_categories=3,
missing_values="ignore",
ignore_format=True,
)
encoder.fit(df)
# expected:
frequenc_cat = {
"var_A": ["B", "D", "A", "G", "C"],
"var_B": ["A", "D", "B", "G", "C"],
"var_C": [3, 4, 2, 1],
}
> assert encoder.encoder_dict_ == frequenc_cat
E AssertionError: assert {'var_A': ['B....0, 2.0, 1.0]} == {'var_A': ['B... [3, 4, 2, 1]}
E Omitting 2 identical items, use -vv to show
E Differing items:
E {'var_C': [4.0, 3.0, 2.0, 1.0]} != {'var_C': [3, 4, 2, 1]}
E Use -v to get more diff
tests/test_encoding/test_rare_label_encoder.py:225: AssertionError
@solegalli,
I wonder if the test failure is related to the changes in https://github.com/feature-engine/feature_engine/pull/665
CCing @GiorgioSgl
@solegalli,
I think that the test failure is related to the changes in https://github.com/feature-engine/feature_engine/pull/665. It doesn't look like that PR passed all of the CircleCI tests when it was merged.
I'll have a look
Hmm, I reran the code where I thought I had seen some data changing, but I wasn't able to reproduce it, so I think we are OK.
Hi @kylegilde
Thank you so much for adding the demo and all the tests! The demo is very well intertwined with the existing docs!
I've got some small requests for the tests, but the thing I'd like to discuss is the issue of data changing when changing the data type. What's happening exactly?
Thanks a lot for your time!
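For context on the concern about data changing when recasting, here is a small illustration using plain pandas (not the transformer itself, and the values are made up): casting floats to an integer dtype silently truncates the decimals rather than raising an error.

```python
import pandas as pd

# A float column that arrives where an integer dtype is expected
s = pd.Series([1.9, 2.5, -0.7])

# astype to int64 truncates toward zero instead of raising
recast = s.astype("int64")
print(recast.tolist())  # [1, 2, 0]
```

This is the kind of silent change worth checking for when matching dtypes automatically.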
accidentally closed the PR
@solegalli ,
My new feature is ready to go.
I have implemented all of your suggestions and responded to your questions. The CI failures seem related to files I did not change.
Hey @kylegilde
Thanks a lot for your patience while I fixed the failing tests.
I made a PR to your repo, with a rebase that should make the tests pass, plus some minor cosmetic changes to your files: https://github.com/kylegilde/feature_engine/pull/3/
If you merge over there, then it updates here and I can merge and close!
Thanks a lot for your time and patience!
all set!
Hey @kylegilde thank you for the quick turnaround.
I had a look at the PR I made on your repo, and it is still open, which means that my tiny changes were not incorporated here. Could you please merge in your repo?
Here is the link: https://github.com/kylegilde/feature_engine/pull/3/
Thank you @kylegilde !!!
The goal of this feature is to ensure that the prediction and training dtypes match.
I have seen this problem when I have trained and saved a model pipeline and then had it served from a production API. When that API calls the predict or transform method with some set of feature values, the dtypes of the API call sometimes don't match the dtypes used to train the model.
Some examples:
The API call contains datetime values as string dtypes, but the model pipeline expects a pandas datetime dtype.
A downstream user calls the API with numbers encoded as strings, e.g. "1234", but the model pipeline expects a float or integer dtype.
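A minimal sketch of the second scenario, using plain pandas (the column name and values are made up for illustration):

```python
import pandas as pd

# dtypes captured at training time
train = pd.DataFrame({"amount": [10.0, 25.5, 7.25]})

# payload from the API: the same column arrives as strings
payload = pd.DataFrame({"amount": ["1234", "5.5", "0.25"]})

print(train["amount"].dtype)    # float64
print(payload["amount"].dtype)  # object

# recasting the incoming frame to the training dtypes restores the match
recast = payload.astype(train.dtypes.to_dict())
print(recast["amount"].dtype)   # float64
```

Matching dtypes inside the transformer automates this recast so the served pipeline sees the same dtypes it was trained on.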