feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

refactor: exposing the unseen var of the categorical encoder #729

Closed 93lorenzo closed 1 month ago

93lorenzo commented 3 months ago

closes #728 closes #588

Exposes the unseen parameter for the DecisionTreeEncoder. Replaces the encoder_ pipeline with an encodingdict that maps categories to numerical values (the predictions of the tree).

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 98.19%. Comparing base (ac72f9d) to head (0fd27ee). Report is 3 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #729   +/-   ##
=======================================
  Coverage   98.18%   98.19%
=======================================
  Files         105      105
  Lines        4072     4093   +21
  Branches      795      803    +8
=======================================
+ Hits         3998     4019   +21
  Misses         29       29
  Partials       45       45
```


solegalli commented 1 month ago

Hey @93lorenzo Sorry for the delayed response.

I made a PR to your repo with some changes: https://github.com/93lorenzo/feature_engine/pull/1

It turns out that this is harder than I thought.

At the moment, this is what I think could work:

If unseen is raise, we just let the ordinal encoder raise the error. If unseen is not raise, we let the encoder do whatever (I set it to encode, so it will return a value for unseen categories and the tree will then make a prediction on the back of that, but we don't care).

In transform, when unseen is not raise, we need to check whether the dataframe contains a category that was not present in the training set (during fit).

We have the unique categories for each variable in the encodingdict parameter of the ordinalencoder.

What we need is a fast way to find the categories in the new dataframe that are not among the keys of the corresponding variable's dictionary. We could use set(a).intersection(b), where a is the set of categories stored for the variable in the encodingdict and b is the set of unique categories seen for that variable in the new df. If there are unseen categories, we then create a boolean vector that is True where the observation is unseen.

We do this for each variable, so we end up with an array of Trues and Falses with the same shape as the part of the dataframe that holds the variables being encoded. Then, using pandas loc, we set the positions that are True to fill_value.
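The masking steps described above could look roughly like the following. This is only a sketch under assumptions, not the code from the PR: `encoding_dict`, `unseen_mask`, the toy data and the `fill_value` of -1.0 are all hypothetical, and a set difference is used to get the unseen categories directly. `isin` is still called, but only on columns where the set check already found something unseen.

```python
import pandas as pd

# Hypothetical stand-in for the learned mapping: variable -> {category: value}.
encoding_dict = {
    "colour": {"red": 0, "blue": 1},
    "size": {"S": 0, "M": 1, "L": 2},
}


def unseen_mask(df, encoding_dict):
    """Boolean frame, True where a category was not among the keys seen in fit."""
    mask = pd.DataFrame(False, index=df.index, columns=list(encoding_dict))
    for var, mapping in encoding_dict.items():
        # Compare unique values only, so the set check scales with cardinality.
        unseen = set(df[var].unique()) - set(mapping)
        if unseen:
            # Build the row-level vector only when something is actually unseen.
            mask[var] = df[var].isin(unseen)
    return mask


new_df = pd.DataFrame(
    {"colour": ["red", "green", "blue"], "size": ["S", "XL", "M"]}
)
# Made-up numbers standing in for the tree predictions returned by transform.
encoded = pd.DataFrame(
    {"colour": [0.2, 0.9, 0.4], "size": [0.1, 0.7, 0.3]}
)

mask = unseen_mask(new_df, encoding_dict)
fill_value = -1.0  # hypothetical value for unseen categories
for var in mask.columns:
    # Overwrite the prediction wherever the original category was unseen.
    encoded.loc[mask[var], var] = fill_value
```

In this toy frame only the second row ("green", "XL") is flagged, so only that row is overwritten with `fill_value`.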

I don't think pandas.isin and apply are the fastest options.

@glevv would you know of a better way? And by better I mean faster.

solegalli commented 1 month ago

@93lorenzo Would you be able to pick up from where I left off and try to modify the logic in transform to create the mask array and then replace the flagged values with fill_value?

solegalli commented 1 month ago

Looking at it again, the changes needed are: instead of calling the list of unique values that you created in fit, we'd use the one that we already have within OrdinalEncoder. And instead of using apply and isin, we'd try to use sets, difference and numpy.
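One way to combine sets, difference and numpy without apply or isin is to test membership on the unique values only and then broadcast the result back to every row through the inverse indices from `np.unique`. This is a hedged sketch, not what was merged: `unseen_flags` and its arguments are hypothetical, and it assumes the column holds mutually comparable values (for example all strings, no missing values), because `np.unique` sorts them.

```python
import numpy as np


def unseen_flags(values, known):
    """Return a boolean array, True where a value is not among the known categories."""
    values = np.asarray(values, dtype=object)
    # uniques[inverse] reconstructs the original column.
    uniques, inverse = np.unique(values, return_inverse=True)
    # Set difference on the (usually few) unique categories.
    unseen = set(uniques.tolist()) - set(known)
    unique_is_unseen = np.fromiter(
        (u in unseen for u in uniques), dtype=bool, count=len(uniques)
    )
    # Expand the per-category flags back to one flag per observation.
    return unique_is_unseen[inverse]


# Hypothetical mapping for one variable, as learned in fit.
known_categories = {"red": 0, "blue": 1}
flags = unseen_flags(["red", "green", "blue", "green"], known_categories)
```

The Python-level work is proportional to the number of distinct categories rather than the number of rows. Whether that is actually faster than `isin` on real data would need benchmarking.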

solegalli commented 1 month ago

Change of plans. We are tackling issue #588 together with this one, which makes our lives much easier.

solegalli commented 1 month ago

Superseded by #757