Closed 93lorenzo closed 1 month ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 98.19%. Comparing base (ac72f9d) to head (0fd27ee). Report is 3 commits behind head on main.
Hey @93lorenzo Sorry for the delayed response.
I made a PR to your repo with some changes: https://github.com/93lorenzo/feature_engine/pull/1
It turns out that this is harder than I thought.
At the moment, this is what I think could work:
If encode=raise, we just let the ordinal encoder raise the error. If encode is not raise, we let the encoder do whatever (I set it to encode, so it returns a value for unseen categories and the tree then makes a prediction on the back of that, but we don't care).
In transform, when encode is not raise, we need to check whether the dataframe contains a category that was not present in the training set (during fit).
We have the unique categories for each variable in the encodingdict parameter of the OrdinalEncoder.
What we need is a fast way to find categories in the new dataframe that are not among the keys of the corresponding variable's dictionary. We could use set(b).difference(a), where a is the categories in the encodingdict and b is the unique categories seen in that variable in the new df. If there are unseen categories, we create a boolean vector with True where the observation is unseen.
We do this for each variable, so we end up with an array of Trues and Falses with the same shape as the variables being encoded in the dataframe. Then, using pandas loc, we set the Trues to fill_value.
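A minimal sketch of the mask idea described above, under assumptions: `encoding_dict`, `fill_value`, `unseen_mask` and the toy dataframe are hypothetical stand-ins, not the encoder's actual attributes or code. It uses a set difference per variable to find unseen categories, builds the boolean mask with numpy, and sets the flagged cells through pandas loc. Whether this is actually faster than pandas isin with apply has not been measured here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the per-variable dictionaries learned in fit.
encoding_dict = {
    "colour": {"red": 0, "blue": 1},
    "size": {"S": 0, "M": 1, "L": 2},
}
fill_value = -1  # assumed placeholder for unseen categories

X_new = pd.DataFrame(
    {"colour": ["red", "green", "blue"], "size": ["M", "XL", "S"]}
)


def unseen_mask(X, encoding_dict):
    """Return a boolean DataFrame, True where the category was not seen in fit."""
    mask = pd.DataFrame(False, index=X.index, columns=list(encoding_dict))
    for var, mapping in encoding_dict.items():
        # Categories present in the new data but absent from the dictionary keys.
        unseen = set(X[var].unique()).difference(mapping)
        if unseen:
            mask[var] = np.isin(X[var].to_numpy(), list(unseen))
    return mask


mask = unseen_mask(X_new, encoding_dict)

# Encode with the learned dictionaries, then overwrite the unseen cells.
encoded = X_new.copy()
for var, mapping in encoding_dict.items():
    encoded[var] = X_new[var].map(mapping)
    encoded.loc[mask[var], var] = fill_value
```

The `if unseen:` guard means the per-row comparison only runs for variables that actually contain unseen categories; for the others the column of the mask stays all False.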
I don't think pandas.isin and apply are the fastest options.
@glevv would you know of a better way? By better I mean faster.
@93lorenzo Would you be able to pick up from where I left off and try to modify the logic in transform to create the mask array and then replace by fill_value?
Looking at it again, the changes needed are: instead of using the list of unique values that you created in fit, we'd use the one that we already have within OrdinalEncoder. And instead of using apply and isin, we'd try to use sets, difference and numpy.
Change of plans. We are tackling issue #588 together with this one, which makes our lives much easier.
Superseded by #757
closes #728 closes #588
Exposing the unseen parameter for the DecisionTreeEncoder. Replacing the encoder_ pipeline with an encodingdict that maps from categories to numerical mappings (the predictions of the tree).
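To illustrate the idea of a dictionary that maps categories straight to tree predictions, here is a rough sketch. It is not the code from the PR: the variable names, the toy data and the choice of `DecisionTreeRegressor` are assumptions made for the example. The categories are first given ordinal codes, a tree is fit on those codes, and the tree's prediction for each code is stored against the original category.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy training data (hypothetical).
X = pd.Series(["a", "a", "b", "b", "c", "c"], name="colour")
y = pd.Series([1.0, 3.0, 10.0, 12.0, 20.0, 22.0])

# Step 1: ordinal codes for each category seen in fit.
ordinal_map = {cat: code for code, cat in enumerate(sorted(X.unique()))}
codes = X.map(ordinal_map).to_frame()

# Step 2: fit a tree on the ordinal codes.
tree = DecisionTreeRegressor(random_state=0).fit(codes, y)

# Step 3: predict once per category and keep the result in a dictionary.
per_category = pd.DataFrame({"colour": list(ordinal_map.values())})
category_to_prediction = dict(zip(ordinal_map, tree.predict(per_category)))

# Transform is then a plain dictionary lookup; unseen categories become NaN
# and can be raised on, left as NaN, or filled, depending on `unseen`.
fill_value = -1.0  # assumed placeholder
X_new = pd.Series(["a", "c", "d"], name="colour")
transformed = X_new.map(category_to_prediction)
transformed_filled = transformed.fillna(fill_value)
```

With the mapping stored as a dictionary, transform no longer needs to run the tree, and unseen categories show up naturally as missing values after the lookup.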