feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License
1.9k stars 311 forks

Ignore NaNs before using `OrdinalEncoder` #493

Closed datacubeR closed 1 year ago

datacubeR commented 2 years ago

Is your feature request related to a problem? `OrdinalEncoder` should accept nulls. Sometimes you don't want to impute directly, but instead rely on the built-in missing-value handling of XGBoost, LightGBM or CatBoost. Because the encoders currently require imputing first, this is not possible.

Describe the solution you'd like I like that Feature Engine forces you to impute first, but I would add some kind of default flag `ignore_nan=False` for the case where we want to use other imputation afterwards.
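To illustrate the request, here is a minimal sketch of the desired behaviour (the `ignore_nan` flag is hypothetical and not part of feature-engine; this just mimics the output an ordinal encoder would produce with it enabled):

```python
import numpy as np
import pandas as pd

# Sketch only: the mapping is learned on non-null values, and NaN passes
# through untouched so a downstream model such as LightGBM can apply its
# own missing-value handling.
X = pd.DataFrame({"colour": ["blue", "red", np.nan, "blue"]})

mapping = {"blue": 0, "red": 1}             # learned on non-null values only
X["colour_enc"] = X["colour"].map(mapping)  # unmapped NaN stays NaN

print(X["colour_enc"].tolist())
```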

Hope you find this helpful.

solegalli commented 2 years ago

Hi @datacubeR

Thanks for the suggestion.

You are certainly not the first one who'd like the encoders to support NaN. I think @david-cortes made a similar suggestion here #481, am I right?

It would be great if those interested in this functionality could upvote / like or leave a comment in any of the 2 issues to better gauge the interest in this functionality.

solegalli commented 1 year ago

Hi @glevv

We've got a few requests to allow feature-engine encoders to not raise an error when the variable has nan.

At the moment, the encoders are designed to require imputation before encoding.

I think the idea is to let the encoders encode variables even if they have nan, leaving the nan as nan. The motivation is that some algorithms, like lightgbm (not sure which else?), can handle nan out of the box.

What do you think about this? Would this be useful only for lightgbm? Something else? If just lightgbm, is this worth the effort?

And would you be happy to pick this up?

I think we should add a param in the init, `handle_missing`, defaulting to "raise" so as not to break backwards compatibility, but which users could change to "ignore" to leave nan as nan.
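A minimal sketch of that proposal (the parameter name `handle_missing` and the "raise"/"ignore" values are as suggested in this thread; the class itself is illustrative, not feature-engine's implementation):

```python
import pandas as pd

class OrdinalEncoderSketch:
    """Illustrative only: `handle_missing` with "raise"/"ignore" as
    proposed in this thread, not the released feature-engine API."""

    def __init__(self, variables, handle_missing="raise"):
        if handle_missing not in ("raise", "ignore"):
            raise ValueError("handle_missing must be 'raise' or 'ignore'")
        self.variables = variables
        self.handle_missing = handle_missing

    def fit(self, X):
        self.encoder_dict_ = {}
        for var in self.variables:
            if self.handle_missing == "raise" and X[var].isna().any():
                raise ValueError(f"{var} contains NaN: impute first")
            # learn the mapping on non-null values only
            categories = X[var].dropna().unique()
            self.encoder_dict_[var] = {c: i for i, c in enumerate(categories)}
        return self

    def transform(self, X):
        X = X.copy()
        for var, mapping in self.encoder_dict_.items():
            # .map() leaves unmapped values, including NaN, as NaN
            X[var] = X[var].map(mapping)
        return X
```

With `handle_missing="ignore"`, fitting a column containing NaN succeeds and the NaN survives `transform`; with the default, the current raise-first behaviour is preserved.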

glevv commented 1 year ago

Hello @solegalli! I think XGBoost, LightGBM, CatBoost (though in a simpler way) and HistGradientBoosting support inputs with nans. I think there were also some clustering algorithms that support nans, but nothing more.

It's more of a UX/convenience improvement, so if you have more important tasks on the roadmap, this could wait. But it's possible to extend the functionality of `handle_missing` to support ignoring nans; +1 on not changing the default value.

As for making a PR: I can try, but I'm not sure that I will have a lot of free time, since, well, end-of-year crunch and all that.

P.S. I'll start working on it on the weekend

glevv commented 1 year ago

`handle_missing` could be implemented in the base class's `transform` function, but `RareLabelEncoder`, `OneHotEncoder` and `DecisionTreeEncoder` redefine `transform`, so they will need to be updated manually. `StringSimilarityEncoder` already has this functionality and will just need its parameter names harmonized.

solegalli commented 1 year ago

Maybe it is worth exploring creating a MixIn?
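One possible shape for that idea, as a hypothetical sketch (all names here are illustrative, not feature-engine's): the mixin holds the shared NaN check, and both the base class and the encoders that override `transform` call it from one place.

```python
import pandas as pd

class HandleMissingMixin:
    """Hypothetical mixin: shared NaN handling that the base class and
    the encoders overriding transform() could all reuse."""

    def _check_missing(self, X: pd.DataFrame, variables) -> None:
        if self.handle_missing == "raise":
            for var in variables:
                if X[var].isna().any():
                    raise ValueError(f"{var} contains NaN: impute first")
        # handle_missing == "ignore": no check, NaN flows through

class DemoEncoder(HandleMissingMixin):
    def __init__(self, handle_missing="raise"):
        self.handle_missing = handle_missing

    def transform(self, X, variables):
        self._check_missing(X, variables)  # one shared call site
        return X  # a real encoder would encode here
```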

glevv commented 1 year ago

I'm not sure how it would help. My current understanding is that we need to add extra lines to the `fit` and `transform` methods that ignore nans during fitting and reassign nans in the output according to the input, but not all encoders use the base class methods. On top of that, the logic of some encoders, like `DecisionTreeEncoder`, just doesn't allow nans.
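As an example of that per-encoder work, here is a sketch of the "reassign nans according to input" step for a one-hot style transform (standalone pandas, not feature-engine's `OneHotEncoder`): `pd.get_dummies` silently emits all-zero rows for NaN, so the missing rows have to be blanked out afterwards.

```python
import numpy as np
import pandas as pd

# One-hot encode, then restore NaN in the dummy columns for rows where
# the original value was missing; without this step, NaN rows would
# come out as all zeros.
s = pd.Series(["a", "b", np.nan, "a"], name="colour")
dummies = pd.get_dummies(s, prefix="colour").astype(float)
dummies[s.isna()] = np.nan

print(dummies)
```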