feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License
1.79k stars 304 forks source link

Parametrize unseen for the OrdinalEncoder in DecisionTreeEncoder #728

Closed 93lorenzo closed 1 month ago

93lorenzo commented 4 months ago

I started to use the DecisionTreeEncoder for categorical variables. Lately I ran into an exception and I found out it was about when in there is an unseen category. I understand this as a default behaviour, but I was thinking that having the flexibility to decide that in the constructor can be helpful.

To give more context: From the code I saw that the OrdinalEncoder has by default the unseen variable set to "raise"

        cat_encoder = OrdinalEncoder(
            encoding_method=self.encoding_method,
            variables=variables_,
            missing_values="raise",
            ignore_format=self.ignore_format,
            unseen="raise",
        )

Suggestion

I would like to suggest to make that param something more flexible having it in the constructor


    def __init__(
        self,
        encoding_method: str = "arbitrary",
        cv=3,
        scoring: str = "neg_mean_squared_error",
        param_grid: Optional[dict] = None,
        regression: bool = True,
        random_state: Optional[int] = None,
        variables: Union[None, int, str, List[Union[str, int]]] = None,
        ignore_format: bool = False,
        # added var
        unseen: str = 'raise',
    ) -> None:

        super().__init__(variables, ignore_format)
        self.encoding_method = encoding_method
        self.cv = cv
        self.scoring = scoring
        self.regression = regression
        self.param_grid = param_grid
        self.random_state = random_state
        self.unseen = unseen

[ ... other code ...]

   cat_encoder = OrdinalEncoder(
            encoding_method=self.encoding_method,
            variables=variables_,
            missing_values="raise",
            ignore_format=self.ignore_format,
            unseen=self.unseen,
        )

Of course there are alternatives, in my project I am currently overriding the function to take into account that behaviour, but I think that this flexibility might help others . In case it is found a reasonable change I would be happy to help on making the change

glevv commented 3 months ago

Sounds reasonable to me.

I will check if it's doable and get back to you.

solegalli commented 3 months ago

I am not sure I understand the suggestion or the problem. At the moment, we offer options about what to do with unseen categories. What's wrong with that functionality? What else would the transformer need to do?

The init parameters need to be immutable to be compatible with the sklearn API, so we can't really add functionality there, we need to accept them as the user enters them.

Anyhow, I am sure I am missing something here, so would be great if you guys could enlighten me :)

glevv commented 3 months ago

DecisionTreeEncoder does not expose "unseen" argument to the user and instead uses constant value "raise". Proposition is to expose it to the user.

93lorenzo commented 3 months ago

First of all, thanks a lot for the comments! 🙏 I will try to explain more about why I raised this this issue.

I was using the transformer in a training pipeline. When there was an unseen category in one of the split, it was throwing the error related due to one of categories was unseen.

solegalli commented 3 months ago

Sorry, I get it know. I forgot that we do not expose that in the TreeEncoder.. By all means, we should certainly change that! If any of you wants to volunteer, that'd be a major help! Thank you!

93lorenzo commented 3 months ago

Thanks!! I volunteer, I will do an PR this week

93lorenzo commented 3 months ago

Hello there! I created the PR for this change !

To make the tests successful I had to change also tests/test_selection/test_single_feature_performance.py

Please have a look and let me know if there is anything to change or adjust! :)