feature-engine / feature_engine

Feature engineering package with sklearn-like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

new transformer: feature selection using mrmr #495

Open solegalli opened 1 year ago

solegalli commented 1 year ago

As per description here:

https://medium.com/towards-data-science/mrmr-explained-exactly-how-you-wished-someone-explained-to-you-9cf4ed27458b

and references therein.
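For context, the greedy selection loop the article describes can be sketched roughly as follows. This is an illustrative FCQ-style scoring (F-statistic relevance divided by mean absolute correlation with already-selected features); the function name and details are placeholders, not the final Feature-engine API:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif


def mrmr_fcq(X: pd.DataFrame, y, k: int) -> list:
    """Sketch of greedy MRMR (FCQ variant): relevance = ANOVA F-statistic,
    redundancy = mean absolute correlation with already-selected features."""
    relevance = pd.Series(f_classif(X, y)[0], index=X.columns)
    corr = X.corr().abs()
    selected, candidates = [], list(X.columns)
    for _ in range(k):
        if not selected:
            # first pick: most relevant feature overall
            best = relevance[candidates].idxmax()
        else:
            # redundancy of each candidate w.r.t. the selected set
            redundancy = corr.loc[candidates, selected].mean(axis=1)
            best = (relevance[candidates] / redundancy.clip(lower=1e-6)).idxmax()
        selected.append(best)
        candidates.remove(best)
    return selected
```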

Manuelhrokr commented 1 year ago

Hi Soledad,

I went through the suggested TDS article. I see there is an existing library implemented by the author: https://github.com/smazzanti/mrmr. How would the implementation for Feature-engine look under these circumstances? I would like to contribute by adding this transformer if possible. Thanks.

MetroCat69 commented 3 months ago

is this still relevant?

solegalli commented 3 months ago

Hey @MetroCat69

Yes, it'd be great to add this implementation to feature engine. Would you like to give it a go?

MetroCat69 commented 3 months ago

sure

MetroCat69 commented 3 months ago

how should we name it, especially so it won't be confused with SmartCorrelatedSelection?

solegalli commented 3 months ago

MRMR or MRMRSelector or MRMRSelection (I think our transformers are called Selection instead of Selector).

MetroCat69 commented 3 months ago

one problem that arose is that the algorithm differs depending on whether the variables are discrete or continuous. scikit-learn, at least with MI, solves this by splitting into 2 classes, e.g.: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html and https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression. Should we do the same? Are there any better options?
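For reference, the scikit-learn split mentioned above looks like this in practice; both functions share the `discrete_features` parameter, and the choice between them follows the target type:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y_class = (X[:, 0] > 0).astype(int)            # discrete target -> classif variant
y_reg = X[:, 0] + 0.1 * rng.normal(size=200)   # continuous target -> regression variant

mi_c = mutual_info_classif(X, y_class, random_state=0)
mi_r = mutual_info_regression(X, y_reg, random_state=0)
# the informative first column should score highest in both cases
```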

solegalli commented 3 months ago

why mutual information? I thought MRMR was based on f score and correlation.

MetroCat69 commented 3 months ago

No, it is more complicated than that.

  1. In the paper it is based on, they have several variations of the mRMR algorithm [1]. Since I can't find how they trained the random forest classifier (which hyperparameters they used), and since, as far as I understand, RDC is not numerically stable and not supported natively in SciPy/scikit-learn, I suggest we either use just the FCD variant or the following variants: MID, MIQ, FCD, FCQ [2].

  2. Regression F statistics shouldn't be used for classification problems, and classification F tests can't be used for regression problems. See Wikipedia or, actually, they even say it in the scikit-learn docs [3].

So, we either need to create two classes (similar to what scikit-learn does), or we need to detect whether the target variable is categorical or numerical. Or any other suggestions? Ψ( ̄∀ ̄)Ψ

  3. Another thing I don't know: should we automatically restrict the method to numerical variables using the find_numerical_variables method, or should we always apply it to all features?

References: [1] https://arxiv.org/pdf/1908.05376.pdf [2] https://github.com/feature-engine/feature_engine/assets/78600473/ee484ed9-f372-46e1-9382-39493e28d0c6 [3] https://scikit-learn.org/stable/modules/feature_selection.html
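As a side note on the variants listed above: MID, MIQ, FCD and FCQ differ only in which relevance metric they use (MI* = mutual information, FC* = F-statistic/correlation) and in how relevance and redundancy are combined (*D subtracts, *Q divides). A minimal sketch of the combination step, with the scheme names taken from the paper:

```python
import numpy as np


def combine(relevance: np.ndarray, redundancy: np.ndarray, scheme: str) -> np.ndarray:
    """Combine per-candidate relevance and redundancy scores under the
    four mRMR variants discussed above."""
    if scheme in ("MID", "FCD"):   # difference schemes: relevance - redundancy
        return relevance - redundancy
    if scheme in ("MIQ", "FCQ"):   # quotient schemes: relevance / redundancy
        return relevance / np.clip(redundancy, 1e-6, None)
    raise ValueError(f"unknown scheme: {scheme}")
```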

MetroCat69 commented 3 months ago

?

solegalli commented 3 months ago

I see, thank you for pointing that out.

For the transformers that can take classification and regression, we have an extra parameter in the init called regression, which defaults to True, if I remember correctly. The user needs to enter the type of problem they want to solve. Then, if it is regression we'd use mutual_info_regression, otherwise mutual_info_classif. I'd use sklearn functionality and let those functions take care of the continuous/discrete handling.
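A minimal sketch of that pattern, assuming the parameter name `regression` described above (the class name and method are hypothetical, not the final implementation):

```python
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression


class MRMRSelection:
    """Hypothetical transformer skeleton dispatching on the `regression` flag,
    following the convention described in the comment above."""

    def __init__(self, regression: bool = True):
        # init-time check, as done in other Feature-engine transformers
        if not isinstance(regression, bool):
            raise ValueError("regression must be a boolean")
        self.regression = regression

    def _relevance(self, X, y):
        # let sklearn handle the continuous/discrete details
        mi = mutual_info_regression if self.regression else mutual_info_classif
        return mi(X, y, random_state=0)
```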

here you have an example of a transformer that takes regression or classification, with all the checks that we do in the fit, plus the docstrings, plus the checks in the init: https://github.com/feature-engine/feature_engine/blob/main/feature_engine/discretisation/decision_tree.py