solegalli opened 1 year ago
Hi Soledad,
I went through the suggested TDS article. I see there is an existing library implemented by the author: https://github.com/smazzanti/mrmr. How would the implementation for the Feature-engine library work under these circumstances? I would like to contribute by adding this transformer if possible. Thanks.
is this still relevant?
Hey @MetroCat69
Yes, it'd be great to add this implementation to feature engine. Would you like to give it a go?
sure
How should we name it, especially so it isn't confused with SmartCorrelatedSelection? MRMR, MRMRSelector or MRMRSelection? (I think our transformers are called Selection instead of Selector.)
One problem that arose is that the algorithm differs depending on whether the variables are discrete or continuous. In scikit-learn, at least for mutual information, they solve it by splitting into two functions, e.g. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html and https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression. Should we do the same? Are there any better options?
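For reference, a minimal sketch of that scikit-learn split (synthetic data and parameter values are only illustrative):

```python
# scikit-learn exposes two separate mutual information estimators,
# one per problem type; each returns one MI estimate per feature.
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=0)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, random_state=0)

mi_clf = mutual_info_classif(X_clf, y_clf, random_state=0)
mi_reg = mutual_info_regression(X_reg, y_reg, random_state=0)

print(np.round(mi_clf, 3))
print(np.round(mi_reg, 3))
```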
Why mutual information? I thought MRMR was based on the F score and correlation.
No, it is more complicated than that.
In the paper the article is based on, they describe several variants of the mRMR algorithm [1]. Since I can't find how they trained the random forest classifier (which hyperparameters they used), and since, as far as I understand, RDC is not numerically stable and not supported natively in SciPy/scikit-learn, I suggest we implement either just the FCD variant or the following variants: MID, MIQ, FCD, FCQ [2] (a rough sketch of the FCQ scoring is included after the references below).
The regression F statistic (f_regression) shouldn't be used for classification problems, and the ANOVA F test (f_classif) can't be used for regression problems. See Wikipedia or, actually, they even say it in the scikit-learn docs [3].
So, we either need to create two classes (similar to what scikit-learn does), or we need to detect if the target variable is categorical or numerical. Or any other suggestions? Ψ( ̄∀ ̄)Ψ
Also, should we only consider the variables returned by the find_numerical_variables method, or should we use the method on all features always?

References:
[1] https://arxiv.org/pdf/1908.05376.pdf
[2] https://github.com/feature-engine/feature_engine/assets/78600473/ee484ed9-f372-46e1-9382-39493e28d0c6
[3] https://scikit-learn.org/stable/modules/feature_selection.html
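A rough, illustrative sketch of the FCQ scoring mentioned above (F-statistic relevance divided by mean absolute correlation redundancy), assuming a classification target; this is plain pandas/scikit-learn and not a proposed Feature-engine API:

```python
# Greedy FCQ-style selection: relevance from the ANOVA F statistic,
# redundancy from the mean absolute correlation with already-selected
# features, combined as a quotient. Names and structure are illustrative.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif


def mrmr_fcq(X: pd.DataFrame, y, k: int) -> list:
    relevance = pd.Series(f_classif(X, y)[0], index=X.columns)
    corr = X.corr().abs()
    selected, remaining = [], list(X.columns)
    for _ in range(k):
        if not selected:
            # first pick: most relevant feature, no redundancy term yet
            scores = relevance[remaining]
        else:
            redundancy = corr.loc[remaining, selected].mean(axis=1)
            scores = relevance[remaining] / redundancy
        best = scores.idxmax()
        selected.append(best)
        remaining.remove(best)
    return selected


X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(8)])
print(mrmr_fcq(X, y, k=3))  # prints the 3 selected column names
```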
I see, thank you for pointing that out.
For the transformers that can handle both classification and regression, we have an extra parameter in the init called regression, which defaults to True, if I remember correctly. The user needs to enter the type of problem they want to solve. Then, if it is regression we'd use mutual_info_regression, otherwise mutual_info_classif. I'd use sklearn's functionality and let those functions take care of continuous vs discrete features.
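A minimal sketch of that pattern (class and method names are illustrative, not the final Feature-engine API):

```python
# Illustrative only: a `regression` flag in the init decides which
# scikit-learn mutual information function computes the relevance term.
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression


class MRMRSelection:
    def __init__(self, regression=True, random_state=None):
        if not isinstance(regression, bool):
            raise ValueError("regression must be a boolean.")
        self.regression = regression
        self.random_state = random_state

    def _relevance(self, X, y):
        # sklearn handles continuous vs discrete features internally
        # via the discrete_features parameter (left at its default here).
        mi = mutual_info_regression if self.regression else mutual_info_classif
        return mi(X, y, random_state=self.random_state)
```

So, e.g., MRMRSelection(regression=False) would route to mutual_info_classif for a classification target.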
Here you have an example of a transformer that takes regression or classification, with all the checks we do in fit, plus the docstrings, plus the checks in the init: https://github.com/feature-engine/feature_engine/blob/main/feature_engine/discretisation/decision_tree.py
As per description here:
https://medium.com/towards-data-science/mrmr-explained-exactly-how-you-wished-someone-explained-to-you-9cf4ed27458b
and references therein.