feature-engine / feature_engine

Feature engineering package with sklearn-like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

drop_correlated_features sets are not always correlated #327

Open FedericoMontana opened 2 years ago

FedericoMontana commented 2 years ago

Is your feature request related to a problem? Please describe. The class drop_correlated_features creates sets of correlated features that might not always be accurate: it assumes transitivity, which is not a property of correlation. It is also not deterministic; depending on the order of the columns, the sets might end up being different.
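To make the transitivity and ordering points concrete, here is a minimal sketch on toy data (the 0.8 threshold, the random data and the approximate correlation values are assumptions of this example; correlated_feature_sets_ is the attribute name in recent feature-engine versions):

```python
import numpy as np
import pandas as pd
from feature_engine.selection import DropCorrelatedFeatures

# Toy data: "a" is strongly correlated with "b" and with "c",
# but "b" and "c" are only weakly correlated with each other.
rng = np.random.default_rng(42)
a = rng.normal(size=10_000)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.6, size=10_000),
    "c": a + rng.normal(scale=0.6, size=10_000),
})

print(df.corr().round(2))
# Roughly: corr(a, b) ~ corr(a, c) ~ 0.86, but corr(b, c) ~ 0.74,
# so correlation above a 0.8 threshold is not transitive.

# A greedy pass over the columns (as described in this issue) groups "b"
# and "c" together through "a"; with the column order (b, c, a) instead,
# "b" and "c" would not land in the same set, so the result depends on
# the order of the columns.
sel = DropCorrelatedFeatures(threshold=0.8).fit(df)
print(sel.correlated_feature_sets_)
```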

Describe the solution you'd like Each feature in a set should be correlated with the others above the established threshold, which does not always happen with the existing implementation.

Describe alternatives you've considered You might implement hierarchical clustering to create the sets. See here for example: https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/

Additional context The solution is rather simple. @solegalli let me know if you are accepting pull requests and I can try an initial solution, in case this is indeed an issue rather than a misinterpretation on my side.

solegalli commented 2 years ago

Hi @FedericoMontana

Thanks a lot for the suggestion.

I agree that the current implementation is non-deterministic.

I also see the point that not all features within the group may be correlated with each other. However, they will be correlated with the first feature in the group, so they would, in theory, provide "redundant" information, at least with respect to that first feature.

So I suggest that, if we were to introduce a feature selection method based on hierarchical clustering, it would be by creating a completely new class, not by replacing the code in the existing one. Would you agree with this?

Do you have any reference (blog, article, video, etc) that uses hierarchical clustering for feature selection that you could link to this issue? Or how did you come up with this idea?

FedericoMontana commented 2 years ago

I don't have any resource detailing exactly this, other than the link I shared in my initial message. I'm writing a simple Python library with one class implementing this (still a work in progress): https://github.com/FedericoMontana/instrumentum/blob/master/src/instrumentum/feature_selection/correlation.py

Perhaps the class could be parametrized so the user can select the grouping method? The class could be composed with this solution from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
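As a rough sketch of what that composition could look like (group_features, the 1 - |corr| distance and the example parameters below are illustrative assumptions, not an existing feature-engine API):

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering


def group_features(X: pd.DataFrame, clusterer: AgglomerativeClustering):
    """Group features with a user-supplied AgglomerativeClustering instance.

    The clusterer is expected to be configured with metric="precomputed"
    (called affinity="precomputed" in older scikit-learn versions), so the
    1 - |correlation| matrix can be passed as the distance matrix.
    """
    distance = 1 - X.corr().abs().to_numpy()
    labels = clusterer.fit_predict(distance)
    groups = {}
    for feature, label in zip(X.columns, labels):
        groups.setdefault(label, []).append(feature)
    return list(groups.values())


# Example usage: the user decides how the clusters are formed.
# clusterer = AgglomerativeClustering(
#     n_clusters=None,
#     metric="precomputed",
#     linkage="average",
#     distance_threshold=0.2,   # i.e. group features with |corr| >= 0.8
# )
# groups = group_features(df, clusterer)
```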

solegalli commented 2 years ago

The link from stackabuse is about hierarchical clustering, but after a quick look it does not mention anything regarding using this method for feature selection.

Personally, I have not heard of using clustering to select features. That is why I would be keen on some reference showing that this idea has been successfully implemented somewhere.

The thing is, depending on the distance that you choose for the clusters, you will have more or fewer features in a group, and it is a bit hard to say, a priori, whether those features provide the same information. How would a user tune the distance D?

kylegilde commented 2 years ago

Should the set of features be a clique, where they are all correlated with each other?

https://en.wikipedia.org/wiki/Clique_(graph_theory)
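A minimal sketch of the clique idea, assuming networkx is available (correlated_cliques and the 0.8 threshold are made up for the example; this is not existing feature-engine code):

```python
import networkx as nx
import pandas as pd


def correlated_cliques(X: pd.DataFrame, threshold: float = 0.8):
    """Return maximal sets of features that are ALL pairwise correlated
    above the threshold (cliques in the correlation graph)."""
    corr = X.corr().abs()
    graph = nx.Graph()
    graph.add_nodes_from(X.columns)
    # Add an edge between every pair of features correlated above the threshold.
    for i, col_i in enumerate(X.columns):
        for col_j in X.columns[i + 1:]:
            if corr.loc[col_i, col_j] >= threshold:
                graph.add_edge(col_i, col_j)
    # Each maximal clique is a group in which every feature is correlated
    # with every other feature in the group.
    return [set(clique) for clique in nx.find_cliques(graph) if len(clique) > 1]
```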

solegalli commented 2 years ago

I am not familiar with graph theory at the moment.

I would welcome references on the use of the suggested methods for feature selection. This is kind of a prerequisite to go ahead with a class implementation.

kylegilde commented 2 years ago

What if we added an argument to the class that checked and retained only the features in the group that are correlated to all the other features at the specified threshold? I think that this would remove the non-determinism.

FedericoMontana commented 2 years ago

> What if we added an argument to the class that checked and retained only the features in the group that are correlated to all the other features at the specified threshold? I think that this would remove the non-determinism.

At some point in the past I tried this solution, but ended up disregarding it because in the end it was not as useful as I had thought. To illustrate this, say that you have feature a correlated with b, c & d (but b, c & d are not correlated with each other). With the current implementation you would end up with only 1 cluster; using the logic you described you would end up with 3 clusters. If you extrapolate this to a dataset of hundreds of features, the final number of clusters ends up being huge, with a large degree of redundancy among them. On the other hand, hierarchical clustering offers a more elegant alternative.
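Spelling out that example (purely illustrative values, not actual feature-engine output):

```python
# Only "a" is correlated with the other features above the threshold;
# "b", "c" and "d" are not correlated with each other.
correlated_pairs = {("a", "b"), ("a", "c"), ("a", "d")}

# Current greedy behaviour: one group built around "a".
greedy_groups = [{"a", "b", "c", "d"}]

# Requiring every feature in a group to be correlated with all the others
# leaves only fully connected (pairwise correlated) groups: three of them.
mutual_groups = [{"a", "b"}, {"a", "c"}, {"a", "d"}]

print(correlated_pairs, greedy_groups, mutual_groups, sep="\n")
```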

In my opinion, these are the 3 options there are for this library:

1) Leave it as-is. It seems this library leans towards heuristic methods. The current implementation, although not perfect, is quite good at what it does and will offer, most of the time, a very good local solution.

2) Move the implementation to a hierarchical clustering solution, and delegate to the user the parameters that determine how many clusters to create. Delegation could be achieved by requesting, as a constructor parameter, a clustering class from sklearn: https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering; within fit, this class would deal with the clustering based on whatever parameters the user used to initialize it. From a software engineering perspective this would be an elegant approach, but I think the idea of feature-engine is to abstract the user from certain complexities, and requiring the user to initialize the sklearn clustering class might defeat that spirit.

3) Embed the clustering logic within the feature-engine class, and allow the user to indicate with a simple parameter what they want to use, which could be: "classic" (i.e. what it does today), "number_of_clusters" (i.e. the class will end up with this number of clusters) or "threshold" (using the cophenetic distance to come up with the clusters; a minimal sketch of this last variant follows below).
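A minimal sketch of the "threshold" variant of option (3), assuming scipy is available (cluster_by_correlation and its default threshold are illustrative, not existing feature-engine code):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_by_correlation(X: pd.DataFrame, threshold: float = 0.8):
    """Group features by hierarchical clustering on the 1 - |correlation| distance."""
    distance = 1 - X.corr().abs().to_numpy()
    np.fill_diagonal(distance, 0.0)                 # remove float noise on the diagonal
    condensed = squareform(distance, checks=False)  # condensed distance vector
    Z = linkage(condensed, method="average")        # average-linkage dendrogram
    # Cut the dendrogram: features closer than 1 - threshold stay together.
    labels = fcluster(Z, t=1 - threshold, criterion="distance")
    groups = {}
    for feature, label in zip(X.columns, labels):
        groups.setdefault(label, []).append(feature)
    return list(groups.values())
```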

For (3), I have created a basic implementation of a class that does that: https://github.com/FedericoMontana/instrumentum/blob/master/src/instrumentum/feature_selection/correlation.py. I can help to enhance it and port it to feature-engine, should you be interested.

Finally, I believe Soledad asked for papers; there are plenty. This one provides a nice overview of methods: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.295.8115&rep=rep1&type=pdf. Also, R appears to have a similar implementation: https://cran.r-project.org/web/packages/clustvarsel/vignettes/clustvarsel.html

solegalli commented 2 years ago

Hi @FedericoMontana

Thanks for the info and the ideas. I am not deeply familiar with feature selection for clustering. So this is very useful.

Which of the methods in the article you linked is the equivalent of the one you are proposing?

qiaott commented 2 years ago

Hey @solegalli, I encountered the same issue when I used the SmartCorrelatedSelection class. I would also like to participate in solving this issue.

I had an idea for fixing the issue, and I prepared a notebook and put all the functions in utils.py to illustrate my solution. It would be great if you could have a look.

https://github.com/qiaott/path_to_opensource/tree/main/feature_engine_issue_327

Happy to hear any feedback and discuss it in a call.

solegalli commented 2 years ago

Thanks @qiaott and everyone else (@FedericoMontana, @kylegilde) for the discussions on this topic.

Apologies @qiaott for the late response.

To summarize, the issue that we are discussing is that the selection based on correlation (the DropCorrelatedFeatures and SmartCorrelatedSelection classes) has two problems:

- the result depends on the order of the columns, so it is not deterministic;
- not all features within a group are necessarily correlated with each other, because correlation is not transitive.

One solution is to use hierarchical clustering instead of correlation. Regarding this, I would like to see an article, video or blog where it is discussed / implemented for feature selection.

@FedericoMontana linked an article with methods to select features for clustering. That is useful, but if we were to implement those, they would be in new classes, because the logic is different from correlation.

@kylegilde suggested graph theory; again, we would need to see a resource where this is used for feature selection.

@qiaott could you please summarize in a few words how your solution resolves the issues stated above?

dlaprins commented 1 year ago

I think your two bullet points provide a nice summary of the issue. I am not aware of any solution to both problems simultaneously.

I would propose the following: when looping over features to remove, make use of an ordering of the features, with the ordering method as a parameter. The ordering can, for example, be defined by options like those I proposed in #612 (number of features collinear with a feature, correlation with the target, IV), or those @qiaott proposes (number of null entries in the feature, variance of the feature values).

This solves the problem outlined in the first bullet point, while the second bullet point would still be an open problem.
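For illustration, a minimal sketch of the ordering idea, assuming the features are ranked by the number of features they are collinear with before the greedy pass (ordered_correlation_groups and its parameters are hypothetical, not existing feature-engine code):

```python
import pandas as pd


def ordered_correlation_groups(X: pd.DataFrame, threshold: float = 0.8):
    """Greedy correlation grouping over features sorted by how many other
    features they are correlated with (ties broken by name), so the result
    no longer depends on the original column order."""
    corr = X.corr().abs()
    # Rank features: most collinear first, alphabetical as a tie-breaker.
    n_collinear = (corr >= threshold).sum() - 1   # exclude self-correlation
    order = sorted(X.columns, key=lambda c: (-n_collinear[c], c))

    groups, examined = [], set()
    for feature in order:
        if feature in examined:
            continue
        group = {feature} | {
            other for other in order
            if other not in examined
            and other != feature
            and corr.loc[feature, other] >= threshold
        }
        examined |= group
        if len(group) > 1:
            groups.append(group)
    return groups
```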