ThomasBury / arfs

All Relevant Feature Selection
MIT License

Downsampling #3

Closed · brunofacca closed 3 years ago

brunofacca commented 3 years ago

Hi Thomas. Thank you for open sourcing this library, it's very useful.

I found your idea of using Isolation Forest for downsampling the observations passed to SHAP very interesting. I'm wondering if you also tried HDBSCAN clustering. It is used, for example, as the default downsampling method in the interpret_community library, and it has the advantage of automatically selecting the optimal number of clusters (i.e., the optimal number of samples to better represent the dataset). There is a Python implementation of HDBSCAN in [this package](https://hdbscan.readthedocs.io/en/latest/).
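For illustration, a minimal sketch of what HDBSCAN-based downsampling could look like (assuming `X` is a NumPy array and using the `hdbscan` package linked above; the function name, cluster size, and per-cluster sample count are arbitrary choices, not anything from this library):

```python
import numpy as np
import hdbscan

def hdbscan_downsample(X, min_cluster_size=50, per_cluster=20, seed=0):
    """Cluster X with HDBSCAN and keep a few random rows per cluster.

    Noise points get the label -1 and are sampled like any other group.
    """
    rng = np.random.default_rng(seed)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(X)
    keep = []
    for label in np.unique(labels):
        idx = np.where(labels == label)[0]
        keep.append(rng.choice(idx, size=min(per_cluster, len(idx)), replace=False))
    return X[np.concatenate(keep)]
```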

ThomasBury commented 3 years ago

> Hi Thomas. Thank you for open sourcing this library, it's very useful.
>
> I found your idea of using Isolation Forest for downsampling the observations passed to SHAP very interesting. I'm wondering if you also tried HDBSCAN clustering. It is used, for example, as the default downsampling method in the interpret_community library, and it has the advantage of automatically selecting the optimal number of clusters (i.e., the optimal number of samples to better represent the dataset). There is a Python implementation of HDBSCAN in [this package](https://hdbscan.readthedocs.io/en/latest/).

Hi Bruno, great if you find this useful!

I guess you are referring to BorutaShap, which uses IsolationForest to reduce the run time (drastically, it seems). The names and functionalities are very close, so I understand the confusion ^^. This is an interesting idea indeed; I have not implemented it yet. This downsampling could also be done "outside" the library; the difference would be that the model would then be fit on the downsampled data as well (in BorutaShap, only the computation of the SHAP values is done on the downsampled data). How would that impact the quality of the feature importance? This needs a bit (or a lot :D) of experimentation.
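For the record, here is a rough sketch of that idea (not necessarily what BorutaShap does internally): score rows with an IsolationForest and keep only the most representative ones, so that only the SHAP computation runs on the subset. `X` is assumed to be a NumPy array, and the function name and `n_keep` are placeholders:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def representative_subset(X, n_keep=2000, seed=0):
    """Return the indices of the n_keep most 'inlier-like' rows.

    The model would still be fit on the full data; only the SHAP values
    would be computed on X[keep].
    """
    iso = IsolationForest(random_state=seed).fit(X)
    scores = iso.score_samples(X)        # higher score = more typical row
    keep = np.argsort(scores)[-n_keep:]
    return keep
```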

Just to clarify (because I haven't written a proper doc yet), the differences between Leshy (the Boruta extension in this package) and BorutaShap are:

brunofacca commented 3 years ago

Hi Thomas,

I apologize for the confusion. I'll blame it on the lack of sleep due to having a newborn at home :smile:.

Thank you for the clarifications.

I'm not sure about fitting the model on downsampled data; I'd expect that to impact the quality of the results. My data is too large to run SHAP with all validation observations, so some kind of downsampling is needed. I got reasonable results with K-means (as used in the SHAP docs), but I'll also try Isolation Forest and HDBSCAN soon for comparison. If you ever implement that or any other kind of downsampling, I'd love to hear about it.
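For reference, a sketch of the K-means pattern from the SHAP docs (not necessarily my exact setup): summarize the background data with `shap.kmeans` and explain only a subset of the validation rows. `model`, `X_train`, and `X_valid` are placeholder names and the sizes are arbitrary:

```python
import shap
from sklearn.model_selection import train_test_split

def kernel_shap_with_kmeans_background(model, X_train, X_valid, k=100, n_explain=1000):
    """Kernel-SHAP with a k-means summary of the background data.

    The background is reduced to k weighted centroids and only n_explain
    validation rows are explained.
    """
    background = shap.kmeans(X_train, k)
    explainer = shap.KernelExplainer(model.predict, background)
    X_explain, _ = train_test_split(X_valid, train_size=n_explain, random_state=0)
    return explainer.shap_values(X_explain)
```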

ThomasBury commented 3 years ago

Indeed, the model quality would be impacted. For the SHAP values, the kernel-SHAP methodology seems to be OK with a few (1,000 or 10,000) samples, using k-means to summarize the data. For tree-SHAP, I need to re-read the algorithm and paper(s) to be sure that doing something similar will not impact the results.
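For illustration, a simple option for tree-SHAP would be a plain random subsample of the rows to explain, with the model still fit on the full training set. A sketch, with placeholder names and assuming `X_valid` is a NumPy array:

```python
import numpy as np
import shap

def tree_shap_on_subsample(model, X_valid, n_rows=10_000, seed=0):
    """Tree-SHAP on a random subsample of the validation rows only.

    The model is assumed to be already fit on the full training data.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_valid), size=min(n_rows, len(X_valid)), replace=False)
    return shap.TreeExplainer(model).shap_values(X_valid[idx])
```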

Anyway, if you "watch" for new releases, that day may come :smile:

brunofacca commented 3 years ago

Yes, I'll definitely keep an eye on this project :slightly_smiling_face:. When I eventually experiment with this, I'll share my progress as well.

ThomasBury commented 2 years ago

> Hi Thomas,
>
> I apologize for the confusion. I'll blame it on the lack of sleep due to having a newborn at home 😄.
>
> Thank you for the clarifications.
>
> I'm not sure about fitting the model on downsampled data; I'd expect that to impact the quality of the results. My data is too large to run SHAP with all validation observations, so some kind of downsampling is needed. I got reasonable results with K-means (as used in the SHAP docs), but I'll also try Isolation Forest and HDBSCAN soon for comparison. If you ever implement that or any other kind of downsampling, I'd love to hear about it.

Hi @brunofacca,

I (superficially) looked at sampling rows to reduce the run time (dropping useless columns was already done). If you have a large data set (say >1e6 rows) with a "few" columns (<50 or <100), then random sampling is the way to go; there is statistical theory backing up the properties of the sample vs. the population. Scikit-learn provides a bunch of random samplers (vanilla, stratified, group, group-stratified, etc.).
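For example (a sketch with scikit-learn; `X`, `y`, and `groups` are placeholders for your features, target, and group ids):

```python
from sklearn.model_selection import train_test_split, GroupShuffleSplit

# Stratified random subsample: keep 100k rows while preserving the class balance.
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=100_000, stratify=y, random_state=0
)

# Group-aware subsample: all rows of a sampled group are kept together.
gss = GroupShuffleSplit(n_splits=1, train_size=0.1, random_state=0)
keep_idx, _ = next(gss.split(X, y, groups=groups))
X_group_sub, y_group_sub = X[keep_idx], y[keep_idx]
```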

If you need even more drastic "sampling", then there are two methods:

brunofacca commented 2 years ago

Hi Thomas,

Thanks for sharing that; it sounds interesting and promising. However, I've pulled the plug on my ML project, so I haven't been using the library (or anything else ML-related) for a few months now. Thanks again for the library; it was very useful to me when I used it.