ThomasBury / arfs

All Relevant Feature Selection
MIT License

Downsampling #3

Closed · brunofacca closed 3 years ago

brunofacca commented 3 years ago

Hi Thomas. Thank you for open sourcing this library, it's very useful.

I found your idea of using Isolation Forest for downsampling the observations passed to SHAP very interesting. I'm wondering if you also tried HDBSCAN clustering. It is used, for example, as the default downsampling method in the interpret_community library, and it has the advantage of automatically selecting the optimal number of clusters (i.e., the optimal number of samples to better represent the dataset). There is a Python implementation of HDBSCAN in [this package](https://hdbscan.readthedocs.io/en/latest/).
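For illustration, a minimal sketch of what HDBSCAN-based downsampling could look like (assuming `X` is a NumPy array and using the `hdbscan` package linked above; the function name, cluster size, and per-cluster sample count are arbitrary choices, not anything from this library):

```python
import numpy as np
import hdbscan

def hdbscan_downsample(X, min_cluster_size=50, per_cluster=20, seed=0):
    """Cluster X with HDBSCAN and keep a few random rows per cluster.

    Noise points get the label -1 and are sampled like any other group.
    """
    rng = np.random.default_rng(seed)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(X)
    keep = []
    for label in np.unique(labels):
        idx = np.where(labels == label)[0]
        keep.append(rng.choice(idx, size=min(per_cluster, len(idx)), replace=False))
    return X[np.concatenate(keep)]
```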

ThomasBury commented 3 years ago

> Hi Thomas. Thank you for open sourcing this library, it's very useful.
>
> I found your idea of using Isolation Forest for downsampling the observations passed to SHAP very interesting. I'm wondering if you also tried HDBSCAN clustering. It is used, for example, as the default downsampling method in the interpret_community library, and it has the advantage of automatically selecting the optimal number of clusters (i.e., the optimal number of samples to better represent the dataset). There is a Python implementation of HDBSCAN in [this package](https://hdbscan.readthedocs.io/en/latest/).

Hi Bruno, great if you find this useful!

I guess you are referring to BorutaShap, which uses IsolationForest to reduce the run time (drastically, it seems). The names and functionalities are very close, so I understand the confusion ^^. This is an interesting idea indeed; I have not implemented it yet. This downsampling could also be done "outside" the library; the difference would be that the model would then be fit on the downsampled data as well (in BorutaShap, only the computation of the SHAP values is done on the downsampled data). How would that impact the quality of the feature importance? This needs a bit (or a lot :D) of experimentation.
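For the record, here is a rough sketch of that idea (not necessarily what BorutaShap does internally): score rows with an IsolationForest and keep only the most representative ones, so that only the SHAP computation runs on the subset. `X` is assumed to be a NumPy array, and the function name and `n_keep` are placeholders:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def representative_subset(X, n_keep=2000, seed=0):
    """Return the indices of the n_keep most 'inlier-like' rows.

    The model would still be fit on the full data; only the SHAP values
    would be computed on X[keep].
    """
    iso = IsolationForest(random_state=seed).fit(X)
    scores = iso.score_samples(X)        # higher score = more typical row
    keep = np.argsort(scores)[-n_keep:]
    return keep
```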

Just to clarify (because I haven't written a proper doc yet), the differences between Leshy (the Boruta extension in this package) and BorutaShap are:

brunofacca commented 3 years ago

Hi Thomas,

I apologize for the confusion. I'll blame it on the lack of sleep due to having a newborn at home :smile:.

Thank you for the clarifications.

I'm not sure about fitting the model on downsampled data; I'd expect that to impact the quality of the results. My data is too large to run SHAP with all validation observations, so some kind of downsampling is needed. I got reasonable results with K-means (as used in the SHAP docs), but I'll also try Isolation Forest and HDBSCAN soon for comparison. If you ever implement that or any other kind of downsampling, I'd love to hear about it.
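For reference, a sketch of the K-means pattern from the SHAP docs (not necessarily my exact setup): summarize the background data with `shap.kmeans` and explain only a subset of the validation rows. `model`, `X_train`, and `X_valid` are placeholder names and the sizes are arbitrary:

```python
import shap
from sklearn.model_selection import train_test_split

def kernel_shap_with_kmeans_background(model, X_train, X_valid, k=100, n_explain=1000):
    """Kernel-SHAP with a k-means summary of the background data.

    The background is reduced to k weighted centroids and only n_explain
    validation rows are explained.
    """
    background = shap.kmeans(X_train, k)
    explainer = shap.KernelExplainer(model.predict, background)
    X_explain, _ = train_test_split(X_valid, train_size=n_explain, random_state=0)
    return explainer.shap_values(X_explain)
```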

ThomasBury commented 3 years ago

Indeed, the model quality would be impacted. For the SHAP values, the kernel-SHAP methodology seems to be OK with a few (1,000 or 10,000) samples, using k-means to summarize the data. For tree-SHAP, I need to re-read the algorithm and paper(s) to be sure that doing something similar will not impact the results.
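For illustration, a simple option for tree-SHAP would be a plain random subsample of the rows to explain, with the model still fit on the full training set. A sketch, with placeholder names and assuming `X_valid` is a NumPy array:

```python
import numpy as np
import shap

def tree_shap_on_subsample(model, X_valid, n_rows=10_000, seed=0):
    """Tree-SHAP on a random subsample of the validation rows only.

    The model is assumed to be already fit on the full training data.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_valid), size=min(n_rows, len(X_valid)), replace=False)
    return shap.TreeExplainer(model).shap_values(X_valid[idx])
```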

Anyway, if you "watch" for new releases, that day may come :smile:

brunofacca commented 3 years ago

Yes, I'll definitely keep an eye on this project :slightly_smiling_face:. When I eventually experiment with this, I'll share my progress as well.

ThomasBury commented 2 years ago

> Hi Thomas,
>
> I apologize for the confusion. I'll blame it on the lack of sleep due to having a newborn at home 😄.
>
> Thank you for the clarifications.
>
> I'm not sure about fitting the model on downsampled data; I'd expect that to impact the quality of the results. My data is too large to run SHAP with all validation observations, so some kind of downsampling is needed. I got reasonable results with K-means (as used in the SHAP docs), but I'll also try Isolation Forest and HDBSCAN soon for comparison. If you ever implement that or any other kind of downsampling, I'd love to hear about it.

Hi @brunofacca,

I (superficially) looked at sampling rows to reduce the run time (dropping useless columns was already done). If you have a large data set (say >1e6 rows) with a "few" columns (<50 or <100), then random sampling is the way to go; there is statistical theory backing up the properties of the sample vs. the population. Scikit-learn provides a bunch of random samplers (vanilla, stratified, group, group-stratified, etc.).
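For example (a sketch with scikit-learn; `X`, `y`, and `groups` are placeholders for your features, target, and group ids):

```python
from sklearn.model_selection import train_test_split, GroupShuffleSplit

# Stratified random subsample: keep 100k rows while preserving the class balance.
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=100_000, stratify=y, random_state=0
)

# Group-aware subsample: all rows of a sampled group are kept together.
gss = GroupShuffleSplit(n_splits=1, train_size=0.1, random_state=0)
keep_idx, _ = next(gss.split(X, y, groups=groups))
X_group_sub, y_group_sub = X[keep_idx], y[keep_idx]
```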

If you need even more drastic "sampling", then there are two methods:

brunofacca commented 2 years ago

Hi Thomas,

Thanks for sharing that; it sounds interesting and promising. However, I've pulled the plug on my ML project, so I haven't been using the library (or anything else ML-related) for a few months now. Thanks again for the library; it was very useful to me when I used it.