brunofacca opened this issue 3 years ago

Hello. Thank you for this library!

I found your idea of using an Isolation Forest for downsampling the observations passed to SHAP very interesting. I'm wondering if you have also tried HDBSCAN clustering. It is used, for example, as the default downsampling method in the interpret_community library, and it has the advantage of automatically selecting the optimal number of clusters (i.e., the optimal number of samples to best represent the dataset). There is a Python implementation of HDBSCAN in this package.

Regards, Bruno
Hi Bruno,
No worries, it was something interesting to do over the lockdown period.
It is indeed an interesting problem to choose a representative sample of your data so that you can get a reasonable approximation of the global feature importance of your model, especially when your entire dataset is probably just a sample of all the real-world possibilities that exist!
That is an interesting solution; I love the HDBSCAN algorithm. One concern I do have is that it is a pain to install; as a dependency, I always have trouble with it.
So, in their approach, do they just use one sample per cluster, or do they weight the number of samples taken from each cluster by its size relative to the entire population?
Best, Eoghan.
Hi Eoghan,
Yes, it is an interesting problem and I think that its importance is often overlooked (not in your project though).
I don't remember having any problems the last time I installed HDBSCAN using Poetry, but it's been a while so maybe the requirements changed.
I've just taken a longer look at their code and it turns out that I misunderstood the documentation. They don't really use HDBSCAN for downsampling. Here is what they do: they simply call `sklearn.utils.resample` (random sampling with replacement). Maybe I'm missing something, but random sampling does not seem like the best approach to me.
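For reference, this is essentially what their approach boils down to (a simplified sketch; the actual call sites in interpret_community differ):

```python
import numpy as np
from sklearn.utils import resample

X = np.random.rand(10_000, 20)  # placeholder for a real feature matrix

# Plain random sampling with replacement: every row has equal probability,
# so small but important regions of the feature space can easily be missed.
background = resample(X, replace=True, n_samples=100, random_state=0)
```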
Anyway, in a few weeks I'll compare the results of K-means (the default approach from the SHAP docs) vs. your Isolation Forest approach vs. HDBSCAN. I'll let you know the results.
BTW, weighting the number of samples taken from each cluster by its size relative to the entire population seems like a good idea.
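To be concrete, here is the kind of sampler I plan to try (a rough sketch; the function name and parameter values are made up and would need tuning):

```python
import numpy as np
import hdbscan

def hdbscan_downsample(X, n_samples=100, random_state=0):
    """Pick a SHAP background set by sampling each HDBSCAN cluster in
    proportion to its size. Noise points (label -1) form their own
    group, so outliers are still represented."""
    rng = np.random.default_rng(random_state)
    labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)
    chosen = []
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        # Weight by relative cluster size, keeping at least one point.
        k = max(1, round(n_samples * len(idx) / len(X)))
        chosen.extend(rng.choice(idx, size=min(k, len(idx)), replace=False))
    return X[np.array(chosen)]
```

For the K-means baseline I'd just use the `shap.kmeans` summarization shown in the SHAP docs.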
FWIW, one downside is that HDBSCAN seems to be materially (orders of magnitude) slower than an Isolation Forest.
I have not had any issues with installation, however.
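Roughly the kind of comparison I mean (a minimal sketch on synthetic data; absolute timings will vary a lot with dataset size and dimensionality):

```python
import time
import numpy as np
import hdbscan
from sklearn.ensemble import IsolationForest

X = np.random.rand(50_000, 20)  # arbitrary synthetic data

start = time.perf_counter()
IsolationForest(n_estimators=100, random_state=0).fit(X)
print(f"IsolationForest: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
hdbscan.HDBSCAN(min_cluster_size=15).fit(X)
print(f"HDBSCAN: {time.perf_counter() - start:.2f}s")
```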
Hi Guys,
Maybe the HDBSCAN install isn't as bad as I remember, haha. That "Poetry" dependency manager looks interesting though!
Yes, one of the benefits of the Isolation Forest is that it is relatively fast. Although if HDBSCAN produces a more representative sample, then the right choice would depend on the use case.
Another idea, just throwing it out there: take a random sample of the size you require from your dataset, then train a simple classifier to predict whether a row came from the sample or from the original dataset (similar to the data drift concept). If the classifier gets close to 50% accuracy, the sample is indistinguishable from your original dataset; if not, choose another sample, and repeat until you hit the threshold or some runtime limit.
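Something like this minimal sketch is what I have in mind (the function name, classifier choice, and threshold are all made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def find_representative_sample(X, n_samples=100, threshold=0.55,
                               max_tries=10, random_state=0):
    """Repeatedly draw a candidate sample and train a classifier to tell
    sample rows apart from the rest of the data. A cross-validated ROC
    AUC near 0.5 means the two are indistinguishable (same idea as
    adversarial validation / data drift detection)."""
    rng = np.random.default_rng(random_state)
    for _ in range(max_tries):
        idx = rng.choice(len(X), size=n_samples, replace=False)
        y = np.zeros(len(X), dtype=int)
        y[idx] = 1  # 1 = candidate sample, 0 = rest of the dataset
        clf = RandomForestClassifier(n_estimators=100, random_state=random_state)
        auc = cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()
        if auc < threshold:  # can't tell them apart -> accept the sample
            return X[idx]
    return X[idx]  # fall back to the last candidate
```

(I used ROC AUC rather than raw accuracy since the two classes are very unbalanced; an AUC near 0.5 plays the same role as "close to 50% accuracy".)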
Yes, if you test them, Bruno, definitely keep me in the loop; I would be interested to see the results.
Hi Eoghan,
I've been using Poetry for a while and I really like it. I come from Ruby, where people mostly don't even think about dependency management because the "default" tool (called "Bundler") "just works". Poetry was the closest thing I could find in Python.
The idea of training a classifier to predict if a row came from the sample or from the original dataset seems very interesting. If you ever experiment with it, I'd be very interested in the results.
Let's keep in touch. Feature selection is one of the most important parts, if not the most important, of the project I'm working on.
Hi @Ekeany. Today I've stumbled across an interesting piece of code related to your classifier idea so I thought I'd write it down here in case we eventually want to experiment with it.
It's in this Kaggle kernel: click the "Show hidden code" link below "Lets check a Covariate Shift of the feature. This means that we will try to distinguish whether a values correspond to a training set or to a testing set."
Have a great weekend.