Consider HDBSCAN as clustering algorithm

NickCrews commented 2 years ago

Would https://github.com/scikit-learn-contrib/hdbscan be a good candidate for replacing the current clustering algorithm?

I'm just looking at https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html. If I understand correctly, we are currently using something similar to what they have listed as Agglomerative Clustering (slightly different because they are using ward linkage).

I think HDBSCAN has some advantages for our use case:

- Noise tolerant: agglomerative clustering partitions all of the data, while HDBSCAN can determine which samples are outliers and should therefore be singletons.
- More intuitive parameters. We could maybe drop the threshold argument to cluster()?
- Better at dealing with variable-density clusters. I think this is relevant to us? E.g. some records form super obvious clusters, while others are not nearly so obvious.
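
To make the noise-tolerance point concrete, here is a minimal sketch of how HDBSCAN-style output could be mapped onto dedupe-style clusters. It assumes the labels come from something like the `hdbscan` package run on a precomputed distance matrix, where outliers are labeled -1. The helper name `labels_to_clusters` and the sample ids are made up for illustration and are not part of dedupe.

```python
from collections import defaultdict

NOISE = -1  # label that HDBSCAN implementations assign to outliers


def labels_to_clusters(record_ids, labels):
    """Group record ids by cluster label.

    Every noise point becomes its own singleton cluster instead of
    being forced into the nearest group.
    """
    grouped = defaultdict(list)
    singletons = []
    for record_id, label in zip(record_ids, labels):
        if label == NOISE:
            singletons.append((record_id,))
        else:
            grouped[label].append(record_id)
    # Real clusters first (ordered by label), then the singletons.
    clusters = [tuple(members) for _, members in sorted(grouped.items())]
    return clusters + singletons


# Assumption: in practice the labels would come from something like
#   hdbscan.HDBSCAN(min_cluster_size=2, metric="precomputed").fit_predict(distances)
# where `distances` is 1 - pairwise match score. Hard-coded here for illustration.
record_ids = ["a", "b", "c", "d", "e"]
labels = [0, 0, -1, 1, 1]
clusters = labels_to_clusters(record_ids, labels)
```

Whether HDBSCAN labels the right records as noise on real dedupe scores is exactly what would need benchmarking.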

I think our use case is different from those examples because for N records, the number of clusters we have scales as O(N), while they are talking about an O(1) number of clusters. Not sure how that affects things.

disclaimer: I have run exactly 0 benchmarks :)

fgregg commented 2 years ago

I think I looked at it and it didn't perform as well. That said, I would be open to thinking of an architecture that made this more pluggable.

NickCrews commented 2 years ago

Honestly, the architecture is actually not bad for this: you just override Dedupe.cluster() in a subclass.

For the other tasks, though, things aren't very pluggable. We could make it so we actually pass in delegates for all these tasks, such as Dedupe(featurizer=my_featurizer, scorer=my_scorer, clusterer=my_clusterer), and then add a factory method, Dedupe.from_variable_definitions(), that does what happens now. That would either be a breaking change, or we would need really awkward argument parsing in __init__().
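
A rough sketch of what that delegate-based shape could look like. Everything here is hypothetical: `PluggableDedupe` is a stand-in rather than the real Dedupe class, the factory takes a plain list of field names instead of real variable definitions, and the default featurizer, scorer, and clusterer are toy implementations chosen only to keep the example self-contained.

```python
from itertools import combinations
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Record = Dict[str, str]
ScoredPair = Tuple[Hashable, Hashable, float]
Cluster = Tuple[Hashable, ...]


def threshold_clusterer(ids, scored, threshold=0.5):
    """Toy default: connected components over pairs scoring above threshold."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, score in scored:
        if score > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return [tuple(members) for members in groups.values()]


class PluggableDedupe:
    """Hypothetical stand-in, NOT the real dedupe.Dedupe class."""

    def __init__(self, featurizer: Callable, scorer: Callable, clusterer: Callable):
        self.featurizer = featurizer  # (record, record) -> list of floats
        self.scorer = scorer          # list of floats -> match score
        self.clusterer = clusterer    # (ids, scored pairs) -> clusters

    @classmethod
    def from_variable_definitions(cls, fields: Sequence[str]) -> "PluggableDedupe":
        """Build default delegates, mirroring today's all-in-one behavior."""

        def featurizer(a: Record, b: Record) -> List[float]:
            return [1.0 if a.get(f) == b.get(f) else 0.0 for f in fields]

        def scorer(features: List[float]) -> float:
            return sum(features) / len(features)

        return cls(featurizer, scorer, threshold_clusterer)

    def partition(self, data: Dict[Hashable, Record]) -> List[Cluster]:
        # Score every pair of records (no blocking in this toy version).
        scored: List[ScoredPair] = [
            (i, j, self.scorer(self.featurizer(data[i], data[j])))
            for i, j in combinations(data, 2)
        ]
        return self.clusterer(list(data), scored)


data = {
    1: {"name": "ann", "city": "nyc"},
    2: {"name": "ann", "city": "nyc"},
    3: {"name": "bob", "city": "la"},
}
default = PluggableDedupe.from_variable_definitions(["name", "city"])
default_clusters = default.partition(data)

# Swapping in a different clusterer only touches one argument.
custom = PluggableDedupe(
    default.featurizer,
    default.scorer,
    clusterer=lambda ids, scored: [(i,) for i in ids],
)
custom_clusters = custom.partition(data)
```

With this shape, trying HDBSCAN would mean writing one clusterer callable rather than subclassing, and the factory keeps the current construction path available.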

dedupeio / dedupe

Consider HDBSCAN as clustering algorithm #1092