dedupeio / dedupe

🆔 A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

Consider HDBSCAN as clustering algorithm #1092

Closed: NickCrews closed this issue 2 years ago

NickCrews commented 2 years ago

Would https://github.com/scikit-learn-contrib/hdbscan be a good candidate for replacing the current clustering algorithm?

I'm just looking at https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html. If I understand correctly, we are currently using something similar to what they have listed as Agglomerative Clustering (slightly different because they are using ward linkage).

I think HDBSCAN has some advantages for our use case:

I think our use case is different from those examples because, for N records, the number of clusters we have scales as O(N), while they are talking about an O(1) number of clusters. Not sure how that affects things.

disclaimer: I have run exactly 0 benchmarks :)

fgregg commented 2 years ago

i think i looked at it and it didn't perform as well. that said, i would be open to thinking of an architecture that made this more pluggable.

NickCrews commented 2 years ago

honestly, the architecture is actually not bad for this: you just override Dedupe.cluster() in a subclass.
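A minimal sketch of that subclassing idea, for illustration only. `BaseDeduper` is a hypothetical stand-in (not the real dedupe class) so the block runs without dedupe installed; the `cluster(scores, threshold)` signature and the scored-pair input shape are assumptions. The replacement clusterer here is a plain connected-components grouping via union-find, not HDBSCAN.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Assumed input shape: (record_id_a, record_id_b, match_probability)
ScoredPair = Tuple[int, int, float]
Cluster = Tuple[int, ...]


class BaseDeduper:
    """Hypothetical stand-in for a class that exposes a cluster() hook."""

    def cluster(
        self, scores: Iterable[ScoredPair], threshold: float = 0.5
    ) -> Iterator[Cluster]:
        # Placeholder default: each pair above the threshold is its own cluster.
        for a, b, score in scores:
            if score > threshold:
                yield (a, b)


class ConnectedComponentsDeduper(BaseDeduper):
    """Overrides only the clustering step; everything else is inherited."""

    def cluster(
        self, scores: Iterable[ScoredPair], threshold: float = 0.5
    ) -> Iterator[Cluster]:
        parent: Dict[int, int] = {}

        def find(x: int) -> int:
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        # Union the two records of every pair scored above the threshold.
        for a, b, score in scores:
            if score > threshold:
                parent[find(a)] = find(b)

        # Group record ids by the root of their component.
        groups = defaultdict(list)
        for record_id in list(parent):
            groups[find(record_id)].append(record_id)
        for members in groups.values():
            yield tuple(sorted(members))


pairs = [(1, 2, 0.9), (2, 3, 0.8), (4, 5, 0.95), (1, 5, 0.1)]
clusters = sorted(ConnectedComponentsDeduper().cluster(pairs, threshold=0.5))
```

With the sample pairs, records 1, 2 and 3 end up in one cluster and 4 and 5 in another; the low-scoring (1, 5) pair is ignored.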

For the other tasks, though, things aren't very pluggable. We could make it so we actually pass in delegates for all these tasks, such as Dedupe(featurizer=my_featurizer, scorer=my_scorer, clusterer=my_clusterer), and then add a factory method, Dedupe.from_variable_definitions(), that does what happens now. Either this would be a breaking change, or we would need really awkward arg parsing in __init__().
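A rough sketch of what that delegate-based design could look like. Everything here is hypothetical: `PluggableDedupe`, its `run()` method, and the callable signatures are invented for illustration, and "variable definitions" are simplified to a list of field names rather than dedupe's actual definitions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Record = Dict[str, str]
ScoredPair = Tuple[int, int, float]
Featurizer = Callable[[Record, Record], List[float]]
Scorer = Callable[[List[float]], float]
Clusterer = Callable[[Iterable[ScoredPair], float], List[Tuple[int, ...]]]


@dataclass
class PluggableDedupe:
    """Hypothetical class: each pipeline stage is an injected callable."""

    featurizer: Featurizer
    scorer: Scorer
    clusterer: Clusterer

    @classmethod
    def from_variable_definitions(cls, fields: Sequence[str]) -> "PluggableDedupe":
        """Factory that builds default delegates from field names (assumed shape)."""

        def featurizer(a: Record, b: Record) -> List[float]:
            # 1.0 when the two records agree exactly on a field, else 0.0.
            return [float(a.get(f) == b.get(f)) for f in fields]

        def scorer(features: List[float]) -> float:
            return sum(features) / len(features) if features else 0.0

        def clusterer(scored: Iterable[ScoredPair], threshold: float):
            return [(a, b) for a, b, s in scored if s > threshold]

        return cls(featurizer, scorer, clusterer)

    def run(self, records: Dict[int, Record], threshold: float = 0.5):
        ids = sorted(records)
        # Score every unordered pair of records, then hand off to the clusterer.
        scored = [
            (a, b, self.scorer(self.featurizer(records[a], records[b])))
            for i, a in enumerate(ids)
            for b in ids[i + 1:]
        ]
        return self.clusterer(scored, threshold)


records = {
    1: {"name": "Ann", "city": "Oslo"},
    2: {"name": "Ann", "city": "Oslo"},
    3: {"name": "Bob", "city": "Rome"},
}
default = PluggableDedupe.from_variable_definitions(["name", "city"])
default_clusters = default.run(records)

# Swapping a single stage: reuse the default featurizer and scorer,
# but plug in a clusterer that never merges anything.
custom = PluggableDedupe(default.featurizer, default.scorer, lambda pairs, t: [])
custom_clusters = custom.run(records)
```

The factory keeps the existing "describe your fields and go" path, while the plain constructor takes the delegates directly, which avoids overloading one `__init__()` with two calling conventions.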