Closed NickCrews closed 2 years ago
I think I looked at it and it didn't perform as well. That said, I would be open to thinking of an architecture that made this more pluggable.
Honestly, the architecture is actually not bad for this: you just override Dedupe.cluster() in a subclass.
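To make the subclassing idea concrete, here is a minimal sketch. `ToyDedupe` is a hypothetical stand-in, not the library's real class, and the `cluster(scores, threshold)` signature and the `((id_a, id_b), score)` pair format are assumptions for illustration. The subclass swaps in connected-components clustering via union-find, which is just an example replacement, not a recommendation.

```python
from collections import defaultdict


class ToyDedupe:
    """Hypothetical stand-in for Dedupe; not the library's real implementation."""

    def cluster(self, scores, threshold=0.5):
        # Baseline: each pair scoring above the threshold is its own cluster.
        return [tuple(sorted(pair)) for pair, score in scores if score > threshold]


class ConnectedComponentsDedupe(ToyDedupe):
    """Overrides only the clustering step; everything else is inherited."""

    def cluster(self, scores, threshold=0.5):
        parent = {}

        def find(x):
            # Union-find lookup with path halving.
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        # Link the two records of every pair that scores above the threshold.
        for (a, b), score in scores:
            if score > threshold:
                parent[find(a)] = find(b)

        # Group records by their root to get the connected components.
        groups = defaultdict(list)
        for record in list(parent):
            groups[find(record)].append(record)
        return sorted(tuple(sorted(g)) for g in groups.values())


scores = [((1, 2), 0.9), ((2, 3), 0.8), ((4, 5), 0.2)]
baseline = ToyDedupe().cluster(scores)
clusters = ConnectedComponentsDedupe().cluster(scores)
```

With these toy scores, the baseline keeps the two high-scoring pairs separate, while the subclass merges records 1, 2, and 3 into one cluster.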
For the other tasks, though, things aren't very pluggable. We could make it so we actually pass in delegates for all these tasks, such as Dedupe(featurizer=my_featurizer, scorer=my_scorer, clusterer=my_clusterer), and then add a factory method, Dedupe.from_variable_definitions(), that does what happens now. Either that would be a breaking change, or we would need really awkward arg parsing in __init__().
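A rough sketch of what the delegate-based design could look like. Everything here is hypothetical: `PluggableDedupe`, the callable signatures, the `run()` method, and the toy exact-match featurizer are assumptions for illustration, not existing API. Only the constructor keywords and the `from_variable_definitions()` name come from the proposal above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PluggableDedupe:
    # Each task is a plain callable, so any one of them can be swapped out.
    featurizer: Callable  # (record_a, record_b) -> list of floats
    scorer: Callable      # (features) -> float in [0, 1]
    clusterer: Callable   # (scored_pairs, threshold) -> list of clusters

    @classmethod
    def from_variable_definitions(cls, variable_definition):
        """Factory that wires up default delegates (toy defaults, assumed)."""
        fields = [d["field"] for d in variable_definition]

        def featurizer(a, b):
            # Toy comparison: 1.0 on exact field match, else 0.0.
            return [1.0 if a[f] == b[f] else 0.0 for f in fields]

        def scorer(features):
            # Toy score: fraction of matching fields.
            return sum(features) / len(features)

        def clusterer(scored_pairs, threshold):
            # Toy clustering: keep each pair above the threshold.
            return [pair for pair, score in scored_pairs if score > threshold]

        return cls(featurizer=featurizer, scorer=scorer, clusterer=clusterer)

    def run(self, records, threshold=0.5):
        ids = sorted(records)
        scored = [
            ((a, b), self.scorer(self.featurizer(records[a], records[b])))
            for i, a in enumerate(ids)
            for b in ids[i + 1:]
        ]
        return self.clusterer(scored, threshold)


records = {
    1: {"name": "a", "city": "x"},
    2: {"name": "a", "city": "x"},
    3: {"name": "b", "city": "y"},
}
deduper = PluggableDedupe.from_variable_definitions(
    [{"field": "name", "type": "String"}, {"field": "city", "type": "String"}]
)
result = deduper.run(records)
```

The factory keeps the convenient path for existing users, while the plain constructor avoids overloading the init arguments with two different calling conventions.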
Would https://github.com/scikit-learn-contrib/hdbscan be a good candidate for replacing the current clustering algorithm?
I'm just looking at https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html. If I understand correctly, we are currently using something similar to what they have listed as Agglomerative Clustering (slightly different because they are using Ward linkage).
I think HDBSCAN has some advantages for our use case:
threshold argument to cluster()?
I think our use case is different from those examples because, for N records, the number of clusters we have scales as O(N), while they are talking about an O(1) number of clusters. Not sure how that affects things.
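To illustrate the scaling point, here is a small pure-Python sketch of threshold-cut agglomerative clustering. It assumes average linkage and a naive cubic-time merge loop, which may not match what the library actually does; the function name and the toy data are made up for the example. The point is only that the number of clusters is set by the distance threshold, not chosen up front, so it grows with the number of distinct entities.

```python
def agglomerative(distance, threshold):
    """Merge the closest clusters until the closest pair exceeds the threshold.

    `distance` is a full square matrix of pairwise distances (assumed input).
    """
    clusters = [[i] for i in range(len(distance))]

    def linkage(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(distance[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break  # nothing left that is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


def pairwise(points):
    return [[abs(p - q) for q in points] for p in points]


# Two duplicate pairs plus one singleton.
small = agglomerative(pairwise([0, 1, 50, 51, 100]), threshold=10)

# Twice the records, with the same duplicate rate, gives twice the clusters.
large = agglomerative(
    pairwise([0, 1, 50, 51, 100, 200, 201, 250, 251, 300]), threshold=10
)
```

Here 5 records yield 3 clusters and 10 records yield 6, so the cluster count tracks N rather than staying fixed.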
disclaimer: I have run exactly 0 benchmarks :)