dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication, and entity resolution.
https://docs.dedupe.io
MIT License
4.08k stars 549 forks

Why not simply use find_cliques on the graph connected by classifier probabilities to find duplicates? #908

svjack closed 3 years ago

svjack commented 3 years ago

I tried using the partition method with a threshold to find duplicates, but it does not seem to do better than simply thresholding the pair scores to build a graph and then using find_cliques to find connected components as duplicates. Can you explain why?

fgregg commented 3 years ago

It is often better.
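To make the thresholded-graph approach from the question concrete, here is a minimal sketch in plain Python. It is not dedupe's implementation: `connected_components`, the record ids, and the scores are hypothetical and chosen only for illustration. It links every pair scoring at or above a threshold and returns the connected components.

```python
from collections import defaultdict


def connected_components(scored_pairs, threshold):
    """Group record ids by linking every pair whose score >= threshold.

    Hypothetical helper for illustration; not part of dedupe.
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in scored_pairs:
        root_a, root_b = find(a), find(b)  # registers both ids
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    groups = defaultdict(set)
    for record in parent:
        groups[find(record)].add(record)
    return sorted(sorted(group) for group in groups.values())


# Made-up scores: A~B and B~C look like matches, A~C does not.
scored_pairs = [(("A", "B"), 0.9), (("B", "C"), 0.9), (("A", "C"), 0.1)]
clusters = connected_components(scored_pairs, threshold=0.5)
print(clusters)  # [['A', 'B', 'C']]
```

In this toy example the connected component merges A and C even though their direct score is only 0.1, because they are chained through B. The maximal cliques of the same thresholded graph are {A, B} and {B, C}, which overlap on B and so do not give a clean partition by themselves. A clustering step that weighs all pairwise scores within a group can behave differently on cases like this, which may be one reason it is often better.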