dedupeio / dedupe

:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
https://docs.dedupe.io
MIT License
4.14k stars 550 forks source link

Investigate a Markov Random Field approach to clustering #572

Closed fgregg closed 4 years ago

fgregg commented 7 years ago

After blocking and scoring, we get an array of predicted probabilities for pairs of records. We then need some way of combining these pair-wise probabilities into groups of records that we believe all refer to the same entity.

Right now, we take a two stage approach to this

  1. first we treat these pairwise scores as an edgelist and find the connected components
  2. for each connected components we use hierarchical clustering for further partitioning.

The hierarchical clustering approach has two main problems. The first is mainly theoretic: hierarchical clustering really assumes that there is a metric distance between points and probabilities are not a metric distance.

The second, and much more serious, problem is that this form of clustering requires a fully defined distance matrix. But, the connected components are usually not fully dense networks. Now, when we create the distance matrix, we treat missing edges as having no probability of making a match. This definitely not the right solution.

So, there are two ways I can think of doing this better.

  1. we can make the connected components fully dense. We are investigating this in https://github.com/dedupeio/dedupe/pull/552
  1. we can try a markov random field approach looking at this as a Potts Model

@bbengfort, @rebeccabilbro, @mattandahalfew, I think this could be a fun one to play with.

fgregg commented 4 years ago

superseded by #826