Investigate a Markov Random Field approach to clustering

After blocking and scoring, we get an array of predicted probabilities for pairs of records. We then need some way of combining these pair-wise probabilities into groups of records that we believe all refer to the same entity.

Right now, we take a two-stage approach to this:

First, we treat these pairwise scores as an edge list and find the connected components.
Then, for each connected component, we use hierarchical clustering for further partitioning.
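
To make the first stage concrete, here is a minimal sketch of finding connected components from scored pairs with a union-find. It is not dedupe's actual code: the function name `connected_components` and the `scored_pairs` dict layout are assumptions made for illustration, and every scored pair is treated as an edge regardless of its score.

```python
def connected_components(scored_pairs):
    """Group records linked by any scored pair (hypothetical helper).

    scored_pairs maps (record_a, record_b) -> predicted match probability.
    """
    parent = {}

    def find(x):
        # Return the root of x's set, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b) in scored_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two sets

    components = {}
    for record in parent:
        components.setdefault(find(record), set()).add(record)
    return sorted(sorted(c) for c in components.values())


scored_pairs = {("a", "b"): 0.9, ("b", "c"): 0.8, ("d", "e"): 0.7}
components = connected_components(scored_pairs)
```

Each resulting component would then be handed to the hierarchical clustering stage on its own.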

The hierarchical clustering approach has two main problems. The first is mainly theoretical: hierarchical clustering really assumes that there is a metric distance between points, and probabilities are not a metric distance.

The second, and much more serious, problem is that this form of clustering requires a fully defined distance matrix, but the connected components are usually not fully dense networks. Right now, when we create the distance matrix, we treat missing edges as having no probability of being a match. This is definitely not the right solution.
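
A small sketch of the missing-edge problem, assuming distance is taken as `1 - probability` (an assumption for illustration, as are the name `dense_distance_matrix` and the data layout). In a chain a-b-c where only a-b and b-c were scored, the unscored a-c pair ends up at the maximum distance, as if we had evidence that the two records do not match.

```python
def dense_distance_matrix(records, scored_pairs):
    """Build a full distance matrix from a sparse set of scored pairs.

    Hypothetical helper: distance is assumed to be 1 - probability.
    """
    index = {record: i for i, record in enumerate(records)}
    n = len(records)
    # Every off-diagonal cell starts at 1.0, i.e. probability 0 of a match.
    matrix = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
    for (a, b), p in scored_pairs.items():
        i, j = index[a], index[b]
        matrix[i][j] = matrix[j][i] = 1.0 - p
    return matrix


records = ["a", "b", "c"]
scored_pairs = {("a", "b"): 0.9, ("b", "c"): 0.9}
matrix = dense_distance_matrix(records, scored_pairs)
# matrix[0][2] stays at 1.0 even though a and c were never compared.
```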

So, there are two ways I can think of doing this better.

First, we can make the connected components fully dense. We are investigating this in https://github.com/dedupeio/dedupe/pull/552

Pro: this resolves the problem of handling "missing" edges.
Con: a simple implementation can lead to massive networks, which can negate the benefits of blocking in the first place.
Con: a complicated implementation is... complicated, and I'm not sure what the right way to do this is.

Second, we can try a Markov random field approach, treating this as a Potts model.
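
As a starting point for discussion, here is one possible way to write a Potts-style energy over the scored pairs. Everything here is an assumption rather than a settled design: the log-odds edge weights, the names `potts_energy` and `best_labeling`, and the brute-force search, which is only feasible for tiny components. The point it illustrates is that unscored pairs simply contribute nothing to the energy, so no value has to be invented for missing edges.

```python
import math
from itertools import product


def potts_energy(labels, scored_pairs):
    """Energy of a cluster labeling; lower is better (hypothetical form).

    Only scored pairs contribute. Probabilities must lie strictly in (0, 1).
    """
    energy = 0.0
    for (a, b), p in scored_pairs.items():
        if labels[a] == labels[b]:
            # Assumed log-odds weight: co-clustering a likely match (p > 0.5)
            # lowers the energy, co-clustering an unlikely one raises it.
            energy -= math.log(p / (1.0 - p))
    return energy


def best_labeling(records, scored_pairs):
    """Exhaustively search all labelings; exponential, for illustration only."""
    best, best_energy = None, math.inf
    for assignment in product(range(len(records)), repeat=len(records)):
        labels = dict(zip(records, assignment))
        energy = potts_energy(labels, scored_pairs)
        if energy < best_energy:
            best, best_energy = labels, energy
    return best, best_energy


records = ["a", "b", "c", "d"]
scored_pairs = {("a", "b"): 0.9, ("b", "c"): 0.9, ("c", "d"): 0.1}
labels, energy = best_labeling(records, scored_pairs)
# a, b and c share a label; d gets its own. The a-c pair was never scored.
```

A real implementation would need an approximate inference method in place of the exhaustive search, and the choice of edge weights is open.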

@bbengfort, @rebeccabilbro, @mattandahalfew, I think this could be a fun one to play with.
