After blocking and scoring, we get an array of predicted probabilities for pairs of records. We then need some way of combining these pairwise probabilities into groups of records that we believe all refer to the same entity.
Right now, we take a two-stage approach to this (sketched in code below):

1. First, we treat these pairwise scores as an edgelist and find the connected components.
2. For each connected component, we use hierarchical clustering to partition it further.
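For concreteness, here is a minimal sketch of that pipeline, assuming `networkx` and `scipy` are available. The `edges` triples, the `1 - p` distance transform, average linkage, and the 0.5 cutoff are all illustrative placeholders, not our actual data or parameters.

```python
import numpy as np
import networkx as nx
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical output of blocking + scoring:
# (record_i, record_j, predicted match probability) triples.
edges = [("a", "b", 0.95), ("b", "c", 0.90), ("x", "y", 0.85)]

# Stage 1: treat the scores as a weighted edgelist and find
# the connected components.
G = nx.Graph()
G.add_weighted_edges_from(edges)

clusters = []
for component in nx.connected_components(G):
    nodes = sorted(component)
    if len(nodes) == 1:
        clusters.append(nodes)
        continue

    # Stage 2: build a distance matrix for this component, using
    # distance = 1 - probability. Pairs that were never scored
    # default to distance 1.0 (probability 0) -- the problem
    # discussed below.
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    dist = np.ones((n, n))
    np.fill_diagonal(dist, 0.0)
    for u, v, p in G.subgraph(nodes).edges(data="weight"):
        dist[index[u], index[v]] = dist[index[v], index[u]] = 1.0 - p

    # Hierarchical clustering, cut at an illustrative distance of 0.5.
    Z = linkage(squareform(dist), method="average")
    labels = fcluster(Z, t=0.5, criterion="distance")
    for label in set(labels):
        clusters.append([nodes[i] for i in range(n) if labels[i] == label])

print(clusters)
```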
The hierarchical clustering approach has two main problems. The first is mainly theoretical: hierarchical clustering assumes a metric distance between points, and match probabilities (or one minus them) do not form a metric.
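To make that concrete, here is a toy counterexample with made-up scores: the natural transform `d = 1 - p` can violate the triangle inequality.

```python
# Hypothetical scores: a-b and b-c look like strong matches,
# while a-c looks like a weak one.
p_ab, p_bc, p_ac = 0.9, 0.9, 0.1

# Turn probabilities into "distances" via d = 1 - p.
d_ab, d_bc, d_ac = 1 - p_ab, 1 - p_bc, 1 - p_ac

# Triangle inequality fails: 0.9 > 0.1 + 0.1, so this is not a metric.
print(d_ac <= d_ab + d_bc)  # False
```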
The second, and much more serious, problem is that this form of clustering requires a fully defined distance matrix, but the connected components are usually not fully dense graphs. Currently, when we create the distance matrix, we treat missing edges as having zero probability of being a match. This is definitely not the right solution.
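A small illustration of why that default is harmful, again with made-up numbers: if blocking never produced the a-c pair, filling that entry with distance 1.0 can split a component whose records are strongly, transitively linked.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# a-b scored 0.90 and b-c scored 0.85, but the a-c pair was never
# scored, so its entry defaults to probability 0 / distance 1.0.
dist = np.array([
    [0.00, 0.10, 1.00],   # a
    [0.10, 0.00, 0.15],   # b
    [1.00, 0.15, 0.00],   # c
])

Z = linkage(squareform(dist), method="average")
print(fcluster(Z, t=0.5, criterion="distance"))
# Prints two flat clusters, e.g. [1 1 2]: c is cut away from {a, b},
# even though b ties all three records together with strong matches.
```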
So, there are two ways I can think of to do this better.
@bbengfort, @rebeccabilbro, @mattandahalfew, I think this could be a fun one to play with.