dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

Implement connectivity constrained clustering. #285

Closed fgregg closed 9 years ago

fgregg commented 10 years ago

http://scikit-learn.org/stable/modules/clustering.html#adding-connectivity-constraints

https://github.com/scikit-learn/scikit-learn/blob/e6003bc892485756215ac4966b18720bc489e5c6/sklearn/cluster/hierarchical.py#L255

Hierarchical clustering does not handle missing data well, and has a number of performance issues: #280, #281, #279.

fgregg commented 9 years ago

@cathydeng, create a branch and swap in connectivity-constrained clustering from scikit-learn. If it's good, we'll consider re-implementing it without depending on scipy or scikit-learn.
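For concreteness, here's a minimal sketch of what swapping in scikit-learn's connectivity-constrained agglomerative clustering might look like. This assumes a scikit-learn version where precomputed distances and connectivity constraints work together (the parameter was called `affinity` before being renamed `metric` in newer releases); the toy data is illustrative, not anything in dedupe:

```python
# Minimal sketch (not dedupe's API): scikit-learn's AgglomerativeClustering
# with a precomputed distance matrix plus a connectivity constraint.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import AgglomerativeClustering

# Toy distance matrix (1 - similarity) for records A, B, C.
# (A, C) was never compared; the 1.0 there is a made-up fill.
distances = np.array([[0.0, 0.3, 1.0],
                      [0.3, 0.0, 0.7],
                      [1.0, 0.7, 0.0]])

# Connectivity: 1 only where blocking actually produced a comparison.
connectivity = csr_matrix(np.array([[1, 1, 0],
                                    [1, 1, 1],
                                    [0, 1, 1]]))

clusterer = AgglomerativeClustering(
    n_clusters=2,               # has to be fixed up front
    metric="precomputed",       # we pass distances, not feature vectors
    linkage="average",          # ward needs raw features, so use average
    connectivity=connectivity,  # only merge along observed comparisons
)
labels = clusterer.fit_predict(distances)
print(labels)  # e.g. [0 0 1]: A and B merge, C stays separate
```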

fgregg commented 9 years ago

We might want to compare this to other segmentation approaches here: http://scikit-image.org/docs/0.10.x/auto_examples/plot_segmentations.html?highlight=cluster

cathydeng commented 9 years ago

updates:

made a fix to sklearn so that agglomerative clustering works w/ a precomputed distance matrix & connectivity constraints: https://github.com/scikit-learn/scikit-learn/commit/fc51c062ce70e3e7c0083cf70ed8fc215b5535b9

b/c you need to specify the # of clusters as a parameter for agglomerative clustering, if we use it we'll have to calculate clusterings for various values of n_clusters & select the value that makes the most sense (perhaps by using the threshold parameter in the dedupe cluster method?); see the sketch below

I'll continue working on implementing agglomerative clustering w/ connectivity constraints & evaluating performance.
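To illustrate the n_clusters selection idea, here is a rough, hypothetical sketch: fit for decreasing values of n_clusters and keep the largest clustering whose within-cluster distances stay under a threshold. The function name and the brute-force sweep are made up for illustration:

```python
# Hypothetical sketch of the n_clusters sweep; pick_clustering is an
# illustrative name, not part of dedupe or scikit-learn.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def pick_clustering(distances, connectivity, threshold=0.5):
    n = distances.shape[0]
    best_labels = np.arange(n)            # fall back to all singletons
    for k in range(n - 1, 0, -1):         # fewer clusters = more merging
        model = AgglomerativeClustering(
            n_clusters=k, metric="precomputed",
            linkage="average", connectivity=connectivity,
        )
        labels = model.fit_predict(distances)
        # largest within-cluster pairwise distance for this value of k
        worst = max((distances[i, j]
                     for i in range(n) for j in range(i + 1, n)
                     if labels[i] == labels[j]), default=0.0)
        if worst > threshold:             # merged past the threshold; stop
            return best_labels
        best_labels = labels
    return best_labels
```

(For what it's worth, recent scikit-learn releases accept distance_threshold= with n_clusters=None, which would sidestep the sweep entirely.)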

fgregg commented 9 years ago

Working with @cathydeng's PR, I came to the realization that these graph segmentation/community detection paradigms do not help us address the missing data problem that originally motivated us to look for something other than hierarchical agglomerative clustering.

Here's the problem in brief. Because of the vagaries of blocking, we could have made the following two pairwise comparisons.

(A,B), 0.7
(B,C), 0.3

But we did not compare (A,C).

In order to use hierarchical agglomerative clustering we need a full similarity matrix.

```
     A     B     C
A    1   0.7     ?
B  0.7     1   0.3
C    ?   0.3     1
```

What do we put in the (A,C) cells?

Right now we put zeros. But that's not really right: the pairwise similarity between A and C is almost certainly not 0, yet we have to put something in there to proceed with clustering.
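In code, that zero-fill move looks something like this (a toy sketch of the A/B/C example above, not dedupe's actual implementation):

```python
# Toy sketch: build a dense similarity matrix from sparse pairwise scores,
# writing 0 wherever blocking never produced a comparison.
import numpy as np

ids = ["A", "B", "C"]
index = {rec: i for i, rec in enumerate(ids)}
scores = {("A", "B"): 0.7, ("B", "C"): 0.3}    # (A, C) was never compared

similarity = np.zeros((len(ids), len(ids)))    # unknown pairs default to 0
np.fill_diagonal(similarity, 1.0)
for (a, b), score in scores.items():
    similarity[index[a], index[b]] = score
    similarity[index[b], index[a]] = score

print(similarity)
# [[1.  0.7 0. ]
#  [0.7 1.  0.3]
#  [0.  0.3 1. ]]
```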

This bothered me a lot, and I was hopeful that these graph paradigms could let us avoid putting in a made-up similarity distance. But they really don't; they just move the place where we are making the assumption.

To a first approximation, these graph methods work on the edge-weight adjacency matrix. And what is the edge-weight adjacency matrix for our little example?

```
     A     B     C
A    1   0.7     0
B  0.7     1   0.3
C    0   0.3     1
```

This is because the adjacency matrix is

```
    A   B   C
A   1   1   0
B   1   1   1
C   0   1   1
```

But this is not really the right adjacency matrix. All pairs of records do have some pairwise distance, so the "real" adjacency matrix should be all ones. We have ultimately made the same move we were trying to avoid.
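To make the equivalence concrete, here's a small check (continuing the toy A/B/C example; illustrative only): masking the zero-filled similarity matrix by the adjacency matrix is a no-op, because the missing comparison was already a zero.

```python
import numpy as np

# The edge-weight matrix the graph methods see is the zero-filled similarity
# matrix masked elementwise by the adjacency matrix; the mask changes nothing
# because the uncompared (A, C) entry is already 0.
similarity = np.array([[1.0, 0.7, 0.0],
                       [0.7, 1.0, 0.3],
                       [0.0, 0.3, 1.0]])
adjacency = np.array([[1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]])
assert np.array_equal(similarity * adjacency, similarity)
```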

Thank you, Cathy, for exploring this issue and bringing concrete implementations to light so that we could finally understand that this direction does not offer what I had hoped.