dedupeio / dedupe

A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
https://docs.dedupe.io
MIT License
4.13k stars 551 forks

Streaming matching #195

Closed fgregg closed 10 years ago

fgregg commented 10 years ago

Extend dedupe to handle duplicate identification on an N+1 basis, allowing for on-the-fly detection? I.e., given a known set of contact data with N rows that has already undergone blocking, determine whether a new, N+1th record is a duplicate of a record in N.

I can see two broad approaches. In the first, for every cluster of duplicate records we either choose or construct a 'representative' record. The task is then to match the new record against one of these 'representative' records. If we do match, we assign the new record to the cluster.

In the second approach, we keep track of all the data and their current cluster assignments. When we get a new record, we recluster the relevant records. This approach would require that we keep more information around, and it would be more computationally expensive. It would also be possible that adding a new record results in a currently clustered record being removed from its cluster. It also seems like this latter approach could be a lot more accurate.
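To make the first approach concrete, here is a minimal Python sketch. The representative construction, the similarity function, and the threshold are all illustrative stand-ins, not dedupe's actual learned scoring:

```python
from collections import Counter

def representative(cluster):
    """Construct a representative record by taking the most common
    value for each field across the cluster's members."""
    fields = cluster[0].keys()
    return {f: Counter(r[f] for r in cluster).most_common(1)[0][0]
            for f in fields}

def similarity(a, b):
    """Toy field-overlap similarity; dedupe would use a trained model."""
    shared = sum(1 for f in a if a[f] == b.get(f))
    return shared / len(a)

def assign(new_record, clusters, threshold=0.5):
    """Add new_record to the best-matching cluster, or start a new
    singleton cluster if no representative scores above threshold."""
    best, best_score = None, threshold
    for cluster in clusters:
        score = similarity(representative(cluster), new_record)
        if score >= best_score:
            best, best_score = cluster, score
    if best is not None:
        best.append(new_record)
    else:
        clusters.append([new_record])
    return clusters
```

Under this scheme, clusters can only grow or be born; no existing assignment is ever revisited.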

Thoughts @emphanos, @nikitsaraf, @markhuberty, @michaelwick?

markhuberty commented 10 years ago

Which option you take seems to depend on the use case. Option 2 introduces some instability in prior data that may be undesirable for use cases that depend on a stable historical record; or for use cases where I think I have a very good handle on the "representative" records, but not on future updates. But option 2 might be better for use cases designed to generate routine snapshots, where the data are consistent within a snapshot but not between snapshots.


fgregg commented 10 years ago

@markhuberty, it should be straightforward to keep track of all the clusters we have ever assigned a record to. Basically, a kind of version control for cluster assignments.

Assuming that was done, would option 1 still be desirable for some cases?

emphanos commented 10 years ago

@fgregg In the case of (1), how do you see the creation of a representative record for a given cluster? Are you thinking of picking the two records that are the largest distance apart within a given cluster (based on some metric), plus an average record, and then matching against all three as a bracket? Or simply picking an average record, assigning it to that cluster, and running the analysis against the matching cluster(s)? There might be more than one cluster the N+1 record matches with.

fgregg commented 10 years ago

@emphanos, good question, but I think it is a bit further down the line.

Right now, I'd like to just take up the question of whether we want the possible behaviors after seeing the N+1th record to be:

A. add it to an existing cluster
B. create a new, singleton cluster

OR

A. add it to an existing cluster
B. split an existing cluster and add the record to one of the new clusters
C. create a new, singleton cluster


emphanos commented 10 years ago

@fgregg What are the up-sides and down-sides of creating multiple singleton clusters? I'm assuming that for most online applications cluster recomputing would be feasible (advisable) on an ongoing basis, but not feasible when data is coming in during peak times. So it might make sense to tackle this question based on specific use cases. The always-online use case would have +1 data coming in at all times that needs deduping. The more common scenario is probably one where N gets clustered nightly or within a set time interval; in the interim, up to M records are clustered on the fly as N+1, until the next full clustering is computed for (N+M), with online clustering resuming at (N+M)+1.


fgregg commented 10 years ago

@emphanos, sorry for not being clear enough

I'm trying to distinguish between two broad classes of behavior.

In the first class, clusters can only grow or be born. I.e., if we see a new record, we either add it to an existing cluster, or, if it doesn't match any existing cluster, we start a new cluster with this record as the sole element.

In the second class, clusters can also split. I.e., if we see a new record, we recluster this record with similar records. This can result in (a) the record being added to an existing cluster, (b) the splitting of an existing cluster into two smaller clusters, one of which will contain the new record, or (c) the creation of a new cluster with this record as the sole element.
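A minimal sketch of this second class, assuming a placeholder `match()` predicate in place of dedupe's learned pairwise classifier: gather the new record together with every record it might touch, and recluster the whole set from scratch as connected components of the match graph.

```python
from itertools import combinations

def recluster(records, match):
    """Group records into connected components of the pairwise
    match graph, using a simple union-find over object ids."""
    parent = {id(r): id(r) for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if match(a, b):
            parent[find(id(a))] = find(id(b))

    groups = {}
    for r in records:
        groups.setdefault(find(id(r)), []).append(r)
    return list(groups.values())
```

Because the affected set is reclustered from scratch, a previously clustered record can land in a different component than its old cluster-mates — that is, an existing cluster can split.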

Make sense?


michaelwick commented 10 years ago

Inevitably the coreference system will have errors, many of which occur due to the system's current lack of information about the authors. New incoming author mentions are likely to provide the system with new information about its current hypothesis and ultimately enable the system to correct some of its past mistakes. As a hypothetical illustration: imagine that your database contains two author mentions that actually refer to the same author, but the coreference system predicts they refer to different authors because the two mentions' co-author lists do not overlap. Now suppose you observe a third mention of the real-world author that contains co-authors overlapping with the co-author lists of both of the original two mentions. The former system must choose between the two original author mentions, but the latter might be able to rectify the original recall error and merge all three mentions into the same cluster (based on the new co-author evidence).

We have actually observed this phenomenon (allowing a system to change its mind improves accuracy as new mentions arrive) on both author coreference and cross-document noun-phrase coreference. For example, see Table 3 in our workshop paper http://people.cs.umass.edu/~sameer/files/linking-akbc13.pdf, in which we report the accuracy of the original "database" as new mentions arrive and the system changes its mind about past decisions. Generally, the accuracy improves, although I can imagine scenarios in which new ambiguous mentions might confuse the system and cause errors.

Ultimately this boils down to a speed/accuracy trade-off: the first system being faster, and the second system being more accurate.

fgregg commented 10 years ago

http://aclweb.org/anthology//C/C10/C10-2121.pdf
http://link.springer.com/chapter/10.1007/978-3-642-28569-1_5
http://nlp.cs.rpi.edu/kbp/2014/elreading.html
http://www.cs.utexas.edu/~ml/papers/encyc-eacl-06.pdf

chancyk commented 10 years ago

@fgregg, I am looking to implement a system that is almost a hybrid of the two scenarios you've mentioned once user interaction is considered: we will operate within scenario 2 if a user has never validated the records, but if a user has validated a cluster of records as correct, then that cluster must be locked against splitting, though it may still grow (scenario 1). The other side effect of user validation seems to be a set of exclusion pairs: if a user excludes one record R out of a cluster of N records as incorrect, then record R must never again be clustered with the N-1 remaining records.

Thus the clustering algorithm would need knowledge of both the clusters that may only grow and the excluded pairs that must never cluster with each other.
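Those two constraints might be represented like this; the data shapes (locked cluster IDs, excluded ID pairs) are illustrative, not an existing dedupe API:

```python
def can_join(record_id, cluster_member_ids, excluded_pairs):
    """A record may join a cluster only if it is not excluded
    against any current member of that cluster."""
    return all(frozenset((record_id, member)) not in excluded_pairs
               for member in cluster_member_ids)

def may_split(cluster_id, locked_cluster_ids):
    """A cluster validated by a user is locked: it may grow
    but never split."""
    return cluster_id not in locked_cluster_ids
```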

We're also considering canonicalization which you previously referenced in this issue.

fgregg commented 10 years ago

Cool. Let's chat about this soon.


chancyk commented 10 years ago

So I've been looking at the current clustering algorithm to see if there's a way to structure the input data such that the streaming behavior is achieved. I was able to make a few observations:

  1. The cluster ID is always the smallest ID in a cluster.
  2. A record can only join a cluster if it has a score with each record in the existing cluster.
  3. A single high scoring pair can steal a record from a larger potential cluster.

Given the above, fastcluster.linkage can be used to grow locked clusters by passing in only new data that has a score for each record in an existing cluster, and where the new data has IDs larger than those of the existing records. This will preserve the cluster IDs and ensure that comparisons that would remove records from a cluster are not performed.
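Observation 2 amounts to a "fully connected" precondition on the new data, which can be sketched as follows (the score-matrix representation here is illustrative):

```python
def fully_connected(new_id, cluster_member_ids, scores):
    """True if the new record has an observed score against every
    member of the existing cluster. `scores` maps an unordered id
    pair (a frozenset) to a similarity score."""
    return all(frozenset((new_id, member)) in scores
               for member in cluster_member_ids)
```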

A second pass could then be performed excluding any records in a locked cluster to allow the clustering algorithm to re-cluster anything at will. The problem here seems to be in how to express what is happening to the clusters, especially if the clustering is being used to persist an entity between runs, as a new cluster may end up containing multiple records whose IDs were previously Cluster IDs, as in the following example (not based on actual scores):

A new record, 'Will Smith', has been introduced with no Cluster ID, as record ID 7.

| ID | Locked | Cluster ID | Name |
|----|--------|------------|------|
| 1 | 1 | 1 | Bill Smith |
| 2 | 0 | 2 | William Smith Jr |
| 3 | 0 | 2 | William Smith |
| 4 | 0 | 4 | Johnny Will Smith |
| 5 | 1 | 1 | Will John Smith |
| 6 | 0 | 6 | Willard Smith |
| 7 | 0 | | Will Smith |

After a dedupe run, records 3, 4, 6, and 7 are now joined into a cluster with ID 3. Two previously existing Cluster IDs, 4 and 6, have been dissolved in favor of a new ID that did not previously exist, 3. Maybe 'Will John Smith' was also a match, but it was excluded due to being previously locked.

| ID | Locked | Cluster ID | Name |
|----|--------|------------|------|
| 1 | 1 | 1 | Bill Smith |
| 2 | 0 | 2 | William Smith Jr |
| 3 | 0 | 3 | William Smith |
| 4 | 0 | 3 | Johnny Will Smith |
| 5 | 1 | 1 | Will John Smith |
| 6 | 0 | 3 | Willard Smith |
| 7 | 0 | 3 | Will Smith |

This seems like business logic, but in this situation would we want to re-point everything to 3, or choose either 4 or 6 as having grown? There's no real significance to which record happened to be the one that shares an ID with the Cluster ID, which might imply different behavior than if we were building a canonical record that represented the cluster.
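One possible policy for the re-pointing question, sketched here as a proposal rather than current dedupe behavior: keep the smallest-record-ID convention for the merged cluster and emit a rename map from each superseded cluster ID to the survivor, so anything persisting entities between runs can follow the change.

```python
def merge_clusters(member_ids, old_cluster_ids):
    """Pick the new cluster ID by the existing smallest-ID convention
    and record which old cluster IDs it supersedes."""
    new_id = min(member_ids)
    renames = {old: new_id for old in old_cluster_ids if old != new_id}
    return new_id, renames
```

For the example above, merging records 3, 4, 6, and 7 out of old clusters 4 and 6 yields cluster ID 3 with renames {4: 3, 6: 3}.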

fgregg commented 10 years ago

Is point 2 really true? I'm not sure it is.

chancyk commented 10 years ago

I'm out on vacation but I'll confirm point 2 next week.

chancyk commented 10 years ago

@fgregg, the notebook below displays the behavior I was referring to with point 2.

http://nbviewer.ipython.org/gist/chancyk/1a14425a35e9dd048a8b

webmaven commented 10 years ago

For my use-cases, @fgregg's first behavior is preferable, with a few caveats:

  1. It may be useful to require user intervention to decide whether to join an existing cluster or create a new one if a new record matches an existing cluster but very poorly (this is a very likely scenario with messy data sets that have partial records)
  2. The canonical record is in some sense the union of data from all records in the cluster
  3. Where a 'name' exists independent of an id, the names (once normalized) should be added to a list of aliases on the canonical record
  4. While canonical names are unique to a cluster, aliases are not necessarily unique to a cluster. An alias may in fact match (or strongly resemble) a different cluster's canonical name (this would be rare).
  5. It is possible that a user would decide at some point to merge two existing clusters, which would also require merging the canonical records or creating a new canonical record from the records in the merged cluster
  6. Splitting clusters is a low-probability event for my use-case (except perhaps when a user has mistakenly merged clusters, in which case undoing the merge would likely be a better solution)
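Caveats 2-4 could be sketched roughly as follows; the field names and the normalization are illustrative placeholders:

```python
def canonicalize(cluster):
    """Build a canonical record as the union of the cluster's fields,
    collecting each member's normalized name into an alias list."""
    canonical = {'aliases': []}
    for record in cluster:
        for field, value in record.items():
            if field == 'name':
                alias = value.strip().lower()  # toy normalization
                if alias not in canonical['aliases']:
                    canonical['aliases'].append(alias)
            else:
                # union of data: keep the first value seen per field
                canonical.setdefault(field, value)
    return canonical
```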
fgregg commented 10 years ago

Thanks very much for this, @chancyk; you really helped illuminate something about the clustering that I had not appreciated.

It really seems like point 2 should not always be true. In particular, looking at your first example, I think that '4' should be included with '1', '2', and '3'. The reason this is happening is that when we don't observe a link between 1 and 4, we fill that cell in the distance matrix with a 0. This is not a very good way to handle missing data.

I've been growing increasingly dissatisfied with how we are doing the clustering, and I'd like to look into implementing connectivity-constrained clustering as in #285.

chancyk commented 10 years ago

@fgregg, for point 2 with the current clustering approach, I was planning on performing the usual blocking pass to find related clusters (since everything but the new records have already been clustered), then performing a second pass to force a comparison between all of the records of a related cluster, which I think would produce a value in the matrix where you're saying it would otherwise default to 0.

This is just to take advantage of the fact that the cluster will grow if it's fully connected with any new potential records.

fgregg commented 10 years ago

Talked this over with @chancyk in IRC.

I think his approach is too coupled with implementation details of how we are currently doing clustering. I recommend.

webmaven commented 10 years ago

> Look at all the 'cluster scores' to decide which if any cluster to merge with.

And if no 'cluster scores' meet some (arbitrary) threshold, start a new singleton cluster? Or is that what was created in the 'blocking' operation?

How does this affect the construction of a 'representative' record for the cluster (what I was calling a 'canonical' record)?

fgregg commented 10 years ago

Closed by #295.