Closed: fgregg closed this issue 10 years ago
Which option to take seems to depend on the use case. Option 2 introduces some instability in prior data that may be undesirable for use cases that depend on a stable historical record, or for use cases where I have a very good handle on the "representative" records but not on future updates. But option 2 might be better for use cases designed to generate routine snapshots, where the data are consistent within a snapshot but not between snapshots.
On Tue, Feb 04, 2014 at 08:46:19AM -0800, Forest Gregg wrote:
Extend dedupe to handle duplicate identification on an N+1 basis, allowing for on-the-fly detection? I.e., given a known set of contact data with N rows that has undergone blocking, determine whether the new (N+1)th record is a duplicate of a record in N.
I can see two broad approaches. In the first, for every cluster of duplicate records we either choose or construct a 'representative' record. The task is then to match the new record against one of these 'representative' records. If we do match, we assign the new record to that cluster.
In the second approach, we keep track of all the data and their current cluster assignments. When we get a new record, we recluster the relevant records. This approach requires keeping more information around and is more computationally expensive. It is also possible that adding a new record will cause a currently clustered record to be removed from its cluster. On the other hand, this latter approach could be a lot more accurate.
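The first approach can be sketched in a few lines. This is a toy illustration, not dedupe's actual API: the `similarity` function, the field-overlap scoring, and the 0.5 threshold are all invented stand-ins for a learned pairwise scorer.

```python
# Sketch of approach 1: match a new record against one 'representative'
# record per cluster. Everything here is hypothetical illustration.

def similarity(a, b):
    """Toy field-overlap score in [0, 1]; a real system would use a
    learned model over per-field distances."""
    fields = set(a) | set(b)
    return sum(a.get(f) == b.get(f) for f in fields) / len(fields)

def assign(new_record, representatives, threshold=0.5):
    """Return the ID of the best-matching cluster, or None to signal
    that a new singleton cluster should be created."""
    best_id, best_score = None, threshold
    for cluster_id, rep in representatives.items():
        score = similarity(new_record, rep)
        if score >= best_score:
            best_id, best_score = cluster_id, score
    return best_id

representatives = {
    1: {'name': 'Bill Smith', 'city': 'Chicago'},
    2: {'name': 'William Smith', 'city': 'Chicago'},
}
print(assign({'name': 'William Smith', 'city': 'Chicago'}, representatives))
```

Under this scheme, clusters can only grow or be born; no new record can ever evict an existing member from its cluster.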
Thoughts @emphanos, @nikitsaraf, @markhuberty, @michaelwick?
@markhuberty, it should be straightforward to keep track of all the clusters we have ever assigned a record to. Basically, a kind of version control for cluster assignments.
Assuming that was done, would option 1 still be desirable for some cases?
@fgregg In the case of (1), how do you see the creation of a representative record for a given cluster? Are you thinking of picking the two records that are farthest apart within a cluster (based on some metric) plus an average record, and matching against all three as a bracket? Or simply picking an average record, assigning it to the cluster, and running the analysis against the matching cluster(s)? There might be more than one cluster the N+1 record matches.
@emphanos, good question, but I think it is a bit further down the line.
Right now, I'd like to just take up the question of whether we want the possible behavior after seeing the N+1th record to be
A. add it to an existing cluster
B. create a new, singleton cluster
OR
A. add it to an existing cluster
B. split an existing cluster and add the record to one of the new clusters
C. create a new, singleton cluster
@fgregg What are the upsides and downsides of creating multiple singleton clusters? I'm assuming that for most online applications, recomputing clusters on an ongoing basis would be feasible (and advisable), but not while data is coming in at peak times. So it might make sense to tackle this question by specific use case. An always-online application would have new records arriving at all times that need deduping. The more common scenario is probably one where N gets clustered nightly or on some set interval; in the interim, up to M records are clustered on the fly, and at the next full run the entire set of N+M records is reclustered, with online clustering resuming from (N+M)+1.
@emphanos, sorry for not being clear enough.
I'm trying to distinguish between two broad classes of behavior.
In the first class, clusters can only grow or be born. I.e., when we see a new record, we either add it to an existing cluster or, if it doesn't match any existing cluster, we start a new cluster with this record as the sole element.
In the second class, clusters can also split. I.e., when we see a new record, we recluster it together with similar records. This can result in (a) the record being added to an existing cluster, (b) an existing cluster splitting into two smaller clusters, one of which contains the new record, or (c) a new cluster being created with this record as its sole element.
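A toy example can show how the second class of behavior splits a cluster. Here 1-D points and complete-linkage hierarchical clustering stand in for dedupe's learned pairwise scores; the coordinates and the 1.2 cutoff are invented for the demonstration.

```python
# Toy demo: a full recluster after a new record arrives can split an
# existing cluster, which the grow-only behavior class never allows.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

cutoff = 1.2

# Before: records A=0.0 and B=1.0 sit in one cluster (distance 1.0 < cutoff).
before = np.array([[0.0], [1.0]])
labels_before = fcluster(linkage(pdist(before), method='complete'),
                         t=cutoff, criterion='distance')

# After new record C=1.9 arrives: B pairs with C (distance 0.9), and under
# complete linkage the A-C distance of 1.9 pushes A out into a singleton.
after = np.array([[0.0], [1.0], [1.9]])
labels_after = fcluster(linkage(pdist(after), method='complete'),
                        t=cutoff, criterion='distance')
print(labels_before, labels_after)
```

The old {A, B} cluster no longer survives: B moves into a cluster with C, and A becomes a singleton.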
Make sense?
Inevitably the coreference system will have errors, many of which occur due to the system's current lack of information about the authors. New incoming author mentions are likely to provide the system with new information about its current hypothesis and ultimately enable the system to correct some of its past mistakes. As a hypothetical illustration: imagine that your database contains two author mentions that actually refer to the same author, but the coreference system predicts they refer to different authors because the two mentions' co-author lists do not overlap. Now suppose you observe a third mention of the real-world author whose co-author list overlaps with both of the original two mentions' lists. The first kind of system must choose between the two original author mentions, but the second might be able to rectify the original recall error and merge all three mentions into the same cluster (based on the new co-author evidence).
We have actually observed this phenomenon (allowing a system to change its mind improves accuracy as new mentions arrive) in both author coreference and cross-document noun-phrase coreference. For example, see Table 3 in our workshop paper http://people.cs.umass.edu/~sameer/files/linking-akbc13.pdf, in which we report the accuracy of the original "database" as new mentions arrive and the system changes its mind about past decisions. Generally the accuracy improves, although I can imagine scenarios in which new ambiguous mentions might confuse the system and cause errors.
Ultimately this boils down to a speed/accuracy trade-off: the first system being faster, and the second system being more accurate.
@fgregg, I am looking to implement a system that is almost a hybrid of the two scenarios you've mentioned once user interaction is considered: we operate under scenario 2 if a user has never validated the records, but if a user has validated a cluster of records as correct, that cluster is locked against splitting and may only grow (scenario 1). The other side effect of user validation is a set of exclusion pairs: if a user excludes one record R from a cluster of N records as incorrect, then record R must never again be clustered with the remaining N-1 records.
Thus the clustering algorithm would need knowledge of clusters that may only grow and of excluded pairs that must never be clustered with each other.
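The exclusion-pair constraint above can be sketched as a simple membership check. This is a hypothetical illustration, not dedupe's implementation; the function name, data shapes, and example IDs are invented.

```python
# Sketch of the cannot-link constraint from user validation: a record
# may join a cluster only if it is not excluded against any member.

def can_join(record_id, cluster_members, excluded_pairs):
    """excluded_pairs is a set of frozensets, each an unordered pair of
    record IDs that a user has ruled out as a match."""
    return all(frozenset((record_id, m)) not in excluded_pairs
               for m in cluster_members)

clusters = {1: {'a', 'b'}}          # cluster 1 is locked (user-validated)
excluded = {frozenset(('c', 'a'))}  # user pulled 'c' out of a's cluster

print(can_join('c', clusters[1], excluded))  # blocked by the exclusion pair
print(can_join('d', clusters[1], excluded))  # free to join if it scores well
```

The must-link side (locked clusters) would be enforced separately, by never re-scoring pairs inside a validated cluster.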
We're also considering canonicalization which you previously referenced in this issue.
Cool. Let's chat about this soon.
So I've been looking at the current clustering algorithm to see if there's a way to structure the input data such that the streaming behavior is achieved. I was able to make a few observations:
Given the above, fastcluster.linkage can be used to grow locked clusters by passing in only new data that has a score against each record in an existing cluster, and where the new data has IDs larger than those of the existing records. This preserves the cluster IDs and ensures that comparisons that would remove records from a cluster are never performed.
A second pass could then be performed, excluding any records in a locked cluster, to allow the clustering algorithm to re-cluster everything else at will. The problem seems to be how to express what is happening to the clusters, especially if the clustering is being used to persist an entity between runs, since a new cluster may end up containing multiple records whose IDs were previously cluster IDs, as in the following example (not based on actual scores):
A new record, 'Will Smith', has been introduced with record ID 7 and no cluster ID.
ID | Locked | Cluster ID | Name |
---|---|---|---|
1 | 1 | 1 | Bill Smith |
2 | 0 | 2 | William Smith Jr |
3 | 0 | 2 | William Smith |
4 | 0 | 4 | Johnny Will Smith |
5 | 1 | 1 | Will John Smith |
6 | 0 | 6 | Willard Smith |
7 | 0 | | Will Smith |
After a dedupe run, records 3, 4, 6, and 7 are now joined into a cluster with ID 3. Two previously existing cluster IDs, 4 and 6, have been dissolved in favor of a new ID that did not previously exist, 3. Maybe 'Will John Smith' was also a match, but it was excluded due to being previously locked.
ID | Locked | Cluster ID | Name |
---|---|---|---|
1 | 1 | 1 | Bill Smith |
2 | 0 | 2 | William Smith Jr |
3 | 0 | 3 | William Smith |
4 | 0 | 3 | Johnny Will Smith |
5 | 1 | 1 | Will John Smith |
6 | 0 | 3 | Willard Smith |
7 | 0 | 3 | Will Smith |
This seems like business logic, but in this situation would we want to re-point everything to 3, or choose either 4 or 6 as the cluster that grew? There's no real significance to which record happens to share an ID with the cluster ID, which might imply different behavior than if we were building a canonical record to represent the cluster.
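One possible answer to the ID question is a sketch like the following (a hypothetical policy, not dedupe's behavior): when a fresh run merges records from several old clusters, re-point the new cluster to the smallest surviving old cluster ID, so that as few records as possible change IDs. For simplicity this example only covers old singleton clusters 4 and 6 plus the new record 7, sidestepping the partial split of cluster 2.

```python
# Hypothetical policy for stabilizing cluster IDs across runs.

def stabilize(old_ids, new_ids):
    """old_ids / new_ids map record ID -> cluster ID (new_ids comes
    from the fresh run). Returns new_ids re-pointed so each new
    cluster reuses the smallest old cluster ID among its members."""
    members = {}
    for rid, label in new_ids.items():
        members.setdefault(label, []).append(rid)
    stable = {}
    for label, rids in members.items():
        olds = [old_ids[r] for r in rids if r in old_ids]
        stable[label] = min(olds) if olds else label
    return {rid: stable[label] for rid, label in new_ids.items()}

# Records 4 and 6 were singleton clusters; the fresh run joins them
# with new record 7 under the brand-new label 3.
print(stabilize({4: 4, 6: 6}, {4: 3, 6: 3, 7: 3}))
```

A caveat: if an old cluster splits across two new clusters, this naive policy can assign the same old ID twice, so a real implementation would need a collision check.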
Is point 2 really true? I'm not sure it is.
I'm out on vacation but I'll confirm point 2 next week.
@fgregg, the notebook below displays the behavior I was referring to with point 2.
http://nbviewer.ipython.org/gist/chancyk/1a14425a35e9dd048a8b
For my use-cases, @fgregg's first behavior is preferable, with a few caveats:
Thanks very much for this, @chancyk; you really helped illuminate something that I did not fully appreciate about the clustering.
It really seems like point 2 should not always be true. In particular, looking at your first example, I think that '4' should be included with '1', '2', and '3'. This is happening because, when we don't observe a link between 1 and 4, we fill that cell in the distance matrix with a 0. This is not a very good way to handle missing data.
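A small demonstration of the missing-data problem, reading an unobserved pair as a 0 match score (i.e., maximal distance). The scores, the 0.5 cutoff, and the choice of complete linkage are all invented for the demo, not taken from dedupe.

```python
# Demo: scoring an unobserved pair as 0 can keep a record out of a
# cluster it belongs in, while imputing the score keeps it together.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Condensed distances for records 0..3, pair order:
# (0,1) (0,2) (0,3) (1,2) (1,3) (2,3). Every observed pair scores 0.9
# (distance 0.1); pair (0,3) was never compared.
observed = 1.0 - 0.9
missing_as_zero = np.array([observed, observed, 1.0,  # score 0 -> distance 1
                            observed, observed, observed])
labels_zero = fcluster(linkage(missing_as_zero, method='complete'),
                       t=0.5, criterion='distance')

# Imputing the missing pair from the observed scores keeps all four together.
imputed = np.array([observed] * 6)
labels_imputed = fcluster(linkage(imputed, method='complete'),
                          t=0.5, criterion='distance')
print(labels_zero, labels_imputed)
```

With the missing pair scored as 0, records 0 and 3 can never land in the same cluster below the cutoff, even though every observed comparison says they belong together.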
I've been growing increasingly dissatisfied with how we are doing the clustering, and I'd like to look into implementing connectivity-constrained clustering as in #285.
@fgregg, for point 2 with the current clustering approach, I was planning on performing the usual blocking pass to find related clusters (since everything but the new records has already been clustered), then performing a second pass to force a comparison between all of the records of a related cluster, which I think would produce a value in the matrix where you're saying it would otherwise default to 0.
This is just to take advantage of the fact that the cluster will grow if it's fully connected with any new potential records.
Talked this over with @chancyk in IRC.
I think his approach is too coupled to the implementation details of how we are currently doing clustering. I recommend: look at all the 'cluster scores' to decide which cluster, if any, to merge with.
And if no 'cluster scores' meet some (arbitrary) threshold, start a new singleton cluster? Or is that what was created in the 'blocking' operation?
How does this affect the construction of a 'representative' record for the cluster (what I was calling a 'canonical' record)?
closed by #295