dorg-ekrolewicz opened this issue 6 years ago
The current code in `csvdedupe.py` misses any records that have exact duplicates but no near duplicates: such records never appear in the `clustered_dupes` returned from `deduper.match()`, so their exact duplicates are dropped.
Original code:

```python
clustered_dupes = deduper.match(unique_d, threshold)
expanded_clustered_dupes = []
for cluster, scores in clustered_dupes:
    new_cluster = list(cluster)
    new_scores = list(scores)
    for row_id, score in zip(cluster, scores):
        children = parents.get(row_id, [])
        new_cluster.extend(children)
        new_scores.extend([score] * len(children))
    expanded_clustered_dupes.append((new_cluster, new_scores))
clustered_dupes = expanded_clustered_dupes
```

Code with a fix that works locally for the exact duplicates not caught above:
```python
clustered_dupes = deduper.match(unique_d, threshold)
expanded_clustered_dupes = []
rows_used = set()
for cluster, scores in clustered_dupes:
    new_cluster = list(cluster)
    new_scores = list(scores)
    for row_id, score in zip(cluster, scores):
        children = parents.get(row_id, [])
        new_cluster.extend(children)
        new_scores.extend([score] * len(children))
    # Track every row id that already belongs to a cluster.
    rows_used.update(new_cluster)
    expanded_clustered_dupes.append((new_cluster, new_scores))
# Add any parents that have exact duplicates but no near-duplicate cluster
# to expanded_clustered_dupes, or else they are counted as non-duplicates.
for row, exact_dups in parents.items():
    if row not in rows_used and exact_dups:
        new_cluster = [row] + list(exact_dups)
        new_scores = [1.0] * len(new_cluster)
        expanded_clustered_dupes.append((new_cluster, new_scores))
clustered_dupes = expanded_clustered_dupes
```
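To see the failure mode in isolation, here is a self-contained toy run of the expansion logic above (the row ids and the shape of `parents` are hypothetical, not taken from csvdedupe; `parents` is assumed to map a representative row id to the ids of its exact duplicates, and `clustered_dupes` stands in for what `deduper.match()` would return):

```python
# Rows 3 and 4 are exact copies of row 2, but row 2 has no near
# duplicates, so deduper.match() never puts it in a cluster.
parents = {0: [1], 2: [3, 4]}
clustered_dupes = [((0, 5), (0.9, 0.9))]

expanded, rows_used = [], set()
for cluster, scores in clustered_dupes:
    new_cluster, new_scores = list(cluster), list(scores)
    for row_id, score in zip(cluster, scores):
        children = parents.get(row_id, [])
        new_cluster.extend(children)
        new_scores.extend([score] * len(children))
    rows_used.update(new_cluster)
    expanded.append((new_cluster, new_scores))

# Without this second pass, rows 2, 3 and 4 are silently dropped.
for row, exact_dups in parents.items():
    if row not in rows_used and exact_dups:
        expanded.append(([row] + list(exact_dups),
                         [1.0] * (len(exact_dups) + 1)))

# The exact-duplicate-only group (2, 3, 4) now appears as its own cluster.
print(expanded)
```

The first pass only expands exact duplicates whose parent already sits in a near-duplicate cluster, which is why the second pass over `parents` is needed at all.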
Wow! This is such an important issue. I'm very grateful for @dorg-ekrolewicz calling it out, and especially for @batesmotel34 offering a fix.
I ran this on a file with a little over 12K rows, and the count of what I'm calling "single-row clusters" drops from ~10,000 rows without the fix to only ~1,500 with it...
That's 8500 false negatives the fix converted to true-duplicate clusters.
In the results shown above, the algorithm does a great job of assigning Cluster ID = 0 to a contact with various title changes, but for some reason it assigns different cluster IDs to identical rows ("Christine Wack" has multiple cluster IDs). Christine's case seems to be the trivial one, so why would we get different cluster IDs there? (The same goes for Tom Baty.)
Any advice/help on where to look is much appreciated.