dedupeio / csvdedupe

Command line tool for deduplicating CSV files

Cluster ID is different for exact records #88

Open dorg-ekrolewicz opened 6 years ago

dorg-ekrolewicz commented 6 years ago
[Screenshot from 2018-09-13: results table showing the cluster ID assigned to each contact row]

In the results shown above, the algorithm does a great job of assigning Cluster ID = 0 to a contact with various title changes, but for some reason it assigns different cluster IDs to identical rows ("Christine Wack" has multiple cluster IDs). Christine's case seems to be the trivial one, so why would we get different cluster IDs there? (The same goes for Tom Baty.)

Any advice/help on where to look is much appreciated.

batesmotel34 commented 5 years ago

The current code misses any record that has exact duplicates but no near duplicates: such records never appear in the `clustered_dupes` returned by `deduper.match()` in csvdedupe.py.
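To see why, here is a minimal sketch of the failure mode (the rows and the collapsing logic are hypothetical; it assumes, as the snippets below suggest, that csvdedupe folds exact duplicates into a `parents` mapping of representative row id → duplicate row ids before calling `deduper.match()` on the de-duplicated `unique_d`):

```python
# Hypothetical input: rows 0 and 1 are exact duplicates,
# rows 2 and 3 are near duplicates.
records = {
    0: ("Christine Wack", "CFO"),
    1: ("Christine Wack", "CFO"),   # exact duplicate of row 0
    2: ("Tom Baty", "VP"),
    3: ("Thomas Baty", "VP"),       # near duplicate of row 2
}

# Collapse exact duplicates: keep one representative per distinct record,
# and remember the folded-away rows in `parents`.
unique_d, parents, seen = {}, {}, {}
for row_id, rec in records.items():
    if rec in seen:
        parents[seen[rec]].append(row_id)   # fold exact dupe into its parent
    else:
        seen[rec] = row_id
        unique_d[row_id] = rec
        parents[row_id] = []

# deduper.match() only clusters NEAR duplicates among unique_d, so a
# plausible result contains rows 2/3 but never rows 0/1:
clustered_dupes = [((2, 3), (0.9, 0.9))]

clustered_ids = {i for cluster, _ in clustered_dupes for i in cluster}
# Row 0 has an exact duplicate (row 1) but appears in no cluster, so the
# expansion loop never emits it and both rows look like non-duplicates.
print(0 in clustered_ids)   # False
print(parents[0])           # [1]
```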

Original code:

```python
clustered_dupes = deduper.match(unique_d, threshold)

expanded_clustered_dupes = []
for cluster, scores in clustered_dupes:
    new_cluster = list(cluster)
    new_scores = list(scores)
    for row_id, score in zip(cluster, scores):
        children = parents.get(row_id, [])
        new_cluster.extend(children)
        new_scores.extend([score] * len(children))
    expanded_clustered_dupes.append((new_cluster, new_scores))

clustered_dupes = expanded_clustered_dupes
```

Code with a fix that works locally for the exact duplicates not caught above:

```python
clustered_dupes = deduper.match(unique_d, threshold)

expanded_clustered_dupes = []
rows_used = set()
for cluster, scores in clustered_dupes:
    new_cluster = list(cluster)
    new_scores = list(scores)
    for row_id, score in zip(cluster, scores):
        children = parents.get(row_id, [])
        new_cluster.extend(children)
        new_scores.extend([score] * len(children))
    # Record every row emitted in a cluster so the fallback
    # below does not add it a second time.
    rows_used.update(new_cluster)
    expanded_clustered_dupes.append((new_cluster, new_scores))

# Add a cluster for any parent row that has exact duplicates but was not
# placed in a near-duplicate cluster; otherwise those rows are omitted
# and counted as non-duplicates.
for row, exact_dups in parents.items():
    if row not in rows_used and exact_dups:
        new_cluster = [row]
        new_cluster.extend(exact_dups)
        new_scores = [1.0] * len(new_cluster)
        expanded_clustered_dupes.append((new_cluster, new_scores))

clustered_dupes = expanded_clustered_dupes
```

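A self-contained sketch of this expansion-plus-fallback logic on mock data (the row ids, scores, and `parents` contents are hypothetical; `rows_used` is tracked as a set here so already-clustered parents are skipped):

```python
# Mock output of deduper.match(): one near-duplicate cluster of rows 2 and 3.
clustered_dupes = [((2, 3), (0.9, 0.9))]
# Row 0 has exact duplicate row 1; rows 2 and 3 have no exact duplicates.
parents = {0: [1], 2: [], 3: []}

expanded_clustered_dupes = []
rows_used = set()
for cluster, scores in clustered_dupes:
    new_cluster, new_scores = list(cluster), list(scores)
    for row_id, score in zip(cluster, scores):
        children = parents.get(row_id, [])
        new_cluster.extend(children)
        new_scores.extend([score] * len(children))
    rows_used.update(new_cluster)
    expanded_clustered_dupes.append((new_cluster, new_scores))

# Fallback: emit a cluster for parents that only have exact duplicates.
for row, exact_dups in parents.items():
    if row not in rows_used and exact_dups:
        expanded_clustered_dupes.append(
            ([row] + exact_dups, [1.0] * (1 + len(exact_dups)))
        )

print(expanded_clustered_dupes)
# -> [([2, 3], [0.9, 0.9]), ([0, 1], [1.0, 1.0])]
```

Rows 0 and 1 now come back as a cluster with score 1.0 instead of being silently dropped.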
zacharysyoung commented 2 years ago

Wow! This is such an important issue. I'm very grateful for @dorg-ekrolewicz calling it out, and especially for @batesmotel34 offering a fix.

I ran this on a file with a little over 12K rows. The count of what I'm calling "single-row clusters" dropped from ~10,000 rows without the fix to only ~1,500 rows with it...

That's roughly 8,500 false negatives the fix converted to true-duplicate clusters.