dedupeio / dedupe-examples

:id: Examples for using the dedupe library
MIT License
404 stars 216 forks source link

Resolving cluster_id issue in gazetteer_example.py #135

Open jmiller558 opened 1 year ago

jmiller558 commented 1 year ago

Currently gazetteer_example.py has an issue with the cluster_id assignment. (see #134 )

This pull resolves that issue by assigning a unique cluster_id to each entry in the messy dataset, and then assigning that same cluster_id to all the matches from the canonical dataset. It allows entries in the canonical dataset to have multiple cluster_ids, and then outputs a csv that can be sorted by cluster_id to see each entry in messy dataset and all its corresponding matches from the canonical dataset.