Currently, gazetteer_example.py has an issue with the cluster_id assignment (see #134).
This pull request resolves that issue by assigning a unique cluster_id to each entry in the messy dataset, and then assigning that same cluster_id to all of its matches from the canonical dataset. An entry in the canonical dataset can therefore have multiple cluster_ids, one for each messy entry it matches. The script then outputs a CSV that can be sorted by cluster_id to see each entry in the messy dataset alongside all its corresponding matches from the canonical dataset.
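
The assignment scheme described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this pull request: the function names `assign_cluster_ids` and `write_clusters`, the `source` and `record_id` columns, and the shape of the `matches` input (a dict mapping each messy record id to a list of canonical record ids) are all assumptions made for the example.

```python
import csv
import io


def assign_cluster_ids(matches):
    """Give each messy record its own cluster_id and share it with its matches.

    `matches` is a hypothetical input shape: a dict mapping a messy record id
    to the list of canonical record ids it matched.
    """
    rows = []
    for cluster_id, (messy_id, canonical_ids) in enumerate(matches.items()):
        # One cluster per messy record.
        rows.append({"cluster_id": cluster_id, "source": "messy", "record_id": messy_id})
        # Every canonical match gets the same cluster_id. A canonical record
        # matched by several messy records appears once per cluster.
        for canonical_id in canonical_ids:
            rows.append({"cluster_id": cluster_id, "source": "canonical", "record_id": canonical_id})
    return rows


def write_clusters(rows, handle):
    """Write the rows as CSV, sorted by cluster_id (column names are assumed)."""
    writer = csv.DictWriter(handle, fieldnames=["cluster_id", "source", "record_id"])
    writer.writeheader()
    writer.writerows(sorted(rows, key=lambda row: row["cluster_id"]))


# Example: canonical record "c2" matches both messy records.
example_matches = {"m1": ["c1", "c2"], "m2": ["c2"]}
example_rows = assign_cluster_ids(example_matches)
buffer = io.StringIO()
write_clusters(example_rows, buffer)
```

In this sketch the canonical record `c2` ends up in two rows with different cluster_ids, which is the behavior the description above allows.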