Pull request to merge a new feature in pandas-dedupe: gazetteer_dataframe.
This feature will complete the 'deduplication menu' offered by pandas-dedupe: (i) standard deduplication, (ii) gazetteer deduplication and (iii) link dataframes.
The main reason for adding the gazetteer is that centroids in hierarchical clustering correspond to the most frequent observation in each cluster. This can affect the canonicalization process if you update your messy data over time, because the most frequent observation may change. The gazetteer fixes this issue by keeping the centroids fixed (i.e. centroids are taken from the gazette).
Also, the gazetteer approach can be faster if you have a gazette available :)
Two main specs of the gazetteer:
gazetteer_dataframe can assign each record to more than one cluster.
My view is that this is not really desirable. Therefore, gazetteer_dedupe assigns each record to the cluster it is most confident about.
gazetteer_dataframe accepts only one variable for deduplication and canonicalization (no blocking allowed).
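To illustrate the first point, here is a minimal sketch of keeping only the most confident cluster per record. It is not the code in this PR: the function `keep_most_confident` and the column names `record_id`, `cluster_id` and `confidence` are hypothetical, and it assumes the candidate matches are already available as a pandas DataFrame with one row per (record, cluster) pair.

```python
import pandas as pd


def keep_most_confident(matches: pd.DataFrame) -> pd.DataFrame:
    """Keep, for each record, only the candidate cluster with the highest confidence.

    Assumes hypothetical columns: record_id, cluster_id, confidence.
    """
    # Index label of the highest-confidence row within each record's candidates.
    best_rows = matches.groupby("record_id")["confidence"].idxmax()
    return matches.loc[best_rows].reset_index(drop=True)


# Illustrative data: records 0 and 2 each match two clusters.
candidates = pd.DataFrame(
    {
        "record_id": [0, 0, 1, 2, 2],
        "cluster_id": [10, 11, 10, 12, 11],
        "confidence": [0.62, 0.91, 0.75, 0.40, 0.38],
    }
)

best = keep_most_confident(candidates)
print(best)
```

After this step every record appears exactly once, so canonicalization has a single cluster to draw from.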
I tested and it seemed ok to me.
But feel free to have a look and suggest amendments.
This PR closes the issue: #2 [and helps with this one: #13 ]
List of changes:
gazetteer_dataframe.py - main script for gazetteer deduplication;
utility_functions.py - I've cleaned the code a bit. These are very minor changes.
Readme - I've updated the readme to include documentation for the gazetteer.
.gitignore - I've added a few extensions.
init - added the gazetteer.
dedupe_dataframe.py - I've improved the docstrings a bit.