Lyonk71 / pandas-dedupe

Simplifies use of the Dedupe library via Pandas

Add gazetteer #34

Closed ieriii closed 3 years ago

ieriii commented 3 years ago

This pull request adds a new feature to pandas-dedupe: gazetteer_dataframe. It completes the 'deduplication menu' offered by pandas-dedupe: (i) standard deduplication, (ii) gazetteer deduplication, and (iii) linking dataframes.

The main reason for adding the gazetteer is that, in hierarchical clustering, the centroid of each cluster corresponds to the most frequent observation in that cluster. This can affect the canonicalization process if you update your messy data over time, because the most frequent observation in a cluster may change. The gazetteer fixes this issue by keeping the centroids fixed (i.e. the centroids are taken from the gazette). Also, the gazetteer approach can be faster if you have a gazette available :)
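
To illustrate the point about drifting centroids, here is a toy sketch in plain Python. It is not how pandas-dedupe or Dedupe implement canonicalization; the function names, the example records, and the gazette entry are all made up for illustration, and the matching step is assumed to have already happened.

```python
from collections import Counter


def most_frequent_canonical(cluster):
    """Hypothetical helper: use the most frequent record in a cluster as its canonical form."""
    return Counter(cluster).most_common(1)[0][0]


def gazette_canonical(cluster, gazette_entry):
    """Hypothetical helper: the canonical form is the fixed gazette entry,
    whatever the cluster currently contains."""
    return gazette_entry


# Illustrative messy records that were all matched to the same entity.
cluster_before = ["IBM Corp", "IBM Corp", "I.B.M."]
# Later, more messy data arrives and lands in the same cluster.
cluster_after = cluster_before + ["I.B.M.", "I.B.M."]

# Frequency-based canonical form drifts as the data is updated.
before = most_frequent_canonical(cluster_before)
after = most_frequent_canonical(cluster_after)

# Gazette-based canonical form stays the same.
gazette_entry = "IBM Corporation"  # assumed entry in the clean gazette
fixed_before = gazette_canonical(cluster_before, gazette_entry)
fixed_after = gazette_canonical(cluster_after, gazette_entry)

print(before, "->", after)
print(fixed_before, "->", fixed_after)
```

With frequency-based canonicalization the canonical name changes once the new records outnumber the old ones, whereas the gazette entry stays the same across updates.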

Two main specs of the gazetteer:

I tested it and it seemed OK to me, but feel free to have a look and suggest amendments.

This PR closes issue #2 [and helps with this one: #13].

List of changes: