dssg / pgdedupe

A simple command line interface to the datamade/dedupe library.
https://pgdedupe.readthedocs.io

Make the minimum set of non-null fields configurable #3

Closed mbauman closed 7 years ago

mbauman commented 7 years ago

Since the script automatically merges exact duplicate rows, it needs to avoid overzealously merging rows that have too many missing values to reliably establish that they refer to the same identity. For example, the script currently looks for this hard-coded condition:

last_name is not null AND
    (ssn is not null
    OR (first_name is not null AND dob is not null))
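
One way such a requirement could be made configurable is to describe it as a small nested rule and render the SQL from that. The following is only a rough sketch, not what the script does: the names `non_null_clause` and `MIN_FIELDS` and the rule format (a column name, or an `("and" | "or", [sub-rules])` pair) are hypothetical and chosen for illustration.

```python
def non_null_clause(rule):
    """Render a nested rule as a SQL boolean expression.

    A rule is either a column name (str) or a pair (op, sub_rules),
    where op is "and" or "or". This format is an assumption for
    illustration, not an existing configuration schema.
    """
    if isinstance(rule, str):
        # Column names are interpolated into SQL, so accept only
        # plain identifiers here.
        if not rule.isidentifier():
            raise ValueError(f"not a plain column name: {rule!r}")
        return f"{rule} is not null"
    op, parts = rule
    if op not in ("and", "or"):
        raise ValueError(f"unknown operator: {op!r}")
    joined = f" {op.upper()} ".join(non_null_clause(p) for p in parts)
    return f"({joined})"


# Hypothetical configuration mirroring the condition quoted above.
MIN_FIELDS = (
    "and",
    ["last_name", ("or", ["ssn", ("and", ["first_name", "dob"])])],
)

print(non_null_clause(MIN_FIELDS))
# (last_name is not null AND (ssn is not null OR (first_name is not null AND dob is not null)))
```

With a structure like this, a different dataset could supply its own minimum set of fields without any change to the merging query itself.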

mbauman commented 7 years ago

Fixed by https://github.com/dssg/superdeduper/commit/c0f65b67a7bed48130f61d8428b4cbcd0cb97777