ShelterApp / AddResources

http://shelterapp.org/

Data De-duplication #73

Open picklesueat opened 3 years ago

picklesueat commented 3 years ago

All data eventually has to be verified manually before it enters the final services collection. We want to de-dupe our DB both to avoid wasting volunteer time during manual verification and to ensure duplicates don't appear in ShelterApp.

Record linkage methods:

  1. Deterministic: rule-based matching
  2. Fuzzy matching (probabilistic matching): the current implementation for each scraper, in utils.py (see the sketch below)
  3. Machine learning (built on fuzzy matching): dedupe
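
For concreteness, here's a minimal sketch of what option 2 could look like, using rapidfuzz to score similarity. The field names, scorer, and threshold are all assumptions; the actual logic in utils.py may differ.

```python
from rapidfuzz import fuzz

NAME_THRESHOLD = 90  # assumed cutoff; would need tuning against real data

def is_probable_duplicate(svc_a: dict, svc_b: dict) -> bool:
    """Flag two service records as likely duplicates when both the name
    and the address are highly similar (order-insensitive token match)."""
    name_score = fuzz.token_sort_ratio(svc_a["name"], svc_b["name"])
    addr_score = fuzz.token_sort_ratio(svc_a["address"], svc_b["address"])
    return name_score >= NAME_THRESHOLD and addr_score >= NAME_THRESHOLD
```

A single boolean cutoff is simple, but it throws away the underlying score, which option 3 (and the ranking idea below) would want to keep.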

We also probably want some way to tell whether it's working, whether that's through a visualization, some examples of duplicates it resolves, or a combination of the two.

Finally, on the engineering side, we're going to need to perform the de-duplication on the full database.

I was also thinking that if we go with 2 or 3, we could make this even more efficient by ranking which items are least likely to be duplicates and showing those to the volunteers first, sketched below.
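
A rough sketch of that triage idea, with a fuzzy name score standing in for whatever match score method 2 or 3 ends up producing (all names here are hypothetical):

```python
from rapidfuzz import fuzz

def rank_for_review(candidate_pairs):
    """Order candidate pairs from least to most likely duplicate, so
    volunteers clear the probably-not-duplicate cases first."""
    scored = [
        (fuzz.token_sort_ratio(a["name"], b["name"]), a, b)
        for a, b in candidate_pairs
    ]
    scored.sort(key=lambda item: item[0])  # ascending: least likely first
    return scored
```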

There may also be better or easier ways of doing this; these are just the methods I found with some preliminary research.

picklesueat commented 3 years ago

Update

TO-DO

  1. Current dupe script
    1. Add more blocking rules (in addition to zip code) to the duplicate-finding logic in utils.py (see the blocking sketch after this list)
    2. Maybe use an SVM to learn how to weight the different attributes
    3. Add thorough tests to make sure it is working as intended
  2. Run dedupe.io on the same DB (see the dedupe sketch below)
  3. Compare and contrast the results in some kind of Jupyter notebook that shows which duplicates were picked up by both, which by neither, plus some graphs
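
As a reference point for item 1.1, here's a rough sketch of how multiple blocking rules could coexist. The zip-code rule mirrors the existing behavior described above; the (city, name-prefix) rule and all field names are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def pairs_from_block(records, key_fn):
    """Group records by a blocking key; only records sharing a key are
    compared, which keeps the number of pairwise comparisons manageable."""
    blocks = defaultdict(list)
    for rec in records:
        key = key_fn(rec)
        if key is not None:
            blocks[key].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

BLOCKING_RULES = [
    lambda r: r.get("zip"),                                # existing rule
    lambda r: (r.get("city"), (r.get("name") or "")[:4]),  # assumed new rule
]

def candidate_pairs(records):
    """Union of the pairs produced by every blocking rule, without repeats."""
    seen = set()
    for rule in BLOCKING_RULES:
        for a, b in pairs_from_block(records, rule):
            pair_key = frozenset((id(a), id(b)))
            if pair_key not in seen:
                seen.add(pair_key)
                yield a, b
```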
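
And for item 2, a minimal sketch of driving the dedupe library, assuming its 2.x API; the toy data, field definition, and threshold are placeholders, and real training needs many labeled examples.

```python
import dedupe

# Toy records keyed by id; real data would come from the services collection.
data = {
    1: {"name": "Hope Shelter", "address": "12 Main St"},
    2: {"name": "Hope Shelter Inc", "address": "12 Main Street"},
    3: {"name": "City Food Bank", "address": "99 Oak Ave"},
}

# Which attributes the model compares (field names assumed).
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)
dedupe.console_label(deduper)  # interactive yes/no labeling in the terminal
deduper.train()

# Each cluster is a tuple of (record ids, per-record confidence scores).
clusters = deduper.partition(data, threshold=0.5)
```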