ShelterApp / AddResources

http://shelterapp.org/

Data De-duplication #73

Open picklesueat opened 3 years ago

picklesueat commented 3 years ago

All data eventually has to be verified manually before it enters the final services collection. We want to de-dupe our DB both to avoid wasting volunteer time during manual verification and to ensure duplicates don't appear in ShelterApp.

Record linkage methods:

  1. Deterministic: rule-based matching
  2. Fuzzy matching (probabilistic matching): the current implementation for each scraper, in utils.py (see the sketch below)
  3. Machine learning (built on fuzzy matching): dedupe
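
For concreteness, here's a minimal sketch of what option 2 could look like, using rapidfuzz to score similarity. The field names, scorer, and threshold are all assumptions; the actual logic in utils.py may differ.

```python
from rapidfuzz import fuzz

NAME_THRESHOLD = 90  # assumed cutoff; would need tuning against real data

def is_probable_duplicate(svc_a: dict, svc_b: dict) -> bool:
    """Flag two service records as likely duplicates when both the name
    and the address are highly similar (order-insensitive token match)."""
    name_score = fuzz.token_sort_ratio(svc_a["name"], svc_b["name"])
    addr_score = fuzz.token_sort_ratio(svc_a["address"], svc_b["address"])
    return name_score >= NAME_THRESHOLD and addr_score >= NAME_THRESHOLD
```

A single boolean cutoff is simple, but it throws away the underlying score, which option 3 (and the ranking idea below) would want to keep.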

We also probably want some way to tell whether it's working, whether that's through a visualization, some examples of duplicates it resolves, or a combination of the two.

Finally, on the engineering side, we're going to need to perform the de-duplication on the full database.

I was also thinking that if we go with 2 or 3, we could make this even more efficient by ranking which items are least likely to be duplicates and showing those to the volunteers first, sketched below.
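
A rough sketch of that triage idea, with a fuzzy name score standing in for whatever match score method 2 or 3 ends up producing (all names here are hypothetical):

```python
from rapidfuzz import fuzz

def rank_for_review(candidate_pairs):
    """Order candidate pairs from least to most likely duplicate, so
    volunteers clear the probably-not-duplicate cases first."""
    scored = [
        (fuzz.token_sort_ratio(a["name"], b["name"]), a, b)
        for a, b in candidate_pairs
    ]
    scored.sort(key=lambda item: item[0])  # ascending: least likely first
    return scored
```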

There may also be better or easier ways of doing this; these are just the methods I found with some preliminary research.

picklesueat commented 3 years ago

Update

TO-DO

  1. Current dupe script
    1. Add more blocking rules (in addition to zip code) to the duplicate-finding logic in utils.py (see the blocking sketch after this list)
    2. Maybe use an SVM to learn how to weight the different attributes
    3. Add thorough tests to make sure it is working as intended
  2. Run dedupe.io on the same DB (see the dedupe sketch below)
  3. Compare and contrast the results in some kind of Jupyter notebook that shows which duplicates were picked up by both, which by neither, plus some graphs
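
As a reference point for item 1.1, here's a rough sketch of how multiple blocking rules could coexist. The zip-code rule mirrors the existing behavior described above; the (city, name-prefix) rule and all field names are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def pairs_from_block(records, key_fn):
    """Group records by a blocking key; only records sharing a key are
    compared, which keeps the number of pairwise comparisons manageable."""
    blocks = defaultdict(list)
    for rec in records:
        key = key_fn(rec)
        if key is not None:
            blocks[key].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

BLOCKING_RULES = [
    lambda r: r.get("zip"),                                # existing rule
    lambda r: (r.get("city"), (r.get("name") or "")[:4]),  # assumed new rule
]

def candidate_pairs(records):
    """Union of the pairs produced by every blocking rule, without repeats."""
    seen = set()
    for rule in BLOCKING_RULES:
        for a, b in pairs_from_block(records, rule):
            pair_key = frozenset((id(a), id(b)))
            if pair_key not in seen:
                seen.add(pair_key)
                yield a, b
```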
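
And for item 2, a minimal sketch of driving the dedupe library, assuming its 2.x API; the toy data, field definition, and threshold are placeholders, and real training needs many labeled examples.

```python
import dedupe

# Toy records keyed by id; real data would come from the services collection.
data = {
    1: {"name": "Hope Shelter", "address": "12 Main St"},
    2: {"name": "Hope Shelter Inc", "address": "12 Main Street"},
    3: {"name": "City Food Bank", "address": "99 Oak Ave"},
}

# Which attributes the model compares (field names assumed).
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)
dedupe.console_label(deduper)  # interactive yes/no labeling in the terminal
deduper.train()

# Each cluster is a tuple of (record ids, per-record confidence scores).
clusters = deduper.partition(data, threshold=0.5)
```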