CodeForPittsburgh / food-access-map-data

Data for the food access map
MIT License

Dedupe food stores in merged dataset #16

Closed hellonewman closed 3 years ago

hellonewman commented 4 years ago

We need a scripted way to run through the merged data and flag and remove duplicate entries.

There are a couple of possible approaches:

crocojim18 commented 4 years ago

It's important to note that many of the duplicates will have information that the other records lack. I think it would be best to consolidate that information as fully as possible so we don't lose what's in the other fields.
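A minimal sketch of that kind of consolidation, assuming each record is a plain dict and that "missing" means `None` or a blank string. The function name `consolidate` and the field names are hypothetical, and the rows are made up for illustration, not taken from the merged dataset.

```python
def consolidate(records):
    """Merge duplicate records, keeping the first non-missing value per field."""
    merged = {}
    for record in records:
        for field, value in record.items():
            # Assumption: None and blank strings count as missing; 0 does not.
            missing = value is None or (isinstance(value, str) and not value.strip())
            if not missing and field not in merged:
                merged[field] = value
    return merged


# Made-up rows for illustration only.
duplicates = [
    {"name": "Example Market", "address": "1 Main St", "snap": None, "fresh_produce": 1},
    {"name": "Example Mkt", "address": "", "snap": 1, "fresh_produce": None},
]
merged = consolidate(duplicates)
print(merged)
```

Earlier records win on conflicts here, so the order of the sources would effectively act as a priority ranking; whether that is the right rule is an open question.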

cgmoreno commented 4 years ago

Ellie, a good working example for trying either approach you listed could be the Bloomfield Market. There are three entries for it right now, from three different sources. They differ subtly in name and address, and also in lat/long (though not enough to change the GEOID). To the second point, they also contain different info (SNAP, fresh-produce_healthy, etc. attributes), and we would want to keep as much of that as possible across the three. Whatever the solution for de-duping while maximizing info turns out to be, this would be a good test case. If I have something before the Feb 5 meetup, I'll comment here.
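One way a test like this could be scripted is to flag pairs that are both close together and similarly named. This is only a sketch under assumptions: the `lat`/`lon`/`name` keys, the 150 m and 0.7 thresholds, and the sample rows are all placeholders, not values from the real data.

```python
import math
from difflib import SequenceMatcher


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def likely_duplicates(a, b, max_metres=150.0, min_name_ratio=0.7):
    """Flag two records as probable duplicates if they are close and similarly named."""
    close = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_metres
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return close and ratio >= min_name_ratio


# Placeholder rows: names and coordinates are invented for the example.
row_a = {"name": "Example Market", "lat": 40.4610, "lon": -79.9480}
row_b = {"name": "Example Mkt", "lat": 40.4612, "lon": -79.9478}
row_c = {"name": "Some Other Grocery", "lat": 40.4400, "lon": -79.9900}

print(likely_duplicates(row_a, row_b))
print(likely_duplicates(row_a, row_c))
```

Comparing every pair is quadratic in the number of rows, so a real run would probably want to group candidates first (for example by GEOID) before comparing.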

maxachis commented 4 years ago

I've created four sub-issues related to this:

45 - Research Deduplication

46 - Try deduplication using Uber's H3 System

47 - Examine and Optimize de_dup_fun.R

48 - Develop Python/R Address Stemming for Deduplication
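For the address stemming idea, a rough sketch of what normalization might look like in Python. The `normalize_address` name and the tiny suffix table are assumptions for illustration; this is not the approach taken in the sub-issue itself.

```python
import re

# Small illustrative suffix table; a usable one would need to be much longer.
SUFFIXES = {"street": "st", "avenue": "ave", "boulevard": "blvd", "road": "rd", "drive": "dr"}


def normalize_address(address):
    """Lowercase, strip punctuation, and abbreviate common street suffixes."""
    cleaned = re.sub(r"[^\w\s]", "", address.lower())
    tokens = [SUFFIXES.get(tok, tok) for tok in cleaned.split()]
    return " ".join(tokens)


print(normalize_address("123 Main Street."))
print(normalize_address("123 main St"))
```

Two records whose normalized addresses match exactly would then be candidates for merging, even when the raw strings differ in case, punctuation, or suffix spelling.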

hellonewman commented 3 years ago

even more sub-issues!

57

50

60

58

maxachis commented 3 years ago

Consolidated this issue and sub-issues into the "Merge Duplicates Dataflow" milestone, so I'm closing this one just to clean things up and to make the little milestone progress bar go up a little more.