Closed: hellonewman closed this issue 3 years ago
It's important to note that many of the duplicates will have information that the other records lack. I think it would be best to consolidate that information across the duplicate records so we don't lose what's only present in some sources' fields.
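As a rough illustration of that kind of field-wise consolidation (a minimal sketch, assuming a pandas DataFrame where a hypothetical `dup_id` column already groups records judged to be the same place):

```python
import pandas as pd

# Hypothetical example rows; the column names are assumptions, not the real schema.
df = pd.DataFrame({
    "dup_id": [1, 1, 1],
    "name": ["Bloomfield Market", "Bloomfield Mkt", None],
    "SNAP": [None, True, None],
    "fresh-produce_healthy": [True, None, None],
})

# GroupBy.first() takes the first non-null value per column, so an attribute
# recorded by only one source still survives into the consolidated row.
consolidated = df.groupby("dup_id", as_index=False).first()
```

One open question with this sketch: when two sources disagree on a non-null value, `first()` silently prefers whichever record happens to come first.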
A good working example for trying either approach you've listed, Ellie, could be the Bloomfield Market. There are three entries for it right now, from three different sources. They differ subtly in name/address, and they differ in lat/long (though not enough to change the GEOID). To the second point, they also contain different info (SNAP, fresh-produce_healthy, etc. attributes) that we would want to maximize across records. Whatever the solution is to de-dup while maximizing info, this would be a good test case. If I have something before the Feb 5 meetup, I'll comment here.
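For concreteness, here's a minimal sketch of the kind of pairwise check that test case would exercise: fuzzy name similarity plus geographic proximity. The field names (`name`, `lat`, `lon`) and both thresholds are assumptions, not project conventions:

```python
from difflib import SequenceMatcher
from math import radians, sin, cos, asin, sqrt

def name_sim(a: str, b: str) -> float:
    # Case- and whitespace-insensitive similarity in [0, 1].
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def km_apart(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    # Haversine great-circle distance in kilometres.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def looks_like_dup(r1: dict, r2: dict, name_thresh: float = 0.85, max_km: float = 0.25) -> bool:
    # Flag a pair when the names are close AND the coordinates are near each other,
    # which should catch the subtle name/address and lat/long drift described above.
    return (name_sim(r1["name"], r2["name"]) >= name_thresh
            and km_apart(r1["lat"], r1["lon"], r2["lat"], r2["lon"]) <= max_km)
```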
I've created four sub-issues related to this:
even more sub-issues!
I've consolidated this issue and its sub-issues into the "Merge Duplicates Dataflow" milestone, so I'm closing this one just to clean things up and to make the little milestone progress bar go up a little more.
We need a scripted way to run through the merged data, flag duplicate entries, and remove them.
There are a couple of possible approaches:
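For example, one possible approach is an exact match on a normalized key; a minimal sketch, assuming the merged data is a CSV with `name` and `GEOID` columns (the file names and key columns are illustrative, not the project's actual layout):

```python
import pandas as pd

df = pd.read_csv("merged.csv")  # hypothetical input file

# Normalize the name so subtle spelling/punctuation differences key together.
df["_key"] = df["name"].str.lower().str.replace(r"[^a-z0-9]", "", regex=True)

# Flag every record after the first within each (normalized name, GEOID) group.
df["is_dup"] = df.duplicated(subset=["_key", "GEOID"], keep="first")

# Remove the flagged rows; consolidating fields first (as discussed above) would
# avoid losing attributes that only the dropped records carry.
deduped = df.loc[~df["is_dup"]].drop(columns=["_key", "is_dup"])
deduped.to_csv("merged_deduped.csv", index=False)
```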