JohnMcCambridge / CARES

CARES Act data: PPP, EIDL and more.
GNU General Public License v3.0

Duplicates #1

Open kbmorales opened 4 years ago

kbmorales commented 4 years ago

I'm worried this file repeats data already present in the individual state/territory folders.

JohnMcCambridge commented 4 years ago

Is this a merge issue or a data issue on their part? I presume a data issue, because by the definition of the files, the individual state/territory files should only contain loans up to $150k. Can you share some examples we can test?

kbmorales commented 4 years ago

I think you're right. I found around 4k duplicates in the data, but they don't seem to come from the 150k file.

JohnMcCambridge commented 4 years ago

If we just use `duplicated()`, I'm not convinced we would only be dropping true dupes, especially if applied to the <$150k files. It seems entirely possible that an entry in there could be identical to another row and still be legitimate (same loan amount, to a similar business in the same zip code, for example). Keeping in mind that many columns have NAs, it's quite easy for a false dupe to appear based on only the barest of detail (loan amount + zip code, basically).
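
A minimal sketch of what I mean (base R; rows and column names are hypothetical, the real files have more columns): two distinct <$150k loans whose populated fields happen to match would come back as duplicates, since `duplicated()` treats matching NAs as equal.

```r
# Two hypothetical, legitimately distinct loans that only differ in
# fields that are NA in the <150k files.
loans <- data.frame(
  LoanAmount   = c(20833, 20833),
  Zip          = c("10001", "10001"),
  BusinessName = c(NA, NA),   # not released for <150k loans
  NAICSCode    = c(NA, NA),
  JobsRetained = c(NA, NA)
)

duplicated(loans)
#> [1] FALSE  TRUE
# i.e. the second, legitimately distinct loan would be flagged as a dupe
```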

kbmorales commented 4 years ago

That's a good point, but the match would also be based on the other variables with high completeness (DateApproved, NAICSCode, Lender, and JobsRetained being the more information-rich ones), so I'm less concerned about them not being true duplicates.

While we can't confirm without the BusinessName, I personally feel fine about removing dupes with `duplicated()`, but I can retain them for now.
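
A rough sketch of that check (base R; `ppp` and the exact column names are placeholders, the key fields are the ones listed above):

```r
# Hypothetical: how many rows repeat across the high-completeness fields,
# and what do they look like?
key_cols <- c("LoanAmount", "Zip", "DateApproved", "NAICSCode",
              "Lender", "JobsRetained")

dupe <- duplicated(ppp[, key_cols])
sum(dupe)                        # rough count of candidate dupes

# inspect (rather than drop) the candidates, keeping them for now
head(ppp[dupe, key_cols])
```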

JohnMcCambridge commented 4 years ago

For the baseline script I would suggest we stick with a minimalist / 'do no harm' principle. We could then create an 'add-on' script that applies more maximalist interventions, such as row drops (rather than flagging problematic rows) and de-duping. That way users can decide which version to use when diving into the data. Roughly, the split could look like the sketch below.
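
A sketch of that split (base R; object and column names are placeholders, not the actual scripts):

```r
# Baseline script (minimalist / 'do no harm'): only flag suspect rows.
key_cols <- c("LoanAmount", "Zip", "DateApproved", "NAICSCode",
              "Lender", "JobsRetained")
ppp$dupe_flag <- duplicated(ppp[, key_cols])

# Add-on script (maximalist intervention): actually drop the flagged rows.
ppp_deduped <- ppp[!ppp$dupe_flag, ]
```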