Is this a merge issue or a data issue on their part? I'd presume a data issue, since by definition the individual state/territory files should only contain loans up to $150k. Can you share some examples we can test?
I think you're right. I found around 4k duplicates in the data, but they don't seem to come from the 150k file.
If we just use `duplicated()`, something like the sketch below would catch them:
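A minimal sketch, assuming the merged data lives in a data frame called `ppp_full` (hypothetical name):

```r
library(dplyr)

# duplicated() marks the second and later occurrences of an exact-copy row
n_dupes <- sum(duplicated(ppp_full))
n_dupes

# Pull every row involved in a duplicate group for manual inspection
dupe_rows <- ppp_full %>%
  group_by(across(everything())) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  arrange(across(everything()))
```

That would at least let us eyeball whether the ~4k matches look like genuine repeats.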
I'm not convinced we would only be dropping true dupes, especially if this is applied to the <150k files. It seems entirely possible that an entry there could be identical to another row yet still be legitimate (same loan amount, to a similar business in the same zip code, for example). Keeping in mind that many columns have NAs, it's quite easy for a false dupe to appear based on only the barest of detail (loan amount + zip code, basically).

3860cfa93f5e8348b0df95ebb710e1aa13c39b4a
That's a good point, but the duplicate check would also be based on the other variables with high completeness (DateApproved, NAICSCode, Lender, and JobsRetained being the more information-rich ones), so I'm less concerned about the flagged rows not being true duplicates.
While we can't confirm without the BusinessName, I personally feel fine about removing dupes with `duplicated()`, but I can retain them for now.
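To illustrate the trade-off, a rough sketch comparing exact-row duplicates against duplicates keyed only on those higher-completeness fields (`ppp_full` and the exact column names are assumptions, not the repo's actual schema):

```r
# Exact duplicates across every column
exact_dupes <- sum(duplicated(ppp_full))

# Duplicates judged only on the information-rich fields; the gap between this
# count and exact_dupes is how much the remaining columns narrow things down
key_cols <- c("DateApproved", "NAICSCode", "Lender", "JobsRetained", "Zip", "LoanAmount")
key_dupes <- sum(duplicated(ppp_full[, key_cols]))

c(exact = exact_dupes, keyed_only = key_dupes)
```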
For the baseline script I would suggest we stick with a minimalist / 'do no harm' principle. We could also create an 'add-on' script that applies more maximalist interventions, like row drops (rather than just flagging problematic rows) and de-duping. That way users can decide which version to use when data diving.
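In that spirit, a rough sketch of how the split could look, with the baseline only flagging exact-copy rows and an optional add-on actually dropping them (`ppp_full` is again an assumed name):

```r
library(dplyr)

# Baseline ('do no harm'): flag exact duplicate rows but keep every one of them
ppp_full$dupe_flag <- duplicated(ppp_full)

# Add-on (maximalist): drop the flagged rows for users who want a deduped table
ppp_deduped <- ppp_full %>%
  filter(!dupe_flag) %>%
  select(-dupe_flag)
```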
I am worried this file has repeats of data already present in the individual states/territories folders.
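If it helps track that down, one way to check for overlap between the 150k+ file and the combined state/territory files might look like this (the file paths, and the assumption that the two layouts share enough columns to join on, are mine; adjust to the repo's actual structure):

```r
library(dplyr)
library(readr)

# Hypothetical paths -- point these at wherever the raw CSVs actually live
over_150k  <- read_csv("data/150k_plus.csv")
under_150k <- list.files("data/states", pattern = "\\.csv$", full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()

# Rows in the state/territory files that also appear, on the columns the two
# layouts share, in the 150k+ file -- a nonzero count here would point to repeats
shared_cols <- intersect(names(over_150k), names(under_150k))
overlap <- semi_join(under_150k, over_150k, by = shared_cols)
nrow(overlap)
```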