data-liberation-project / aphis-inspection-reports

Inspection data and PDFs from the USDA's Animal and Plant Health Inspection Service.

Handle duplicate entries, using first-discovered #54

Closed jsvine closed 1 year ago

jsvine commented 1 year ago

It appears that the URLs for the inspection PDFs occasionally change in ways that result in different hash_ids for the same inspection. I can't quite tell what's causing this: it affects only a subset of the inspections, the dupe-discoveries are clustered on just a few days, and they do not appear to reflect changes to the inspections themselves.

Regardless of the root cause, the effect is ~780 duplicate entries in our combined files. These are relatively easy to identify, since they have the same customer number, inspection date, and inspection ID. At this point, I have not identified a method to filter these out "on the fly" as we're fetching them, since the inspection ID is available only in the parsed PDF. So this commit introduces a bit of logic to filter out the dupes when we're generating the files in data/combined.
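
For illustration, the filtering described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in this commit: the field names (`customer_number`, `inspection_date`, `inspection_id`, `discovered`, `hash_id`) and the helper `drop_duplicates_keep_first` are hypothetical, and the sample rows are made up.

```python
from typing import Iterable

# Assumed column names; the thread names the concepts, not the exact fields.
DEDUPE_KEY = ("customer_number", "inspection_date", "inspection_id")


def drop_duplicates_keep_first(rows: Iterable[dict]) -> list[dict]:
    """Keep one row per inspection, preferring the earliest `discovered` value."""
    kept: dict[tuple, dict] = {}
    # Timestamps in one consistent ISO-style format and UTC offset
    # sort chronologically when compared as strings.
    for row in sorted(rows, key=lambda r: r["discovered"]):
        key = tuple(row[col] for col in DEDUPE_KEY)
        # setdefault only stores the first row seen for each key.
        kept.setdefault(key, row)
    return list(kept.values())


# Made-up sample rows: "aaa" and "bbb" are the same inspection under two hash_ids.
rows = [
    {"hash_id": "bbb", "customer_number": "1", "inspection_date": "2023-01-05",
     "inspection_id": "X1", "discovered": "2023-03-02 14:05:13+00:00"},
    {"hash_id": "aaa", "customer_number": "1", "inspection_date": "2023-01-05",
     "inspection_id": "X1", "discovered": "2023-02-20 14:00:00+00:00"},
    {"hash_id": "ccc", "customer_number": "2", "inspection_date": "2023-01-09",
     "inspection_id": "X2", "discovered": "2023-02-21 14:00:00+00:00"},
]

deduped = drop_duplicates_keep_first(rows)
print([row["hash_id"] for row in deduped])
```

Because the key depends on the inspection ID, a filter like this can only run after the PDFs have been parsed, which is why it sits at the file-generation step rather than the fetch step.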

Here's a list of dupes, if of interest: dupes.csv

And here's when they were found:

| discovered | count |
| --- | --- |
| 2023-02-27 14:33:16+00:00 | 32 |
| 2023-02-27 18:14:59+00:00 | 43 |
| 2023-03-02 14:05:13+00:00 | 274 |
| 2023-03-02 18:15:34+00:00 | 68 |
| 2023-03-09 14:10:44+00:00 | 291 |
| 2023-03-09 18:15:29+00:00 | 73 |

Notably, all of these were discovered on just three days, roughly a week apart, with none in the past week.
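
As a quick check on the clustering, the counts in the table above can be rolled up by calendar day. This is only a sketch: the variable names are illustrative, and the numbers are copied from the table rather than recomputed from dupes.csv.

```python
from collections import Counter
from datetime import date, datetime

# Counts copied from the table above, keyed by discovery timestamp.
discovery_counts = {
    "2023-02-27 14:33:16+00:00": 32,
    "2023-02-27 18:14:59+00:00": 43,
    "2023-03-02 14:05:13+00:00": 274,
    "2023-03-02 18:15:34+00:00": 68,
    "2023-03-09 14:10:44+00:00": 291,
    "2023-03-09 18:15:29+00:00": 73,
}

# Sum the per-run counts into per-day totals.
per_day = Counter()
for stamp, count in discovery_counts.items():
    day = datetime.fromisoformat(stamp).date().isoformat()
    per_day[day] += count

total = sum(per_day.values())

# Number of days between consecutive discovery dates.
days = sorted(date.fromisoformat(d) for d in per_day)
gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]

print(dict(per_day))
print(total)
print(gaps)
```

The total comes to 781, which lines up with the ~780 figure above.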

jsvine commented 1 year ago

Other than integrating the latest data, ok to merge @palewire — or have qualms?

palewire commented 1 year ago

Let's get it done

jsvine commented 1 year ago

Super. Most recent data incorporated, force-repushed, and now merged. Thanks!