Handle duplicate entries, using first-discovered

jsvine commented 1 year ago

It appears that the URLs for the inspection PDFs occasionally change in ways that result in different hash_ids for the same inspection. I can't quite tell what's causing this; it only affects a subset of the inspections, the dupe-discoveries are clustered on just a few days, and they do not appear to reflect changes to the inspections themselves.

Regardless of the root cause, the effect is ~780 duplicate entries in our combined files. These are relatively easy to identify, since they have the same customer number, inspection date, and inspection ID. At this point, I have not identified a method to filter these out "on the fly" as we're fetching them, since the inspection ID is available only in the parsed PDF. So this commit introduces a bit of logic to filter out the dupes when we're generating the files in data/combined.
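
For illustration only, here is a minimal sketch of that kind of filter. It is not the code in this commit. The field names (`customer_number`, `inspection_date`, `inspection_id`, `discovered`, `hash_id`) and the function name are assumptions, and it treats each entry as a plain dict whose `discovered` value is a timestamp string in one consistent format:

```python
from __future__ import annotations

from typing import Iterable

# Hypothetical field names; the real combined files may name these differently.
DEDUPE_KEYS = ("customer_number", "inspection_date", "inspection_id")


def keep_first_discovered(rows: Iterable[dict]) -> list[dict]:
    """Keep one row per (customer, date, inspection ID): the earliest-discovered."""
    # Timestamps in a single consistent format (e.g. "2023-02-27 14:33:16+00:00")
    # sort chronologically as plain strings. sorted() is stable, so ties keep
    # their input order.
    ordered = sorted(rows, key=lambda row: row["discovered"])
    seen: set[tuple] = set()
    kept: list[dict] = []
    for row in ordered:
        key = tuple(row[k] for k in DEDUPE_KEYS)
        if key in seen:
            continue  # a later-discovered copy of an inspection we already have
        seen.add(key)
        kept.append(row)
    return kept
```

Because the key ignores `hash_id`, two entries for the same inspection collapse to the earlier one even though their URLs hash differently.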

Here's a list of dupes, if of interest: dupes.csv

And here's when they were found:

discovered	count
2023-02-27 14:33:16+00:00	32
2023-02-27 18:14:59+00:00	43
2023-03-02 14:05:13+00:00	274
2023-03-02 18:15:34+00:00	68
2023-03-09 14:10:44+00:00	291
2023-03-09 18:15:29+00:00	73

Notably: the dupes were discovered on just a few days, each roughly a week apart, and none in the past week.

jsvine commented 1 year ago

Other than integrating the latest data, ok to merge @palewire — or have qualms?

palewire commented 1 year ago

Let's get it done

jsvine commented 1 year ago

Super. Most recent data incorporated, force-repushed, and now merged. Thanks!

data-liberation-project / aphis-inspection-reports

Handle duplicate entries, using first-discovered #54