It appears that the URLs for the inspection PDFs occasionally change in ways that result in different `hash_id`s for the same inspection. I can't quite tell what's causing this; it only affects a subset of the inspections, the dupe discoveries are clustered on just a few days, and they don't appear to reflect changes to the inspections themselves.
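For illustration only, here's a minimal sketch of how this could happen, assuming `hash_id` is derived from the PDF URL (that derivation is an assumption on my part, as are the example URLs): any cosmetic change to the URL, such as a new query parameter, would produce a new `hash_id` for the same underlying PDF.

```python
import hashlib

# Hypothetical: hash_id assumed to be a digest of the PDF URL.
def hash_id(url: str) -> str:
    return hashlib.sha1(url.encode("utf-8")).hexdigest()

# Same inspection PDF, slightly different URL -> two different hash_ids.
print(hash_id("https://example.com/inspections/123.pdf"))
print(hash_id("https://example.com/inspections/123.pdf?v=2"))
```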
Regardless of the root cause, the effect is ~780 duplicate entries in our combined files. These are relatively easy to identify, since they have the same customer number, inspection date, and inspection ID. At this point, I have not identified a method to filter these out "on the fly" as we're fetching them, since the inspection ID is available only in the parsed PDF. So this commit introduces a bit of logic to filter out the dupes when we're generating the files in `data/combined`.
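As a rough sketch of that dedupe step (not the exact code in this commit), assuming the combined files are CSVs read with pandas and that the key columns are named `customer_number`, `inspection_date`, and `inspection_id` (hypothetical names; the real schema may differ):

```python
import pandas as pd

def dedupe_inspections(df: pd.DataFrame) -> pd.DataFrame:
    # Rows sharing the same customer number, inspection date, and
    # inspection ID are treated as duplicates; keep the first occurrence.
    key_cols = ["customer_number", "inspection_date", "inspection_id"]
    return df.drop_duplicates(subset=key_cols, keep="first")

# Hypothetical usage while regenerating a combined file.
combined = pd.read_csv("data/combined/inspections.csv")
dedupe_inspections(combined).to_csv("data/combined/inspections.csv", index=False)
```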
Here's a list of dupes, if of interest: dupes.csv
And here's when they were found:
Notably: on just a few days, roughly a week apart, but none in the past week.