data-liberation-project / aphis-inspection-reports

Inspection data and PDFs from the USDA's Animal and Plant Health Inspection Service.

Migrate hash_id after APHIS server change, using URL parameter rather than full URL #42

Closed jsvine closed 1 year ago

jsvine commented 1 year ago

Welp, #25 happened. Bad news, but also good news.

Bad news

Last week, it seems APHIS switched Salesforce servers, or at least the URLs from which it serves the inspection report PDFs. The most obvious change is that they're now serving from https://aphis--c.na107.content.force.com[...] instead of https://aphis--c.na21.content.force.com[...], although other components of the URL changed too.

Unfortunately, this broke (a) our deduplication strategy and (b) our hash_id file-naming scheme, both of which depended on those full URLs.

In the absence of a fix, we'll end up with a bunch of duplicated entries and PDFs.

Good news

As #25 suggested, this was likely to happen eventually, and perhaps it's better that it happened sooner rather than later.

Also: The APHIS change helps in the sense that it provided a much-needed clue to figuring out longer-term stable IDs for the web portal entries, since now we can compare before vs. after. And, indeed, there's one part of the URL that does seem to act as a unique ID: the &ids=... URL parameter. I had been reluctant to use it before, but now that we have comparison data, it looks like it's exactly what we need.
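To illustrate the before-vs-after comparison, here is a minimal sketch of pulling the `ids` parameter out of a report URL. It is not the code from this PR. The function name `extract_ids_param`, the URL paths, the `file` parameter, and the `ids` value are all invented for illustration; only the two hostnames come from the discussion above.

```python
from urllib.parse import urlparse, parse_qs


def extract_ids_param(url: str) -> str:
    """Return the value of the `ids` query parameter, or raise if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("ids")
    if not values:
        raise ValueError(f"No `ids` parameter in URL: {url}")
    return values[0]


# Hypothetical URLs: the paths and parameter values are made up.
old_url = "https://aphis--c.na21.content.force.com/hypothetical/path?file=a1&ids=EXAMPLE123"
new_url = "https://aphis--c.na107.content.force.com/other/path?file=b2&ids=EXAMPLE123"

# The full URLs differ, but the `ids` value is the same for the same report.
same_report = extract_ids_param(old_url) == extract_ids_param(new_url)
```

If the parameter behaves as a stable ID, `same_report` is true for any pair of old and new URLs that point at the same inspection report, even though the URLs themselves no longer match.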

So ...

This PR uses that parameter (instead of the full report PDF URL) as the basis for the hash digest, and then migrates all the old hash_ids to new ones. I expected some gnarliness, but it actually seems to have worked quite cleanly: It successfully deduplicates all the accumulated dupes, doesn't lose any non-duplicate PDFs, and rerunning the full processing pipeline (minus new fetches) seems to work smoothly. Time will tell, of course.
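As a rough sketch of the idea (not the actual implementation in 3c90d3e), the migration can be thought of as a mapping from each old full-URL digest to a new digest based only on the `ids` value. The choice of SHA-1, the 16-character truncation, the function names, and the example URLs are assumptions made for illustration; the project's real digest scheme may differ.

```python
import hashlib
from urllib.parse import urlparse, parse_qs

DIGEST_LENGTH = 16  # assumed truncation length, not confirmed by the source


def digest(text: str) -> str:
    """Truncated SHA-1 hex digest (the algorithm is an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:DIGEST_LENGTH]


def old_hash_id(url: str) -> str:
    """Old scheme: hash the full report URL."""
    return digest(url)


def new_hash_id(url: str) -> str:
    """New scheme: hash only the `ids` query parameter."""
    ids_value = parse_qs(urlparse(url).query)["ids"][0]
    return digest(ids_value)


def build_migration(urls: list[str]) -> dict[str, str]:
    """Map each old hash_id to its replacement."""
    return {old_hash_id(url): new_hash_id(url) for url in urls}


# Hypothetical URLs for one report, fetched before and after the server change.
urls = [
    "https://aphis--c.na21.content.force.com/hypothetical/path?file=a1&ids=EXAMPLE123",
    "https://aphis--c.na107.content.force.com/other/path?file=b2&ids=EXAMPLE123",
]
migration = build_migration(urls)

# Two old IDs collapse into one new ID, which is what removes the duplicate.
distinct_new_ids = set(migration.values())
```

Under these assumptions, renaming files and rewriting records by way of `migration` merges entries that differ only in their server URL, while reports with different `ids` values keep separate hash_ids.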

FYI, pausing GitHub Actions while we sort this out.

jsvine commented 1 year ago

p.s., This PR is divided into two commits — one for the changes in logic (3c90d3e), the other for the effect (e804046).

palewire commented 1 year ago

Seems like a smart fix to me.
