Welp, #25 happened. Bad news, but also good news.
Bad news
Last week, it seems APHIS switched Salesforce servers, or at least the URLs from which it's serving the inspection report PDFs. The most obvious change is that they're serving from https://aphis--c.na107.content.force.com[...] instead of https://aphis--c.na21.content.force.com[...], although other components of the URL changed too.
Unfortunately, this broke (a) our deduplication strategy and (b) our hash_id file-naming scheme, both of which depended on those full URLs.
In the absence of a fix, we'll end up with a bunch of duplicated entries and PDFs.
Good news
As #25 suggested, this was likely to happen eventually, and perhaps it's better that it happened sooner rather than later.
Also: The APHIS change helps in the sense that it provided a much-needed clue to figuring out longer-term stable IDs for the web portal entries, since now we can compare before vs. after. And, indeed, there's one part of the URL that does seem to act as a unique ID: the &ids=... URL parameter. I had been reluctant to use it before, but now that we have comparison data, it looks like it's exactly what we need.
So ...
This PR uses that parameter (instead of the full report PDF URL) as the basis for the hash digest, and then migrates all the old hash_ids to new ones. I expected some gnarliness, but it actually seems to have worked quite cleanly: It successfully deduplicates all the accumulated dupes, doesn't lose any non-duplicate PDFs, and rerunning the full processing pipeline (minus new fetches) seems to work smoothly. Time will tell, of course.
FYI, pausing GitHub Actions while we sort this out.
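For anyone skimming, here's a rough sketch of the idea behind the change, not the actual code in this PR. It hashes only the ids query parameter, so two URLs that differ in host or other parameters still map to the same ID, and it groups old hash_ids by the new ID they collapse into. The function names, the choice of SHA-1, the 16-character truncation, and the example URLs are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from urllib.parse import parse_qs, urlparse


def stable_hash_id(report_url: str, length: int = 16) -> str:
    """Derive an ID from the `ids` query parameter instead of the full URL.

    Hypothetical helper: SHA-1 and the truncation length are assumptions.
    """
    query = parse_qs(urlparse(report_url).query)
    values = query.get("ids")
    if not values:
        raise ValueError(f"No `ids` parameter in URL: {report_url}")
    digest = hashlib.sha1(values[0].encode("utf-8")).hexdigest()
    return digest[:length]


def plan_migration(old_records: dict) -> dict:
    """Map each new hash_id to the list of old hash_ids that collapse into it.

    `old_records` maps an old hash_id to the report URL it was derived from.
    Any new hash_id with more than one old hash_id marks a duplicate.
    """
    groups = defaultdict(list)
    for old_hash_id, url in old_records.items():
        groups[stable_hash_id(url)].append(old_hash_id)
    return dict(groups)


# Placeholder URLs (not real APHIS URLs): same `ids`, different host and params.
OLD_URL = "https://old.example.invalid/download?oid=AAA&ids=068EXAMPLE&d=%2Fa"
NEW_URL = "https://new.example.invalid/download?oid=BBB&ids=068EXAMPLE&d=%2Fb"
```

With the placeholder URLs above, stable_hash_id(OLD_URL) and stable_hash_id(NEW_URL) return the same value, and plan_migration({"old1": OLD_URL, "old2": NEW_URL}) yields a single group containing both old IDs.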