data-liberation-project / aphis-inspection-reports

Inspection data and PDFs from the USDA's Animal and Plant Health Inspection Service.

Anticipate potential future changes to inspection report URLs #25

Closed jsvine closed 1 year ago

jsvine commented 1 year ago

Much of the scraper's current logic depends on the reportLink associated with a given inspection remaining stable over time, largely because the data provided through the APHIS portal does not include the inspection ID. But there's no guarantee that those PDF URLs will stay the same in the future.

There are higher-priority things to sort out at the moment, but once the repository stabilizes, I think it's worth figuring out a more future-proof approach. One option would be to download and parse the PDFs on the fly (caching the result for each URL, so that this step is only required once per URL), extract the actual inspection IDs from the documents, and then use those IDs as the filename/ID tied to each report throughout the rest of the pipeline.
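For illustration only, here is a minimal Python sketch of the per-URL caching idea described above. It is not the repository's actual implementation. The names `InspectionIdCache`, `extract_inspection_id`, and `fetch_text` are hypothetical, and the regular expression assumes the PDF text contains a line such as "Insp_id 12345", which would need to be checked against real reports. The download-and-parse step is passed in as a function so the caching logic stays independent of any particular PDF library.

```python
import json
import re
from pathlib import Path
from typing import Callable, Dict, Optional

# ASSUMPTION: the inspection ID appears in the extracted PDF text in a form
# like "Insp_id 12345". Adjust the pattern to match the real documents.
INSP_ID_PATTERN = re.compile(r"Insp_id\s+(\d+)")


def extract_inspection_id(pdf_text: str) -> Optional[str]:
    """Return the first inspection ID found in the text, or None."""
    match = INSP_ID_PATTERN.search(pdf_text)
    return match.group(1) if match else None


class InspectionIdCache:
    """Maps report URLs to inspection IDs, parsing each URL at most once."""

    def __init__(self, cache_path: Path, fetch_text: Callable[[str], str]):
        # fetch_text is a hypothetical callable that downloads the PDF at a
        # URL and returns its extracted text.
        self.cache_path = cache_path
        self.fetch_text = fetch_text
        self.entries: Dict[str, Optional[str]] = {}
        if cache_path.exists():
            self.entries = json.loads(cache_path.read_text())

    def get(self, url: str) -> Optional[str]:
        """Return the inspection ID for a URL, fetching only on a cache miss."""
        if url not in self.entries:
            text = self.fetch_text(url)
            self.entries[url] = extract_inspection_id(text)
            # Persist after every miss so an interrupted run loses no work.
            self.cache_path.write_text(json.dumps(self.entries, indent=2))
        return self.entries[url]
```

With a cache like this, the rest of the pipeline could key files on the value returned by `get(url)` rather than on the URL itself. If a report's URL changed, the report would be re-parsed under its new URL but would still resolve to the same ID. Note that this sketch also caches a failed extraction (as `None`), so a URL whose PDF yields no ID is not retried automatically.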

jsvine commented 1 year ago

Closed via #42