Much of the scraper's current logic depends on the `reportLink` associated with a given inspection remaining stable over time, largely because the data provided through the APHIS portal does not include the inspection ID. But there's no guarantee that those PDF URLs will remain the same in the future.
There are higher-priority things to sort out at the moment, but once the repository stabilizes, I think it's worth figuring out a more future-proof approach. One might be to download and parse the PDFs on the fly (caching the result for each URL, so that this step is only required once per URL), extract the actual inspection IDs from the documents, and then use those IDs as the filenames/identifiers tied to the reports throughout the rest of the pipeline; see the sketch below.
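As a rough sketch of what that could look like, assuming Python with `requests` and `pdfplumber` (neither confirmed as a project dependency), a JSON file as the cache, and a made-up regex for the ID, since I haven't checked the actual format printed in the reports:

```python
import io
import json
import pathlib
import re

import pdfplumber  # assumed PDF-parsing dependency
import requests

CACHE_PATH = pathlib.Path("cache/inspection-ids.json")  # hypothetical location

# Hypothetical pattern; adjust to however the ID actually appears in the PDFs.
INSPECTION_ID_RE = re.compile(r"Inspection\s*(?:ID|Report)\D*(\d+)", re.IGNORECASE)


def load_cache() -> dict:
    """Read the URL -> inspection-ID map, or start fresh if none exists."""
    if CACHE_PATH.exists():
        return json.loads(CACHE_PATH.read_text())
    return {}


def save_cache(cache: dict) -> None:
    """Persist the map so each URL only ever needs to be fetched once."""
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    CACHE_PATH.write_text(json.dumps(cache, indent=2, sort_keys=True))


def inspection_id_for(report_link: str, cache: dict) -> str:
    """Return the inspection ID for a reportLink, downloading/parsing only on a cache miss."""
    if report_link in cache:
        return cache[report_link]

    resp = requests.get(report_link, timeout=60)
    resp.raise_for_status()

    # Extract text from the first page, where the ID presumably appears.
    with pdfplumber.open(io.BytesIO(resp.content)) as pdf:
        text = pdf.pages[0].extract_text() or ""

    match = INSPECTION_ID_RE.search(text)
    if match is None:
        raise ValueError(f"Could not find an inspection ID in {report_link}")

    cache[report_link] = match.group(1)
    save_cache(cache)
    return cache[report_link]
```

With something like this in place, the rest of the pipeline could key everything off `inspection_id_for(row["reportLink"], cache)` instead of the URL itself, so a future change to APHIS's PDF URLs would only invalidate the cache, not the report identifiers.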