data-liberation-project / aphis-inspection-reports

Inspection data and PDFs from the USDA's Animal and Plant Health Inspection Service.

Add weekly job to fully refresh the listings #48

Closed jsvine closed 1 year ago

jsvine commented 1 year ago

Due to the constraints of the APHIS portal, this repository uses two approaches to fetch new inspection listings:

1) Fetch the 2,000 most recent public inspections, the maximum the database returns for any given filter. This approach is fast and works well, with one drawback: it cannot see newly added inspections that are older than the 2,000th-most-recent one. That usually doesn't matter, because we run this refresh daily, but occasionally APHIS posts an old inspection we haven't seen before.

2) Refetch the entire listing of inspections. This approach is convoluted, due to the APHIS portal's constraints. It involves iterating over progressively longer search substrings ("a", "aa", "ab", etc.) until each query returns fewer than 2,000 results. It takes much longer, and only occasionally finds an inspection that the other approach missed.
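For readers unfamiliar with approach (2), here is a minimal sketch of the idea, not the repository's actual code. The `search` callable, the `"id"` key, the lowercase-only alphabet, and the `max_depth` guard are all assumptions made for illustration; the real portal query and record shape may differ.

```python
import string
from typing import Callable, Dict, List

LIMIT = 2000  # per-query cap described above
ALPHABET = string.ascii_lowercase  # assumption: a real crawl may also need digits, spaces, etc.


def crawl_listings(
    search: Callable[[str], List[dict]],
    limit: int = LIMIT,
    alphabet: str = ALPHABET,
    max_depth: int = 5,
) -> Dict[str, dict]:
    """Collect records by lengthening search terms until queries fall under the cap.

    `search` is a hypothetical stand-in for one portal query; each record is
    assumed to carry a unique "id" key.
    """
    found: Dict[str, dict] = {}
    pending = list(alphabet)  # start with "a", "b", ...
    while pending:
        term = pending.pop()
        results = search(term)
        for row in results:
            found[row["id"]] = row  # de-duplicate across overlapping queries
        # A full page may have been truncated, so split the term into "aa", "ab", ...
        if len(results) >= limit and len(term) < max_depth:
            pending.extend(term + ch for ch in alphabet)
    return found
```

Results from capped queries are still recorded before the term is split, so nothing already returned is thrown away. A record can still be missed if the query for its full matching text is itself capped, or if `max_depth` is reached first.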

So far, the GitHub Actions workflows have only been running (1). This commit adds a new workflow for (2) and has it run weekly on Sundays at 12:00 GMT.

Note: The new workflow only refreshes the listings and does not run the other parts of the processing, to avoid having those other steps defined in two different places. But I'm open to a difference of opinion on that!

palewire commented 1 year ago

I like it.