biglocalnews / warn-scraper

Command-line interface for downloading WARN Act notices of qualified plant closings and mass layoffs from state government websites
https://warn-scraper.readthedocs.io
Apache License 2.0

Update ID scraper to use state's new URL #644

Open chriszs opened 3 months ago

chriszs commented 3 months ago

Idaho moved its WARN PDF from https://www.labor.idaho.gov/dnn/Portals/0/Publications/WARNNotice.pdf to https://www.labor.idaho.gov/wp-content/uploads/publications/WARNNotice.pdf. The old URL redirects to the new one and the scraper follows that redirect transparently, so nothing is broken, but it seems like good policy to update the URL to reflect the current location.

chriszs commented 3 months ago

One note here: the state's page linking to this file actually links to https://www.labor.idaho.gov/warnnotice/, which redirects to the PDF, with a note that says, parenthetically, "link is updated as notices are received." That reads to me like the file is updated continuously, but it could also mean they change the link on a semi-regular basis. So, we have a few options:

  1. retrieve the file at the current URL of the PDF
  2. retain the current behavior and rely on the redirect from the file's old URL
  3. rely on the /warnnotice/ redirect
  4. scrape the HTML page to know which URL to check

I think it's probably a crapshoot, but the simplest thing to do to improve the situation might be option 1.
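
For what it's worth, options 1 through 3 don't have to be mutually exclusive. Here is a rough sketch of trying the URLs in order and taking the first one that returns something that looks like a PDF. This is standalone standard-library code, not the project's actual helpers; `fetch`, `first_pdf`, and `CANDIDATE_URLS` are hypothetical names, and checking for a leading `%PDF` is only a heuristic.

```python
from typing import Callable, Iterable, Optional, Tuple
from urllib.request import Request, urlopen

# URLs taken from this issue, ordered: option 1, option 3, option 2.
CANDIDATE_URLS = [
    "https://www.labor.idaho.gov/wp-content/uploads/publications/WARNNotice.pdf",
    "https://www.labor.idaho.gov/warnnotice/",
    "https://www.labor.idaho.gov/dnn/Portals/0/Publications/WARNNotice.pdf",
]


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL; urlopen follows HTTP redirects by default."""
    # The User-Agent value is an assumption; some sites reject the default one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def first_pdf(
    urls: Iterable[str],
    fetcher: Callable[[str], bytes] = fetch,
) -> Optional[Tuple[str, bytes]]:
    """Return (url, content) for the first URL that yields PDF-looking bytes."""
    for url in urls:
        try:
            content = fetcher(url)
        except OSError:
            # URLError, HTTPError, and timeouts all subclass OSError.
            continue
        # Heuristic: PDF files normally begin with the "%PDF" header.
        if content.startswith(b"%PDF"):
            return url, content
    return None
```

Calling `first_pdf(CANDIDATE_URLS)` would return the winning URL along with the bytes, so a log line could flag when the primary URL stops working. The fetcher is passed in as a parameter so the fallback logic can be tested without hitting the state's site.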