biglocalnews / warn-scraper

Command-line interface for downloading WARN Act notices of qualified plant closings and mass layoffs from state government websites
https://warn-scraper.readthedocs.io
Apache License 2.0

Fix TN scraper #649

stucka closed this 4 months ago

stucka commented 4 months ago

This needs some re-engineering: three formats are in play, and the overhead of maintaining all of them serves no practical purpose. The current scraper parses an older PDF and also scrapes data from paragraph tags, the format used through 2023.

The new format is marked up by JavaScript in the browser, which makes parsing a bit more annoying, but it uses tables and rows. This shouldn't be difficult to finish.
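As a rough illustration of the table-and-row parsing, here is a minimal sketch using only the standard library. It assumes the `<tr>`/`<td>` markup is present in the HTML string handed to it (if JavaScript builds the table in the browser, the rendered HTML would have to be obtained some other way). `TableRowParser` and `parse_table_rows` are hypothetical names, not part of warn-scraper.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace in the cell
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table_rows(html):
    """Return a list of rows, each a list of cleaned cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

The first row returned would typically be the header, if the page uses `<th>` cells; that is an assumption about the page, not something confirmed here.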

Archived copies of the PDF and HTML file should be zipped up and stored in the appropriate BLN bucket. Each scraper run should then download the already-parsed CSVs from a previously successful run instead of re-parsing the older formats.
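One possible shape for combining the already-parsed historical CSV with freshly scraped rows, as a sketch only. It assumes the historical CSV has already been downloaded into a string and that its first row is the header; `combine_with_history` is a hypothetical helper, and the download step is deliberately left out because the bucket location isn't specified here.

```python
import csv
import io


def combine_with_history(historical_csv_text, new_rows):
    """Append new rows to historical rows, skipping exact duplicates."""
    reader = csv.reader(io.StringIO(historical_csv_text))
    history = [row for row in reader if row]  # drop blank lines
    if not history:
        return list(new_rows)

    header, old_rows = history[0], history[1:]
    seen = {tuple(row) for row in old_rows}
    combined = [header] + old_rows
    for row in new_rows:
        key = tuple(row)
        # Skip a repeated header row and anything already in the archive
        if row == header or key in seen:
            continue
        seen.add(key)
        combined.append(row)
    return combined
```

Deduplicating on the whole row is a simplifying assumption; a real run might need a narrower key if the state revises notices after posting them.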

The scraper can drop the PDF dependency after that.