biglocalnews / warn-scraper

Command-line interface for downloading WARN Act notices of qualified plant closings and mass layoffs from state government websites
https://warn-scraper.readthedocs.io
Apache License 2.0

WIP PA scraper #628

Closed chriszs closed 2 months ago

chriszs commented 8 months ago

Draft PR to add a scraper for Pennsylvania.

Incorporates and builds upon @Ash1R's fixes and @stucka's edits from #517 by cherry-picking their commits.

Steps to test

`python -m warn.cli PA`

Closes #374

chriszs commented 8 months ago

Still some data quality issues to resolve:

[Two screenshots from 2024-03-10 showing the data quality issues]

But we're inching closer.

stucka commented 8 months ago

The HTML parser trashes the p tags, and I'm wondering if that might be contributing to some of the problems here: https://www.dli.pa.gov/Individuals/Workforce-Development/warn/notices/Pages/April-2020.aspx

In April 2020, for example, the final p tag (which has some additional markup) contains the number of layoffs and such, while the earlier p tags contain the individual locations. Parsing those as distinct entities may make it easier to handle the fields separately.

There's a tactical question here about how to handle notices with multiple locations but a single group summary, particularly for the number of layoffs. Perhaps keep the group as one line in the CSV; or perhaps go one-location-per-row, prefix "GROUP: " to the parsed layoff figure, and clean up the text in the transformer. A rough sketch of the second option follows.
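For what it's worth, a minimal sketch of the one-location-per-row idea, assuming each notice is a run of p tags where the last one carries the shared layoff summary. `parse_notice` and the field names are illustrative, not the scraper's actual structure:

```python
from bs4 import BeautifulSoup


def parse_notice(html: str) -> list[dict]:
    """Split one notice's p tags into locations plus a trailing group summary."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    paragraphs = [text for text in paragraphs if text]
    if not paragraphs:
        return []
    # Assumed page layout: every paragraph but the last is a location.
    *locations, summary = paragraphs
    if not locations:
        # A single-paragraph notice has no separate group summary.
        return [{"location": summary, "layoffs": ""}]
    return [
        {
            "location": location,
            # One row per location, with the shared summary prefixed so a
            # transformer can strip or redistribute it later.
            "layoffs": f"GROUP: {summary}",
        }
        for location in locations
    ]
```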

stucka commented 8 months ago

There's also a new (and probably terribly conceived) function, utils.fetch_if_not_cached, that might make sense to use for all but the three newest URLs, so we're not hitting dozens of quite old files several times a day. If adapted into the existing workflow, you'd fetch the three newest each time and add their content to output_rows; for the others, determine the filename and URL, then run utils.fetch_if_not_cached and cache.read to get the content into output_rows. There'd be a lot less in motion on repeat runs. Something like the sketch below.
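Roughly what I mean, with the fetch_if_not_cached argument order, the Cache defaults, and the get_url helper all assumed here rather than checked against the actual modules:

```python
from warn import utils
from warn.cache import Cache


def parse_notices(html: str) -> list[dict]:
    """Placeholder for the real page parser."""
    return []


def scrape_pages(page_list: list[tuple[str, str]]) -> list[dict]:
    """Hypothetical fetch loop; page_list is (url, filename) pairs, newest first."""
    cache = Cache()  # assumes the default cache location
    output_rows = []

    # Always re-fetch the three newest monthly pages so new filings appear.
    for url, filename in page_list[:3]:
        html = utils.get_url(url).text  # assumed fetch helper in warn.utils
        cache.write(filename, html)
        output_rows.extend(parse_notices(html))

    # Older pages rarely change, so skip the download when a copy is cached.
    for url, filename in page_list[3:]:
        utils.fetch_if_not_cached(filename, url)  # argument order assumed
        output_rows.extend(parse_notices(cache.read(filename)))

    return output_rows
```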

stucka commented 8 months ago

I tweaked the file handling a little (e.g., March 2024 would never have been redownloaded, and cached files were getting rewritten) but nothing else. I have not looked closely at the parsing.

chriszs commented 8 months ago

Thanks! It's not ready for review yet.

chriszs commented 2 months ago

Closing again because I haven't had a chance to work on this.