chriszs closed this 2 months ago
Still some data quality issues to resolve, but we're inching closer.
The HTML parser trashes the `p` tags, and I'm wondering if that might be contributing to some of the problems here.
https://www.dli.pa.gov/Individuals/Workforce-Development/warn/notices/Pages/April-2020.aspx
In April 2020, for example, I see the final `p` tag (with some additional markup) contains the number of layoffs and such, while the earlier `p` tags contain the individual locations. Parsing those as distinct entities may make it easier to handle the fields distinctly.
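To make that concrete, here's a minimal sketch of treating each `p` tag as its own entity, assuming BeautifulSoup and the structure described above (earlier `p` tags hold individual locations, the final `p` holds the layoff count and other summary markup). The function name and return shape are just illustrations, not the scraper's actual code.

```python
from bs4 import BeautifulSoup


def parse_notice(html: str) -> dict:
    """Split a notice page into per-location paragraphs and a summary paragraph."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    paragraphs = [text for text in paragraphs if text]
    if not paragraphs:
        return {"locations": [], "summary": None}
    # Treat the final <p> as the group summary (layoff counts, dates, etc.)
    # and everything before it as individual locations.
    return {"locations": paragraphs[:-1], "summary": paragraphs[-1]}
```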
There's a tactical question here about how to handle cases where there are multiple locations but a single group summary, particularly for the number of layoffs. (Perhaps the group becomes one line in the CSV; perhaps one location per row, with "GROUP: " prefixed to the parsed layoff figure and the text cleaned up later in the transformer? See the sketch below.)
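For the one-location-per-row option, a rough sketch of the "GROUP: " prefix idea might look like this; the `expand_group` helper and the field names are hypothetical, not part of the existing scraper.

```python
def expand_group(locations, summary):
    """Emit one row per location, flagging the shared group summary."""
    rows = []
    for location in locations:
        rows.append(
            {
                "location": location,
                # Prefix the shared layoff text so the transformer can spot
                # (and later clean up) rows that came from a group summary.
                "jobs": f"GROUP: {summary}",
            }
        )
    return rows


# e.g. expand_group(["Erie plant", "Scranton plant"], "250 layoffs")
# yields two rows, each with jobs == "GROUP: 250 layoffs"
```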
There's also a new (and probably terribly conceived) function, `utils.fetch_if_not_cached`, that might make sense to use for all but the three newest URLs, so we're not hitting dozens of quite old files several times a day. If adapted into the existing workflow, you'd fetch the three newest each time and add their content to `output_rows`; for everything outside the three newest, you'd determine the filename and URL, then run `utils.fetch_if_not_cached` and `cache.read` to get the content into `output_rows`. There'd be a lot less in motion on repeat runs.
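Here's a hedged sketch of that repeat-run idea: always refetch the three newest monthly pages and fall back to the cache for everything older. The call signatures for `utils.fetch_if_not_cached` and `cache.read` are assumptions based on their names, and `url_to_filename` and `parse_page` are hypothetical helpers, so treat this as the shape of the workflow rather than working code.

```python
import requests


def scrape_pages(page_urls, cache, utils, parse_page, url_to_filename):
    """Build output_rows, only downloading older pages when they aren't cached."""
    output_rows = []
    newest, older = page_urls[:3], page_urls[3:]

    # Always download the three newest pages, since they may still change.
    for url in newest:
        html = requests.get(url).text
        output_rows.extend(parse_page(html))

    # Older pages rarely change, so only hit the network when they're missing
    # from the cache, then read the cached copy back.
    for url in older:
        filename = url_to_filename(url)
        utils.fetch_if_not_cached(filename, url)  # assumed signature
        html = cache.read(filename)  # assumed signature
        output_rows.extend(parse_page(html))

    return output_rows
```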
I tweaked the filing handling a little (e.g., March 2024 would never have been redownloaded, and cached files were getting rewritten) but nothing else. I have not looked closely at the parsing.
Thanks! It's not ready for review yet.
Closing again because I haven't had a chance to work on this.
Draft PR to add a scraper for Pennsylvania.
Incorporates and builds upon @Ash1R's fixes and @stucka's edits from #517 by cherry-picking their commits.
Steps to test
Closes #374