biglocalnews / warn-scraper

Command-line interface for downloading WARN Act notices of qualified plant closings and mass layoffs from state government websites
https://warn-scraper.readthedocs.io
Apache License 2.0

Ms scraper #373 #508

Open Ash1R opened 1 year ago

Ash1R commented 1 year ago

This is for issue #373, to add a scraper for Mississippi (never spelt that one wrong...).

Works correctly, but around 8 rows have two of the values switched, all for the same reason. Should I fix that or leave it for downstream?

Ash1R commented 1 year ago

My bad, done! I copied the current mi.py code.

stucka commented 1 year ago

Triggering tests by closing and reopening.

stucka commented 11 months ago

OK, so for the record I've done some terrible things to @Ash1R's draft, and hope to do more soon and get this into production.

To-do:

stucka commented 10 months ago

@Ash1R, I've got a bunch more validation in the scraper. I incorporated the fixes made by @jsvine but then had to go farther off the reservation to patch an even weirder PDF. Still need to set up some of the historical data but first need to do some validation of the CSV. Looks like it picked up about 30 more rows than you were getting, which is ... weird.

ms.csv

stucka commented 10 months ago

Seeing some data integrity problems with edge cases that bump up against the logic of "every other row has the layoff number" kind of thing. A good example: https://mdes.ms.gov/media/26893/PY2011_Q1_WARN_July2011_Sep2011.pdf

Another way to handle this might be to split the rows up into sections (e.g., every section must have a "/" in the first cell of the first row, to show a date). That's likely overkill.
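For reference, a rough sketch of what that section-splitting idea could look like. This is not what the scraper currently does; the function name `split_into_sections` and the sample rows are made up for illustration, and it assumes each extracted row is a list of cell strings.

```python
from typing import List

Row = List[str]


def split_into_sections(rows: List[Row]) -> List[List[Row]]:
    """Group rows into sections, starting a new section at each date row.

    A "date row" here is any row whose first cell contains "/",
    which is an assumption about how the PDFs lay out notice dates.
    """
    sections: List[List[Row]] = []
    for row in rows:
        first_cell = row[0] if row else ""
        if "/" in first_cell:
            # A date in the first cell marks the start of a new notice.
            sections.append([row])
        elif sections:
            # Continuation row: attach it to the most recent notice.
            sections[-1].append(row)
        # Rows before the first date row (e.g., a header) are skipped.
    return sections


# Hypothetical rows, not taken from any real PDF.
rows = [
    ["Date", "Company", "County"],
    ["7/5/2011", "Acme Widgets", "Hinds"],
    ["", "120", ""],
    ["", "Jackson, MS", ""],
    ["8/1/2011", "Example Mills", "Lee"],
    ["", "45", ""],
]
sections = split_into_sections(rows)
```

With sections like these, each notice could be validated on its own (for example, checking that it contains exactly one layoff number) instead of relying on row position.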

stucka commented 9 months ago

The PDF parsing is still failing in some interesting ways. I tried to get the historical data cleaned up but found most of a page missing, e.g., 152801_py2018_q4_warn_apr2019_jun2019.pdf

I tweaked a couple things in the Python to try to improve logging and readability in the output, but it does not affect the substance, only the sort order.

Somewhat patched CSV: ms.csv

Note pages set to "manual," which I only started after patching some in 2013-2015.