biglocalnews / warn-scraper

Command-line interface for downloading WARN Act notices of qualified plant closings and mass layoffs from state government websites
https://warn-scraper.readthedocs.io
Apache License 2.0

MO QA needed, with optimization potential #606

Open stucka opened 8 months ago

stucka commented 8 months ago

There may be an undocumented endpoint in Missouri that allows all years to be scraped on a single hit: https://jobs.mo.gov/warn/all

This would need a modicum of testing to ensure we're getting identical output to the per-year scrapes. Hitting this one endpoint, instead of every per-year page on every run, might reduce the chance we get snared by the anti-abuse systems that @kirkman flagged in #597.
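A minimal sketch of how that identical-output check might look, assuming both scrapes have already been parsed into lists of dicts. The field names and sample rows are hypothetical, not the scraper's actual schema, and `normalize` and `diff_scrapes` are illustrative helpers rather than anything in warn-scraper:

```python
from collections import Counter


def normalize(row):
    """Return a hashable version of one notice row, ignoring key case and stray whitespace."""
    return tuple(sorted((k.strip().lower(), str(v).strip()) for k, v in row.items()))


def diff_scrapes(per_year_rows, all_rows):
    """Compare two scrapes as multisets and return the rows unique to each side."""
    per_year = Counter(normalize(r) for r in per_year_rows)
    combined = Counter(normalize(r) for r in all_rows)
    return per_year - combined, combined - per_year


# Hypothetical sample rows standing in for real scrape output.
per_year_rows = [
    {"company": "Acme", "date": "2019-01-02", "layoffs": "50"},
    {"company": "Globex", "date": "2020-03-04", "layoffs": "120"},
]
all_rows = [
    {"company": "Acme ", "date": "2019-01-02", "layoffs": "50"},
    {"company": "Initech", "date": "2021-05-06", "layoffs": "75"},
]

only_per_year, only_all = diff_scrapes(per_year_rows, all_rows)
print("Only in per-year scrape:", sum(only_per_year.values()))
print("Only in /all scrape:", sum(only_all.values()))
```

Using `Counter` rather than `set` keeps repeated rows visible, which matters if one source lists the same notice more than once.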

stucka commented 8 months ago

Endpoint shows 49,397 layoffs from 2019.

BLN Missouri file (which may include things not scraped) shows 72,761 total, per Excel.

This is a great opportunity for some extra QA!

stucka commented 8 months ago

QA needed.

BLN version seems to show 364 entries, including combined rows for at least some of the revision entries.

/all endpoint seems to show 327 entries with separate rows for at least some of the revision entries.

stucka commented 8 months ago

Flagging @kirkman instead of the other person I flagged by accident. I need sleep.

cephillips commented 8 months ago

Couldn’t a lot of that be amendments?

(Sent by email on Jan 31, 2024, in reply to Mike Stucka's comment comparing the endpoint's 49,397 layoffs with the BLN file's 72,761 total.)

stucka commented 8 months ago

Lotsa duplicates for some reason in the BLN data. If I drop the obvious duplicates I get back to 52,379 layoffs among 256 entries, so it's close to the state's sheet but not quite there.
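For anyone repeating this check, a rough sketch of the dedupe-and-total step, assuming "obvious duplicates" means rows that match exactly once whitespace is stripped. The `layoffs` field name and the sample rows are hypothetical:

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row, comparing all fields after stripping whitespace."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted((k, str(v).strip()) for k, v in row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def total_layoffs(rows, field="layoffs"):
    """Sum the layoff counts, skipping blank or non-numeric values."""
    total = 0
    for row in rows:
        digits = str(row.get(field, "")).replace(",", "").strip()
        if digits.isdigit():
            total += int(digits)
    return total


# Hypothetical sample rows; the second is an exact duplicate of the first.
rows = [
    {"company": "Acme", "date": "2019-01-02", "layoffs": "1,050"},
    {"company": "Acme", "date": "2019-01-02", "layoffs": "1,050"},
    {"company": "Globex", "date": "2020-03-04", "layoffs": "120"},
    {"company": "Initech", "date": "2021-05-06", "layoffs": ""},
]

unique_rows = drop_exact_duplicates(rows)
print(len(unique_rows), "entries,", total_layoffs(unique_rows), "layoffs")
```

Exact matching will not catch amended notices that differ in any field, so revision rows would still need a separate look.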