KY scraper not downloading - Githubissues

biglocalnews / warn-scraper

Command-line interface for downloading WARN Act notices of qualified plant closings and mass layoffs from state government websites

https://warn-scraper.readthedocs.io

Apache License 2.0


KY scraper not downloading #557

Closed stucka closed 6 months ago

stucka commented 1 year ago

Kentucky's scraper relies on an Excel snapshot that's years out of date now.

Looks like the newer stuff is kept here: https://kcc.ky.gov/Pages/News.aspx

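A simple approach should look something like this:

import requests
from bs4 import BeautifulSoup

baseurl = "https://kcc.ky.gov"
starturl = "https://kcc.ky.gov/Pages/News.aspx"

r = requests.get(starturl)
html = r.text
# Keep only the markup that follows the "WARN Notices by Year" heading
subpage = html.split("WARN Notices by Year</h3>")[-1]
# The first link after that heading should point to the Excel file
excelurl = baseurl + BeautifulSoup(subpage, "html.parser").find("a")["href"]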
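
For the record, here is a slightly more defensive sketch of the same idea, not the code that ended up in the scraper. It swaps in the standard library's `html.parser` and `urllib.parse.urljoin` in place of BeautifulSoup so that it runs without extra dependencies. The names `FirstLinkParser` and `find_excel_url` are made up for illustration, and it assumes the heading text on the page still reads "WARN Notices by Year".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://kcc.ky.gov"
HEADING = "WARN Notices by Year"  # assumed to still match the page's heading


class FirstLinkParser(HTMLParser):
    """Record the href of the first <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything once a link with an href has been found
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")


def find_excel_url(html, base_url=BASE_URL, heading=HEADING):
    """Return the absolute URL of the first link after the heading."""
    if heading not in html:
        raise ValueError(f"Heading {heading!r} not found; page layout may have changed")
    # Keep only the markup that follows the last occurrence of the heading
    subpage = html.split(heading)[-1]
    parser = FirstLinkParser()
    parser.feed(subpage)
    if parser.href is None:
        raise ValueError("No link found after the heading")
    # urljoin handles both relative and already-absolute hrefs
    return urljoin(base_url, parser.href)
```

Failing loudly when the heading or link is missing should make a future page redesign show up as an error rather than as a silently wrong URL.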

stucka commented 1 year ago

I have a patch that gets the URL to work, but the code stack does not support .xlsx files. We need to either pin an older version of xlrd that does, or switch to openpyxl or something else.

stucka commented 1 year ago

openpyxl is in the requirements and other scrapers appear to use it.
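
A minimal sketch of what reading the downloaded file with openpyxl could look like. The helper name `xlsx_to_rows` is hypothetical and this is not necessarily how the other scrapers do it. It assumes the notices live in the first worksheet of the workbook.

```python
import io

from openpyxl import load_workbook


def xlsx_to_rows(xlsx_bytes):
    """Return every row of the first worksheet as a list of cell values."""
    # read_only streams the sheet; data_only returns cached values instead of formulas
    workbook = load_workbook(io.BytesIO(xlsx_bytes), read_only=True, data_only=True)
    worksheet = workbook.worksheets[0]  # assumption: the data is in the first sheet
    rows = [list(row) for row in worksheet.iter_rows(values_only=True)]
    workbook.close()
    return rows
```

The bytes could come from something like `requests.get(excelurl).content`, and the resulting rows could then be written out as CSV.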

stucka commented 1 year ago

ky.py.txt

Incremental backup attached here, as I can't commit to a branch without passing all the tests.

stucka commented 1 year ago

Historical data has been normalized at https://storage.googleapis.com/bln-data-public/warn-layoffs/ky-historical-normalized.csv

Historical data has been copied from Kentucky's site and archived at https://storage.googleapis.com/bln-data-public/warn-layoffs/ky-original-1998-2016.xlsx

The Jupyter Notebook used to normalize the data is saved in a non-public project, so it is attached here for the record: ky-history-er.ipynb.txt

stucka commented 1 year ago

ky.py-scraper.txt

Latest draft of scraper archived here. Hasn't passed tests yet.

stucka commented 6 months ago

Closed with https://github.com/biglocalnews/warn-scraper/commit/44d6c2c89d4e8c9929d0104ef2ebb83452742c0e apparently