Fix DC: Coax agency to correct broken HTML

zstumgoren commented 2 years ago

The DC WARN pages appear to have changed, and the scraper is raising the error below. The current status of the WARN pages linked from the 2021 page is:

- 2018 through 2021 work
- 2017 and earlier are broken, except for 2014, which appears to duplicate the 2018 data

- [x] We should call the DC Dept. of Employment Services about the status of their WARN pages.
- [ ] We'll need to update the scraper based on their response. If they do not respond, we should update the scraper to only pull data for 2018 to present.
- [ ] Restore DC to the list of states to scrape in Prefect settings for the WARN Update project.

Stacktrace

Traceback (most recent call last):
  File "/Users/tumgoren/code/stanford/biglocal/WARN/warn/cli.py", line 67, in main
    runner.scrape(state)
  File "/Users/tumgoren/code/stanford/biglocal/WARN/warn/runner.py", line 46, in scrape
    output_csv = state_mod.scrape(self.output_dir, self.working_dir)
  File "/Users/tumgoren/code/stanford/biglocal/WARN/warn/scrapers/dc.py", line 43, in scrape
    for url in url_list:
IndexError: list index out of range
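
As a minimal sketch of how a failure like this could be made easier to diagnose, the scraper could check the list of scraped links before using it and raise a descriptive error when the page yields nothing. The names `NoYearLinksError` and `require_links` are hypothetical and are not part of `dc.py`; the assumption here is that the `IndexError` stems from the changed pages returning fewer links or tables than expected.

```python
class NoYearLinksError(RuntimeError):
    """Raised when a WARN index page yields no links to scrape."""


def require_links(url_list, source_url):
    """Return url_list unchanged, or raise a descriptive error if it is empty.

    Hypothetical helper: it only guards against an empty result so that a
    page-layout change surfaces as a clear message instead of an IndexError.
    """
    if not url_list:
        raise NoYearLinksError(
            f"No WARN links found on {source_url}; the page layout may have changed."
        )
    return url_list
```

A scraper could wrap its link list in `require_links(...)` before looping over it, so the error message names the page that came back empty.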

zstumgoren commented 2 years ago

Left a voice message at the media inquiry number, (202) 671-1904, and sent an email to does@dc.gov.

zstumgoren commented 2 years ago

DC has been removed from the list of states to scrape in Prefect, pending feedback from DC Dept of Employment Services and final implementation of bugfix.

zstumgoren commented 2 years ago

Followed up with DOES contacts via phone and email and tried various other departments. No luck so far.

Discovered that DC has American Job Center locations. They don't appear to post data on the DC job center site, but they may be a better starting point for locating whoever manages the data for DC. See the links below for details and contact info:

zstumgoren commented 2 years ago

Reached out to mayor's office. Person there took info and said they'd have someone from DC Comms dept reach out...

zstumgoren commented 2 years ago

Got a callback from James Clopton in the Rapid Response department. He's in charge of notifying the site maintainers in the public affairs dept about new filings. They in turn maintain the pages. He said they just recently (in the last week) switched to a new site, and he didn't realize the pages were broken. He notified the public affairs folks about the breakage and said he'd pass my info along.

zstumgoren commented 2 years ago

No fixes have been applied yet. Pinged James Clopton today. Awaiting response...

zstumgoren commented 2 years ago

Our agency contact reached out to say they're finalizing updates to the pages prior to publishing. No official ETA, but sounds like we're getting closer...

palewire commented 2 years ago

In #385, while I was patching the CSV writing method, I added a simple hack to help this scraper work as we wait for a response.

https://github.com/biglocalnews/warn-scraper/blob/main/warn/scrapers/dc.py#L50-L64

zstumgoren commented 2 years ago

DC data pages are now restored for 2012 through a newly posted 2022 page. However, 2014 still points to the 2018 page.

Also worth noting: the URL patterns remain all over the place (i.e. there is no regular pattern), so we'll need to scrape the year links from the page for the most recently available year.

I've notified the agency, but I think we could update the scraper to start from 2015 onward for now, until the 2014 issue is resolved.
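
A rough sketch of that approach, using only the standard library: parse the most recent year's page, collect each link whose text contains a four-digit year, and keep 2015 onward. The class and function names, the regex, and the sample HTML are all hypothetical; the real DC markup and the existing `dc.py` logic may differ.

```python
import re
from html.parser import HTMLParser

# Assumption: each year link's visible text contains the year, e.g. "2018 WARN Notices".
YEAR_RE = re.compile(r"\b(20\d{2})\b")


class YearLinkParser(HTMLParser):
    """Collect {year: href} for anchors whose link text contains a four-digit year."""

    def __init__(self):
        super().__init__()
        self.year_links = {}
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside an anchor that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            match = YEAR_RE.search("".join(self._text))
            if match:
                self.year_links[int(match.group(1))] = self._href
            self._href = None


def year_links_from(html, first_year=2015):
    """Return {year: href} for years >= first_year found in the page HTML."""
    parser = YearLinkParser()
    parser.feed(html)
    return {y: u for y, u in sorted(parser.year_links.items()) if y >= first_year}


# Hypothetical markup with irregular URLs, for illustration only.
SAMPLE_HTML = (
    '<ul><li><a href="/page/a-2018">2018 WARN Notices</a></li>'
    '<li><a href="/node/99">2014 WARN Notices</a></li>'
    '<li><a href="/x/2015-list">WARN 2015</a></li>'
    '<li><a href="/about">About</a></li></ul>'
)
```

Because the links are keyed by the year in the link text rather than by URL shape, irregular URLs don't matter, and raising or lowering `first_year` is a one-argument change once the 2014 page is sorted out.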

/cc @palewire

chriszs commented 2 years ago

In a 2017 copy of the page on Archive.org, the URL that 2014 now points to redirects to a URL specific to 2014, but the page no longer appears to be available at that URL.

zstumgoren commented 2 years ago

@chriszs Latest from DC contact:

It seems there may be more [pages?] dropping off. That [2014] page was lost in conversion and unrecoverable. I am waiting on feedback, but will follow up when I find out. Thanks and have a great weekend!

I'll pass along the Archive.org page you uncovered. Perhaps they can restore it from that page if they can confirm the accuracy of notices listed there...

biglocalnews / warn-scraper

Fix DC: Coax agency to correct broken HTML #238
