What's this PR do?

Fixes our Chicago Board of Elections spider (aka `chi_board_elections`), which broke due to URL and page structure changes.

Why are we doing this?
We want working scrapers, of course 🤖 The changes in this PR include a rebuilt spider and tests.
Steps to manually test

After installing the project using `pipenv` (see Readme):

Activate the virtual environment:
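The exact command isn't reproduced here, but assuming the standard `pipenv` workflow it should just be:

```sh
# Enter the project's virtual environment (standard pipenv workflow; assumed, not quoted from the original steps).
pipenv shell
```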
Run the spider:
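Something along these lines, assuming the spider name matches `chi_board_elections` and the output path matches the CSV referenced in the inspection step below:

```sh
# Crawl the rebuilt spider and write its items to CSV for inspection.
# The spider name and output path are assumptions based on the steps in this PR.
scrapy crawl chi_board_elections -o test_output.csv
```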
Monitor the stdout and ensure that the crawl proceeds without raising any errors. Pay attention to the final status report from scrapy.
Inspect `test_output.csv` to ensure the data looks valid. I suggest taking a cursory look at the target webpage and clicking through to a few of the meeting detail pages in order to spot check the data.

Are there any smells or added technical debt to note?

The prior spider included handling for a setting called "CITY_SCRAPERS_ARCHIVE", which I have dropped. This setting appears to be somewhat unique to this particular city-scrapers repo (I don't see it used in our Philly or Akron repos), and I believe it was added to provide bespoke behavior for spiders that scrape very old meeting information. Because we're only scraping very recent meeting data with this rebuilt spider, I think we can safely exclude it. A number of other spiders in this repo also appear to exclude it, so I think this is a safe call. A rough sketch of the kind of guard that was dropped appears after these notes.

Not a smell but a small note: this agency is a little unusual in that it only displays info on past meetings and the next immediately available meeting. Upcoming meetings beyond that aren't on the page. Still, a sole upcoming meeting is valuable for our Documenters, and past meeting data typically seems to include a link to a meeting video, which is also likely to be valuable to our partner sites.
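For reviewers unfamiliar with the setting: this is only a hypothetical illustration of how such a flag is typically consulted in a Scrapy spider. The class name, spider name, and branch logic below are made up; the actual dropped code isn't shown in this PR.

```python
# Hypothetical sketch only; names and logic are assumptions for illustration.
import scrapy


class ExampleArchiveAwareSpider(scrapy.Spider):
    name = "example_archive_aware"

    def parse(self, response):
        # Scrapy spiders can read project settings at runtime via self.settings.
        # A truthy CITY_SCRAPERS_ARCHIVE would flip the spider into an "archive"
        # mode that also pulls very old meeting data.
        if self.settings.getbool("CITY_SCRAPERS_ARCHIVE"):
            # e.g. follow pagination back through historical meeting pages
            pass
```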