## What's this PR do?

Fixes our Illinois Commerce Commission spider (a.k.a. `il_commerce`), which broke due to URL and HTML changes on the target webpage.
## Why are we doing this?
We want working scrapers, of course 🤖 The changes in this PR include targeting a slightly different URL and using new CSS selectors.
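For reviewers who haven't looked at the spider before, the fix boils down to pointing the spider at the updated URL and pulling fields out with selectors that match the new markup. Here's a minimal sketch of that pattern; the URL and CSS selectors below are illustrative placeholders, not the exact values in this diff, so check the spider source for the real ones.

```python
# Illustrative sketch only -- the start URL and selectors here are
# placeholders, not the actual values changed in this PR.
import scrapy


class IlCommerceSpider(scrapy.Spider):
    name = "il_commerce"
    start_urls = ["https://www.icc.illinois.gov/meetings"]  # placeholder URL

    def parse(self, response):
        # Locate each meeting entry with a CSS selector, then pull
        # fields out of its child elements.
        for item in response.css(".card-body"):
            yield {
                "title": item.css("a::text").get(),
                "detail_url": response.urljoin(item.css("a::attr(href)").get()),
            }
```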
## Steps to manually test
After installing the project using `pipenv` (see the README), run:

```sh
scrapy crawl il_commerce -O test_output.csv
```
Monitor stdout and ensure the crawl proceeds without raising any errors, paying particular attention to the final stats report Scrapy prints when the crawl closes.
Inspect `test_output.csv` to ensure the data looks valid. I suggest taking a cursory look at the target webpage and clicking through to a few of the meeting detail pages to spot-check the data.
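If you'd rather script that spot check, a quick sanity pass over the CSV could look like the sketch below. The column names (`title`, `start`) are assumptions on my part; match them to the actual CSV header.

```python
# Quick, optional sanity check on the crawl output. Column names are
# assumed -- adjust them to whatever the CSV header actually contains.
import csv

with open("test_output.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(f"{len(rows)} meetings scraped")
for row in rows[:5]:
    # Print a couple of fields per row for eyeballing against the site.
    print(row.get("title"), row.get("start"))
```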
## Are there any smells or added technical debt to note?
I set the query params in our target URL to return a page with all meetings scheduled over the next 32 days. That window is somewhat arbitrary on my part; as a new maintainer of this repo, I'm not sure whether there's a general timeframe we like to target, but a month of meeting data seemed reasonable. Happy to adjust this in response to PR feedback.
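For concreteness, the sketch below shows roughly how a "next N days" window gets encoded into the request URL. The parameter names (`dtInit`, `dtEnd`) and the base URL are placeholders of mine, not necessarily what the ICC site expects; the real query params live in the spider.

```python
# Hypothetical sketch of building a "next 32 days" request URL.
# Parameter names and base URL are placeholders, not the real ones.
from datetime import date, timedelta
from urllib.parse import urlencode

WINDOW_DAYS = 32  # the somewhat arbitrary window discussed above

start = date.today()
end = start + timedelta(days=WINDOW_DAYS)
params = urlencode({"dtInit": start.isoformat(), "dtEnd": end.isoformat()})
url = f"https://www.icc.illinois.gov/meetings?{params}"
print(url)
```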