City-Bureau / city-scrapers

Scrape, standardize and share public meetings from local government websites
https://cityscrapers.org
MIT License

🕷️ Fix spider: Illinois Pollution Control Board #1124

Closed SimmonsRitchie closed 4 months ago

SimmonsRitchie commented 4 months ago

What's this PR do?

Fixes our Illinois Pollution Control Board spider (spider name: il_pollution_control).

Why are we doing this?

The spider appears to have started breaking around the time this repo's version of Scrapy was upgraded, apparently because it uses Scrapy's Crawler class in an outdated way. This PR fixes the spider and simplifies it: it no longer attempts to parse agendas and meeting minutes. The simpler handling is intended to make the scraper less brittle when the website changes.

Steps to manually test

After installing the project using pipenv:

  1. Activate the virtual environment:

    pipenv shell

  2. Run the spider:

    scrapy crawl il_pollution_control -O test_output.csv
  3. Monitor stdout and confirm that the crawl completes without raising any errors. Pay attention to the final stats report from Scrapy.

  4. Inspect test_output.csv to ensure the data looks valid. I suggest opening a few of the URLs in the source column of test_output.csv and comparing each row's data with what you see on the page.
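The manual inspection in step 4 could be complemented by a quick scripted pass over the CSV. The following is only a minimal sketch, not part of this PR: the helper names find_suspect_rows and check_file are hypothetical, and the title and start column names are assumptions about the output schema (only the source column is named in the steps above).

```python
import csv


def find_suspect_rows(csv_file, required=("source",)):
    """Return (record_number, reason) pairs for records that look invalid.

    csv_file is any open text file object. Record numbers count data
    records from 1 and do not include the header row.
    """
    problems = []
    reader = csv.DictReader(csv_file)
    for record_number, row in enumerate(reader, start=1):
        # Flag required columns that are missing or blank.
        for column in required:
            value = (row.get(column) or "").strip()
            if not value:
                problems.append((record_number, f"empty {column}"))
        # A non-empty source should look like a web URL so it can be opened.
        source = (row.get("source") or "").strip()
        if source and not source.startswith(("http://", "https://")):
            problems.append((record_number, "source is not a URL"))
    return problems


def check_file(path, required=("title", "start", "source")):
    """Print one line per suspect record; the default columns are assumed."""
    with open(path, newline="", encoding="utf-8") as f:
        for record_number, reason in find_suspect_rows(f, required):
            print(f"record {record_number}: {reason}")
```

Calling check_file("test_output.csv") after the crawl prints any records with blank required fields or a malformed source. It does not replace comparing a few rows against the live pages, since it cannot tell whether the scraped values are actually correct.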

Are there any smells or added technical debt to note?