City-Bureau / city-scrapers-cle

City Scrapers project for Cleveland
https://cityscrapers.org/
MIT License

🏗️ Add spider: Cleveland Police Commission #83

Closed SimmonsRitchie closed 9 months ago

SimmonsRitchie commented 9 months ago

What's this PR do?

Adds a spider for the Cleveland Police Commission (known as cle_cpc).

Why are we doing this?

Requested by City Bureau's partner site in Cleveland.

Steps to manually test

After installing the project using pipenv (see Readme):

  1. Activate the virtual environment:

    pipenv shell

  2. Run the spider:

    scrapy crawl cle_cpc -O test_output.csv
  3. Monitor stdout and ensure that the crawl proceeds without raising any errors. Pay attention to the final status report from scrapy.

  4. Inspect test_output.csv to ensure the data looks valid. I suggest opening a few of the URLs under the source column of test_output.csv and comparing each row's data with what you see on the page.
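As a complement to the manual inspection in step 4, a small script can flag rows with empty fields before you start clicking through URLs. This is only a sketch: the function name `summarize_rows` and the default column names (`title`, `start`, `source`) are assumptions for illustration, so adjust them to whatever header the CSV actually has.

```python
import csv
import io


def summarize_rows(csv_text, required=("title", "start", "source")):
    """Return (row_count, problems) for CSV text.

    problems is a list of (row_number, missing_columns) tuples, where
    row_number counts data rows from 1 (the header is not counted).
    The default column names are assumptions, not taken from the spider.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    count = 0
    for row_number, row in enumerate(reader, start=1):
        count += 1
        # Treat an absent column and a blank value the same way.
        missing = [col for col in required if not (row.get(col) or "").strip()]
        if missing:
            problems.append((row_number, missing))
    return count, problems


# Example use against the file produced by the crawl:
#     with open("test_output.csv", newline="") as f:
#         print(summarize_rows(f.read()))
```

An empty problems list does not mean the data is correct, only that no required field is blank, so the spot check against the source pages is still worthwhile.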

Are there any smells or added technical debt to note?

Common detail pages

The target website is pretty quirky. Meeting titles and meeting detail pages are shared across committees of the same type. For instance, at time of writing, all three "Police Discipline Group" events that are displayed link to the same detail page:

image

The detail page itself contains largely generic meeting information and the meeting date refers to the next scheduled meeting.

image

To account for these quirks, we simply scrape only the first unique committee meeting we come across on the calendar. The downside is that we scrape fewer upcoming meetings; the upside is that we hopefully scrape only information that relates to the next upcoming meeting and avoid generating confusing data.
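The "first unique committee meeting" rule can be sketched roughly as below. This is not the spider's actual code: the function name, the dict shape, and the choice of the detail-page URL as the uniqueness key are all assumptions made for illustration, and the sketch assumes the calendar lists events in chronological order.

```python
def first_meeting_per_detail_page(events):
    """Keep only the first event seen for each detail-page URL.

    `events` is a hypothetical list of dicts with a "detail_url" key,
    in the order they appear on the calendar. Order is preserved.
    """
    seen = set()
    kept = []
    for event in events:
        key = event["detail_url"]
        if key in seen:
            # A later event pointing at a detail page we already scraped
            # would just repeat the "next meeting" info, so skip it.
            continue
        seen.add(key)
        kept.append(event)
    return kept
```

With three events that share one detail page and a fourth that links elsewhere, only the first of the three and the fourth survive.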

Common titles

As a compounding quirk, each committee appears to share a single meeting title across all of its scheduled meetings on the calendar page. This makes for a particularly confusing user experience when a committee adds a special note to its title. Note the two February "BEHAVIORAL HEALTH & CRISIS INTERVENTION WORK GROUP" meetings below, whose title carries a note about a cancelled January meeting that no longer appears relevant to the meetings themselves:

image

To handle this, the spider overrides CityScrapersSpider's _get_status method, which would otherwise classify both meetings as cancelled because their titles contain the word "cancelled". The downside is that we risk treating genuinely cancelled meetings as tentative, but this seemed preferable to having scheduled meetings incorrectly classified as cancelled.
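The effect of the override can be illustrated with a standalone function. This is a sketch of the idea rather than the method in this PR: the function name and signature are hypothetical, the string constants are stand-ins for the status constants a spider would normally import, and returning "passed" for past meetings is an assumption.

```python
from datetime import datetime, timedelta

# Stand-in status values, assumed for this sketch only.
TENTATIVE = "tentative"
PASSED = "passed"


def status_ignoring_title(title, start, now=None):
    """Decide a meeting status from its start time alone.

    `title` is accepted but deliberately unused, so a stale note such as
    "January meeting cancelled" cannot mark the meeting as cancelled.
    """
    now = now or datetime.now()
    return PASSED if start < now else TENTATIVE
```

For example, a title containing "CANCELLED" with a start time tomorrow yields `TENTATIVE` here, which is exactly the trade-off described above: a meeting that really is cancelled would be reported as tentative too.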