What's this PR do?
Adds a spider for the Cleveland Police Commission (known as cle_cpc).
Why are we doing this?
Requested by City Bureau's partner site in Cleveland.
Steps to manually test
After installing the project using pipenv (see Readme):
Activate the virtual environment:
pipenv shell
Run the spider:
scrapy crawl cle_cpc -O test_output.csv
Monitor the stdout and ensure that the crawl proceeds without raising any errors. Pay attention to the final status report from scrapy.
Inspect test_output.csv to ensure the data looks valid. I suggest opening a few of the URLs under the source column of test_output.csv and comparing the data for the row with what you see on the page.
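The manual inspection above can be supplemented with a quick scripted check. This is a minimal sketch, not part of the PR: the helper name `rows_missing_fields` is hypothetical, and the `title` and `start` column names are assumptions about the CSV layout (only the `source` column is mentioned above).

```python
import csv
import io


def rows_missing_fields(csv_file, required=("title", "start", "source")):
    """Return (row_number, field) pairs for required fields that are blank.

    The column names in `required` are assumptions; adjust them to match
    the header row of the exported file.
    """
    problems = []
    reader = csv.DictReader(csv_file)
    # Row 1 is the header, so the first data row is row 2.
    for row_number, row in enumerate(reader, start=2):
        for field in required:
            if not (row.get(field) or "").strip():
                problems.append((row_number, field))
    return problems


# Tiny in-memory sample standing in for test_output.csv (made-up rows).
sample = io.StringIO(
    "title,start,source\n"
    "Full Commission,2021-02-10 18:00:00,https://example.com/a\n"
    "Work Group,,https://example.com/b\n"
)
print(rows_missing_fields(sample))  # [(3, 'start')]
```

To run it against the real output, pass an open file instead of the sample, for example `open("test_output.csv", newline="")`.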
Are there any smells or added technical debt to note?
Common detail pages
The target website is pretty quirky. Meeting titles and meeting detail pages are shared across committees of the same type. For instance, at time of writing, all three "Police Discipline Group" events that are displayed link to the same detail page:
The detail page itself contains largely generic meeting information and the meeting date refers to the next scheduled meeting.
To account for these quirks, we simply scrape only the first unique committee meeting we come across on the calendar. The downside is that we scrape fewer upcoming meetings. The upside is that we hopefully scrape only information that relates to the next upcoming meeting and avoid generating confusing data.
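The "first unique meeting" rule can be sketched roughly as follows. This is an illustration under assumptions, not the spider's actual code: the function name, the dict shape, and the choice of the detail page URL as the uniqueness key are all hypothetical, and the events are assumed to arrive in calendar order.

```python
def first_meeting_per_committee(events):
    """Keep only the first event seen for each shared detail page.

    `events` is assumed to be an iterable of dicts in calendar order,
    each with a hypothetical "detail_url" key.
    """
    seen = set()
    unique = []
    for event in events:
        key = event["detail_url"]
        if key in seen:
            # A later occurrence of the same committee meeting: skip it,
            # since the detail page only describes the next scheduled one.
            continue
        seen.add(key)
        unique.append(event)
    return unique
```

Keying on the title instead of the URL would work the same way, since both are shared across a committee's events.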
Common titles
As a compounding quirk, each committee appears to share a common meeting title across all scheduled meetings on the calendar page. This creates a particularly confusing user experience if a committee adds a special note to its title. Note the two February "BEHAVIORAL HEALTH & CRISIS INTERVENTION WORK GROUP" meetings below: their shared title includes a note about a cancelled January meeting that no longer appears to be relevant to the meetings themselves:
To handle this, this spider overrides CityScrapersSpider's _get_status method, which would otherwise classify both meetings as cancelled because they contain the word "cancelled". The downside is that we risk treating cancelled meetings as tentative meetings, but this seemed preferable to having scheduled meetings incorrectly classified as cancelled.
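The trade-off can be illustrated with a standalone sketch. It does not import or reproduce the real `CityScrapersSpider` code: both functions are hypothetical stand-ins, the status strings are assumed values, and the example title is made up.

```python
from datetime import datetime

# Assumed status values, for illustration only.
CANCELLED = "cancelled"
PASSED = "passed"
TENTATIVE = "tentative"


def keyword_status(title, start, now):
    """Stand-in for the default behaviour described above:
    any mention of a cancellation in the title marks the meeting cancelled."""
    if "cancel" in title.lower():
        return CANCELLED
    return PASSED if start < now else TENTATIVE


def date_only_status(start, now):
    """Stand-in for the override: ignore the title and use only the date."""
    return PASSED if start < now else TENTATIVE


title = "Work Group (January meeting cancelled)"  # made-up example title
start = datetime(2021, 2, 10, 18, 0)
now = datetime(2021, 1, 20)
print(keyword_status(title, start, now))  # cancelled
print(date_only_status(start, now))       # tentative
```

The second function shows the accepted risk: a meeting that really is cancelled would also come out as tentative.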