What's this PR do?
Fixes our Cleveland Design Review Advisory Committees spider (a.k.a. `cle_design_review`). Also tweaks the `description` text for meetings.
Why are we doing this?
We want working scrapers, of course 🤖 This spider was breaking because when it parsed datetime information from the page it expected weekdays (e.g. Thursday) to always be pluralized. This adjusts the handling so it accepts both pluralized and non-pluralized forms.
It also reworks the `description` for each meeting so it's simpler and less dated (we no longer reference COVID).
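The weekday fix could be sketched along these lines. This is a hypothetical helper, not the PR's actual code: it normalizes an optionally pluralized weekday name (e.g. "Thursdays" → "Thursday") before the string is handed to datetime parsing.

```python
import re

WEEKDAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

def normalize_weekday(text):
    """Strip a trailing 's' from pluralized weekday names so that
    downstream datetime parsing works on either form."""
    pattern = r"\b(" + "|".join(WEEKDAYS) + r")s?\b"
    return re.sub(pattern, lambda m: m.group(1), text)
```

With this kind of normalization, "Thursdays at 1:00 pm" and "Thursday at 1:00 pm" both parse the same way.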
Steps to manually test
After installing the project using pipenv:
Activate the virtual environment:
pipenv shell
Run the spider:
scrapy crawl cle_design_review -O test_output.csv
Monitor stdout and ensure that the crawl proceeds without raising any errors. Pay attention to the final status report from Scrapy.
Inspect test_output.csv to ensure the data looks valid. I suggest opening a few of the URLs in the source column of test_output.csv and comparing the data for each row with what you see on the page.
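To speed up that spot check, a quick sketch like the following can pull a few source URLs and titles out of the CSV for manual comparison. The column names here (`source`, `title`) are assumptions based on typical City Scrapers output, so adjust as needed:

```python
import csv

def sample_sources(path, limit=5):
    """Return (source, title) pairs for the first `limit` rows of the
    spider output, so the URLs can be opened and compared against the
    scraped data. Column names are assumed, not taken from the PR."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return [
            (row.get("source"), row.get("title"))
            for _, row in zip(range(limit), reader)
        ]
```

Running `sample_sources("test_output.csv")` after the crawl prints a handful of rows to verify by hand.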
Are there any smells or added technical debt to note?