What's this PR do?
Fixes our Cuyahoga County Office of Homeless Services Advisory Board spider (a.k.a. `cuya_homeless_services`), which broke due to URL and page structure changes.
Why are we doing this?
We want working scrapers, of course 🤖 The changes in this PR add a new class mixin to handle the new pages. They also tweak the mixin's start date parsing so that it fails gracefully when parsing is unsuccessful.
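The fail-gracefully behavior can be sketched roughly like this (the function name, date format, and fallback value are illustrative, not copied from the actual mixin):

```python
from datetime import datetime


def parse_start(date_str, time_str="12:00 AM"):
    """Parse a meeting's start datetime, returning None instead of
    raising when the scraped strings don't match the expected format."""
    try:
        return datetime.strptime(
            f"{date_str} {time_str}", "%B %d, %Y %I:%M %p"
        )
    except (ValueError, TypeError):
        # Page structure changed or the field is missing; let the
        # caller skip this meeting rather than crash the whole crawl.
        return None
```

For example, `parse_start("March 5, 2023")` yields a datetime, while `parse_start("TBD")` returns `None` so the spider can keep crawling.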
Steps to manually test
After installing the project using `pipenv` (see the README):

1. Activate the virtual environment.
2. Run the spider.
3. Monitor stdout and ensure the crawl proceeds without raising any errors. Pay particular attention to Scrapy's final stats report.
4. Inspect `test_output.csv` to ensure the data looks valid. I suggest opening a few of the URLs in the `source` column of `test_output.csv` and comparing each row's data with what you see on the page.
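Concretely, the run might look like the following (the exact crawl command and output flag are assumptions based on a standard Scrapy + pipenv setup, not copied from this repo):

```shell
# Activate the pipenv virtual environment (assumes a Pipfile in the repo root)
pipenv shell

# Run the spider and export scraped items to CSV
# (spider name from this PR; -O overwrites test_output.csv if it exists)
scrapy crawl cuya_homeless_services -O test_output.csv
```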
Are there any smells or added technical debt to note?