Are there any smells or added technical debt to note?
Meeting information for Cuyahoga County agencies is largely hosted on the same Cuyahoga County website, and our spiders appear to use a mixin to share parsing logic. Since this spider is broken, it's likely our other Cuyahoga County spiders are broken too and the entire mixin needs to be rewritten. I will likely do this in a future fix. For now, I opted to fix only this spider, since partner site reports confirmed it was broken.
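To illustrate why a shared mixin means one page change can break several spiders at once, here is a minimal sketch. It is not taken from our codebase: the class names, the parse_title helper, and its cleanup logic are all hypothetical, and it deliberately avoids importing scrapy so it runs on its own.

```python
class CuyaCountyMixin:
    """Hypothetical shared parsing logic for Cuyahoga County spiders."""

    agency = None  # each spider is assumed to set its own agency name

    def parse_title(self, raw_title):
        # Shared cleanup: collapse runs of whitespace into single spaces,
        # and fall back to the agency name when nothing is left.
        cleaned = " ".join(raw_title.split())
        return cleaned or self.agency


class ChildrenFamilyAdvisorySpider(CuyaCountyMixin):
    # Spider name from this PR; the agency string is the board's full name.
    name = "cuya_children_family_advisory"
    agency = "Cuyahoga County Children and Family Services Advisory Board"
```

Every spider that inherits from the mixin gets the same parsing behavior, so if the county site changes its markup, each of them would fail in the same way until the mixin itself is updated.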
What's this PR do?
Fixes our Cuyahoga County Children and Family Services Advisory Board spider (aka cuya_children_family_advisory), which broke due to URL and page structure changes.
[Note: This PR builds off #67, which should be reviewed first]
Why are we doing this?
We want working scrapers, of course 🤖 The changes in this PR include new target URLs and rebuilt parser methods to account for HTML changes.
Steps to manually test
After installing the project using pipenv:
Activate the virtual environment:
Run the spider:
Monitor the stdout and ensure that the crawl proceeds without raising any errors. Pay attention to the final status report from scrapy.
Inspect test_output.csv to ensure the data looks valid. I suggest opening a few of the URLs under the source column of test_output.csv and comparing the data for the row with what you see on the page.
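Part of the CSV inspection can be automated before spot-checking pages by hand. The following is a rough sketch, not part of this PR: the check_output helper is hypothetical, and it assumes only that test_output.csv has a header row with a source column, as described in the steps above.

```python
import csv
from pathlib import Path


def check_output(path="test_output.csv", sample_size=3):
    """Return (row_count, sample_sources) after basic sanity checks."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    if not rows:
        raise ValueError(f"{path} contains no rows")

    # Collect the 1-based positions of data rows whose source URL is blank.
    missing = [
        i for i, row in enumerate(rows, start=1)
        if not (row.get("source") or "").strip()
    ]
    if missing:
        raise ValueError(f"rows missing a source URL: {missing}")

    # Hand back a few URLs to open in a browser and compare manually.
    return len(rows), [row["source"] for row in rows[:sample_size]]
```

Calling check_output() after the crawl returns the row count and a few source URLs to open. The manual comparison against the live pages is still needed, since this only catches empty output and blank sources.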