Are there any smells or added technical debt to note?
A rotating user agent would probably provide a more robust long-term solution, but we'll see how this works for now.
Adding a headless browser means that the workflows need to install the browser each time they're run, so their execution time will be a bit longer from this point forward. We consider that an acceptable tradeoff to ensure this spider is working properly.
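If we do end up needing the rotating user agent mentioned above, a minimal sketch could look like the following. The class name, the placeholder user agent strings, and the injectable rng are all hypothetical and not part of this PR. Whether a header set this way reaches the site when requests go through the headless browser depends on how the browser integration handles headers, so treat this as a starting point only.

```python
import random


class RotatingUserAgentMiddleware:
    """Hypothetical downloader middleware that picks a User-Agent per request."""

    def __init__(self, user_agents, rng=None):
        if not user_agents:
            raise ValueError("at least one user agent is required")
        self.user_agents = list(user_agents)
        # An injectable random generator keeps the choice reproducible in tests.
        self.rng = rng or random.Random()

    def process_request(self, request, spider=None):
        # Overwrite the header on the outgoing request. Returning None tells
        # Scrapy to keep processing the request as usual.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        return None


# Placeholder pool: replace these with real browser user agent strings.
EXAMPLE_USER_AGENTS = [
    "ExampleBrowser/1.0 (placeholder A)",
    "ExampleBrowser/2.0 (placeholder B)",
]
```

To try it, the class would need to be registered under Scrapy's DOWNLOADER_MIDDLEWARES setting and constructed with a pool such as EXAMPLE_USER_AGENTS.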
What's this PR do?
Fixes our Northeast Ohio Areawide Coordinating Agency spider (aka cuya_northeast_ohio_coordinating).
Why are we doing this?
The spider broke due to what appear to be security enhancements on the agency's servers. The webpage didn't change, but requests were returning 403 errors. Using a headless browser and a different user agent appears to circumvent whatever bot-detection system the agency is using.
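For readers unfamiliar with this kind of fix, here is a rough sketch of what per-spider settings for a headless browser plus a custom user agent can look like. It assumes the scrapy-playwright package supplies the browser, which this description does not confirm, and the user agent string is a placeholder rather than the value used in this PR.

```python
# Placeholder value: the actual user agent string used in this PR is not shown here.
CUSTOM_USER_AGENT = "Mozilla/5.0 (placeholder; replace with a real browser string)"

# A dict like this would typically be set as the spider's custom_settings attribute.
custom_settings = {
    # USER_AGENT is Scrapy's standard setting for the default User-Agent header.
    "USER_AGENT": CUSTOM_USER_AGENT,
    # Assumption: scrapy-playwright provides the headless browser.
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # scrapy-playwright requires the asyncio-based Twisted reactor.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}
```

Under that assumption, individual requests would also need meta={"playwright": True} to be rendered by the browser instead of Scrapy's default downloader.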
Steps to manually test
After installing the project using pipenv:
Activate the virtual environment:
Run the spider:
Monitor stdout and ensure that the crawl proceeds without raising any errors. Pay attention to the final status report from Scrapy.
Inspect test_output.csv to ensure the data looks valid. I suggest opening a few of the URLs under the source column of test_output.csv and comparing the data for the row with what you see on the page.
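To make the spot check above a little quicker, a small helper along these lines could pull a few random rows out of test_output.csv and print them next to their source URLs. The function names are hypothetical; the only column name assumed is source, which is the one mentioned in the step above.

```python
import csv
import random


def sample_rows(csv_path, n=3, seed=None):
    """Return up to n randomly chosen rows from the CSV as dicts."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def print_spot_check(csv_path, n=3):
    """Print each sampled row's source URL followed by its other columns."""
    for row in sample_rows(csv_path, n):
        # Open this URL in a browser and compare the page with the values below.
        print(row.get("source", "<no source column>"))
        for key, value in row.items():
            if key != "source":
                print(f"    {key}: {value}")
```

Calling print_spot_check("test_output.csv") after the crawl finishes prints up to three rows to compare by eye against the agency's pages.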