amitupreti / Hands-on-WebScraping

This repo is part of a blog series on several web scraping projects, where we will explore scraping techniques to crawl data from simple websites to websites using advanced protection.
MIT License

HTTP Status Code Is Not Handled Or Not Allowed #12

Open Huntley30 opened 3 years ago

Huntley30 commented 3 years ago

Uh oh... did Twitter break us? Do we have to change the user_agent in settings.py?

2021-09-09 15:34:55 [scrapy.core.engine] INFO: Spider opened
2021-09-09 15:34:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-09-09 15:34:55 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-09-09 15:34:55 [root] INFO: 3 hashtags found
2021-09-09 15:34:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://mobile.twitter.com/hashtag/cats>: HTTP status code is not handled or not allowed
2021-09-09 15:34:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://mobile.twitter.com/hashtag/dogs>: HTTP status code is not handled or not allowed
2021-09-09 15:34:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://mobile.twitter.com/hashtag/hello>: HTTP status code is not handled or not allowed
2021-09-09 15:34:55 [scrapy.core.engine] INFO: Closing spider (finished)
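
For anyone debugging the same log: the "Ignoring response <403 ...>" lines come from Scrapy's HttpError spider middleware, which by default drops non-2xx responses before they reach the spider callback. A minimal sketch of a settings.py fragment that lets the 403 through so the response body can be inspected is below. `USER_AGENT` and `HTTPERROR_ALLOWED_CODES` are standard Scrapy settings, but the values here are assumptions for illustration and not the repo's actual configuration.

```python
# settings.py (sketch, not the repo's actual file)

# Browser-like User-Agent; this particular string is only an example value.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0"

# Pass 403 responses to the spider callback instead of silently dropping them.
# This is a debugging aid: it does not make the site serve the page.
HTTPERROR_ALLOWED_CODES = [403]
```

With this in place the callback receives the 403 response, so `response.status` and `response.text` can be logged to see what the server is actually returning.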

michael-pagan commented 3 years ago

I was able to semi-fix this by updating the USER_AGENT on line 17 of hashtag.py to 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0', as suggested by this StackOverflow post.

The issue remains, however, that no tweets are found, which appears to be incorrect.
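
One way to check the User-Agent change outside of Scrapy is a small standard-library probe that sends a single request with that header and reports the status code. This is only a sketch: the helper names `build_request`, `fetch_status`, and `describe` are hypothetical and not part of the repo, and a 200 status would not by itself mean the page still contains tweets in the markup the spider expects.

```python
import urllib.error
import urllib.request

# User-Agent string quoted in the comment above.
FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0"


def build_request(url, user_agent=FIREFOX_UA):
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(request, timeout=10):
    """Send the request and return the HTTP status code, including error codes."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the code is still the information we want.
        return err.code


def describe(status):
    """Map a status code to a short diagnosis for this issue."""
    if 200 <= status < 300:
        return "accepted"
    if status == 403:
        return "forbidden"
    return "other"


if __name__ == "__main__":
    # Performs a real network call; the URL is taken from the log above.
    status = fetch_status(build_request("https://mobile.twitter.com/hashtag/cats"))
    print(status, describe(status))
```

If this reports "accepted" while the spider still yields no items, the remaining problem is more likely in the parsing step (selectors no longer matching the returned HTML) than in the request headers.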