TeamHG-Memex / scrapy-rotating-proxies

use multiple proxies with Scrapy
MIT License

None of the Proxies are checked - leading to perpetual process where scraping never starts #45

Open caffeinatedMike opened 4 years ago

caffeinatedMike commented 4 years ago

Also, a side note: this middleware does not seem to respect the normal shutdown signal Scrapy sends, which forces the user into an unclean shutdown.

2020-07-31 10:03:05 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force

More than 5 minutes passed after the signal was received, with no shutdown, before I forced an unclean shutdown:

2020-07-31 10:08:42 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown

Note: My current IP is blocked on this site, so I suspect that is the root cause. However, I think it would be a good idea for this middleware to detect when the site is blocking all requests and report that in the logs.
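
To illustrate the kind of detection I mean, here is a minimal sketch that works only on the periodic `Proxies(...)` stats lines shown in the logs in this issue. It is not part of the middleware: `parse_stats`, `looks_stalled`, and the `min_reports` threshold are hypothetical names I made up for this example, and the heuristic (several reports in a row with no good proxy) is just one possible rule.

```python
import re

# Matches the periodic stats line logged by rotating_proxies.middlewares,
# e.g. "Proxies(good: 0, dead: 1, unchecked: 24, reanimated: 0, mean backoff time: 253s)"
STATS_RE = re.compile(
    r"Proxies\(good: (\d+), dead: (\d+), unchecked: (\d+), reanimated: (\d+)"
)


def parse_stats(line):
    """Return (good, dead, unchecked, reanimated), or None if not a stats line."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def looks_stalled(lines, min_reports=4):
    """Hypothetical heuristic: True when at least `min_reports` stats lines
    were seen and none of them reported a single good proxy."""
    reports = [stats for stats in map(parse_stats, lines) if stats is not None]
    if len(reports) < min_reports:
        return False
    return all(good == 0 for good, _dead, _unchecked, _reanimated in reports)
```

Fed the log lines from this issue, a check like this would trip after a couple of minutes, at which point a warning such as "no proxy has produced a good response yet" could be logged instead of repeating the same stats line silently.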

Logs

2020-07-31 10:00:57 [scrapy.core.engine] INFO: Spider opened
2020-07-31 10:00:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:00:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-31 10:00:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:01:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:01:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:01:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:02:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:02:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:02:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:03:05 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force 
2020-07-31 10:03:05 [scrapy.core.engine] INFO: Closing spider (shutdown)
2020-07-31 10:03:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:03:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:03:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:03:58 [rotating_proxies.expire] DEBUG: Proxy <http://XXXXXXXXXXX:8800> is DEAD
2020-07-31 10:03:58 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.kroger.com/robots.txt> with another proxy (failed 1 times, max retries: 5)
2020-07-31 10:04:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 24, reanimated: 0, mean backoff time: 253s)
2020-07-31 10:04:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:04:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 24, reanimated: 0, mean backoff time: 253s)
2020-07-31 10:05:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 24, reanimated: 0, mean backoff time: 253s)
2020-07-31 10:05:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:05:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 24, reanimated: 0, mean backoff time: 253s)
2020-07-31 10:06:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 24, reanimated: 0, mean backoff time: 253s)
2020-07-31 10:06:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:06:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 24, reanimated: 0, mean backoff time: 253s)
2020-07-31 10:06:58 [rotating_proxies.expire] DEBUG: Proxy <http://XXXXXXXXXXX:8800> is DEAD
2020-07-31 10:06:58 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.kroger.com/robots.txt> with another proxy (failed 2 times, max retries: 5)
2020-07-31 10:07:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 2, unchecked: 23, reanimated: 0, mean backoff time: 214s)
2020-07-31 10:07:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:07:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 2, unchecked: 23, reanimated: 0, mean backoff time: 214s)
2020-07-31 10:08:13 [rotating_proxies.middlewares] DEBUG: 1 proxies moved from 'dead' to 'reanimated'
2020-07-31 10:08:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 23, reanimated: 1, mean backoff time: 175s)
2020-07-31 10:08:42 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
2020-07-31 10:08:42 [rotating_proxies.expire] DEBUG: Proxy <http://XXXXXXXXXXX:8800> is DEAD
2020-07-31 10:08:42 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.kroger.com/robots.txt> with another proxy (failed 3 times, max retries: 5)
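
For reference, the gaps between the events in the logs above can be worked out from the timestamps. This is only a small sketch: `seconds_between` is a helper I wrote for this example, and I am assuming the first request went out around 10:00:58, right after the spider opened. Under that assumption, the first two proxies each took 180 seconds to be marked DEAD, which equals Scrapy's default `DOWNLOAD_TIMEOUT` of 180 seconds. Whether that timeout is what actually ends each attempt is my guess; the logs do not confirm it.

```python
from datetime import datetime

# Timestamp format used by the Scrapy log lines in this issue.
FMT = "%Y-%m-%d %H:%M:%S"


def seconds_between(start, end):
    """Whole seconds elapsed between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Assumed start of the first request -> first "is DEAD" message.
first_attempt = seconds_between("2020-07-31 10:00:58", "2020-07-31 10:03:58")
# First "is DEAD" message -> second "is DEAD" message.
second_attempt = seconds_between("2020-07-31 10:03:58", "2020-07-31 10:06:58")
# First SIGINT -> second SIGINT (forced unclean shutdown).
shutdown_wait = seconds_between("2020-07-31 10:03:05", "2020-07-31 10:08:42")

print(first_attempt, second_attempt, shutdown_wait)  # 180 180 337
```

If the guess is right, 25 unchecked proxies at 180 seconds each would take over an hour to work through, and the pending request would also explain why the graceful shutdown did not finish within the 337 seconds I waited.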