TeamHG-Memex / scrapy-rotating-proxies

use multiple proxies with Scrapy
MIT License

Track dead/alive proxies with authentication #7

petermoore14 closed 7 years ago

petermoore14 commented 7 years ago

The built-in HttpProxyMiddleware correctly sets up the authentication parameters for proxies supplied by the rotating proxies plugin. In doing so, however, it strips the credentials, so request.meta['proxy'] ends up containing only the bare proxy URL. When the response is received, rotating proxies tries to mark the proxy as good or dead, but silently fails because of the 'proxy not in self.proxies' check, so every proxy stays unmarked forever. This behaviour can be verified by using any proxy with authentication and observing that logstats keeps reporting every proxy as unchecked while Scrapy is crawling.
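The mismatch described above can be reproduced outside Scrapy. This is only a sketch of the symptom, not the plugin's code: `strip_credentials` is a hypothetical helper that stands in for what happens to request.meta['proxy'], and the proxy URL and the `proxies` dict are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(proxy_url):
    """Return proxy_url without its user:password@ part (hypothetical helper)."""
    parts = urlsplit(proxy_url)
    hostport = parts.hostname if parts.port is None else f"{parts.hostname}:{parts.port}"
    return urlunsplit((parts.scheme, hostport, parts.path, parts.query, parts.fragment))


# Hypothetical proxy table, keyed by the full URL including credentials.
proxies = {"http://user:secret@proxy1.example.com:8000": "unchecked"}

# What the proxy URL looks like once the credentials have been removed.
meta_proxy = strip_credentials("http://user:secret@proxy1.example.com:8000")

print(meta_proxy)             # http://proxy1.example.com:8000
print(meta_proxy in proxies)  # False: the lookup misses, so nothing gets marked
```

Because the stripped URL is never a key of the table, a membership check of this kind rejects every authenticated proxy.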

An easy fix is to update '_handle_result' to find the proxy in self.proxies that corresponds to request.meta['proxy'], and to use that full proxy URL (credentials included) in the rest of the call. I'll make a PR for this unless you have any objections to this approach.
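Roughly, the lookup I have in mind would behave like the sketch below. It is not the actual patch: `hostport` and `find_full_proxy` are hypothetical names, the proxy URLs are invented, and matching on host and port alone is an assumption that only holds when no two proxies share the same host:port.

```python
from urllib.parse import urlsplit


def hostport(proxy_url):
    """Reduce a proxy URL to 'host:port', ignoring scheme and credentials."""
    parts = urlsplit(proxy_url)
    return f"{parts.hostname}:{parts.port}"


def find_full_proxy(meta_proxy, proxies):
    """Linear scan: return the full proxy URL matching meta_proxy, or None."""
    target = hostport(meta_proxy)
    for full_url in proxies:
        if hostport(full_url) == target:
            return full_url
    return None


# Hypothetical proxy table, keyed by the full URL including credentials.
proxies = {
    "http://user:secret@proxy1.example.com:8000": "unchecked",
    "http://user:secret@proxy2.example.com:8000": "unchecked",
}

# The stripped URL is resolved back to the full one before marking it.
full = find_full_proxy("http://proxy2.example.com:8000", proxies)
if full is not None:
    proxies[full] = "good"
```

The scan costs O(N) in the number of proxies for every response it handles.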

kmike commented 7 years ago

A good catch. I haven't checked at all how this package works with proxies+auth. Your proposed fix sounds fine to me. Probably a minor point, but I'd like to avoid an O(N) scan on each request, i.e. it may be better to build the short-to-long mapping at startup.

petermoore14 commented 7 years ago

Ok cool, that makes sense. I'll update my PR to use a dict instead to speed things up.

petermoore14 commented 7 years ago

Updated with a hostport->proxies map to optimize retrieval.
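For anyone reading along, the idea is roughly the following. This is a simplified sketch rather than the code in the PR: `ProxyIndex`, `by_hostport` and `mark` are hypothetical names, the proxy URLs are invented, and it assumes each host:port appears only once in the proxy list (two proxies that differ only in credentials would collide).

```python
from urllib.parse import urlsplit


class ProxyIndex:
    """Tracks proxy state and resolves credential-free URLs in O(1)."""

    def __init__(self, proxy_urls):
        # State per full proxy URL (credentials included).
        self.state = {url: "unchecked" for url in proxy_urls}
        # Built once at startup: 'host:port' -> full proxy URL.
        self.by_hostport = {self._hostport(url): url for url in proxy_urls}

    @staticmethod
    def _hostport(proxy_url):
        parts = urlsplit(proxy_url)
        return f"{parts.hostname}:{parts.port}"

    def mark(self, meta_proxy, state):
        """Mark the proxy behind meta_proxy; return False if it is unknown."""
        full_url = self.by_hostport.get(self._hostport(meta_proxy))
        if full_url is None:
            return False
        self.state[full_url] = state
        return True


index = ProxyIndex([
    "http://user:secret@proxy1.example.com:8000",
    "http://user:secret@proxy2.example.com:8000",
])
index.mark("http://proxy1.example.com:8000", "good")
index.mark("http://proxy2.example.com:8000", "dead")
```

Each lookup is a single dict access, so the per-response cost no longer grows with the number of proxies.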

kmike commented 7 years ago

Fixed by https://github.com/TeamHG-Memex/scrapy-rotating-proxies/pull/8 - thanks @petermoore14!