TeamHG-Memex / scrapy-rotating-proxies

use multiple proxies with Scrapy
MIT License

Shouldn't an IgnoreRequest be raised when max_proxies_to_try is reached? #51

Open codekoriko opened 4 years ago

codekoriko commented 4 years ago

I implemented a ban policy that marks 302 redirects as a "ban".

But once the request reaches the maximum number of retries, it is let through and is therefore picked up by scrapy.downloadermiddlewares.redirect,

which in turn starts a new max_proxies_to_try cycle for the redirected request (a useless captcha page):

2020-10-02 05:31:07 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET http://www.url.com> (failed 6 times with different proxies)
2020-10-02 05:31:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.url.com/redirected/to/captacha> from <GET http://www.url.com>
2020-10-02 05:31:10 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET http://www.url.com/redirected/to/captacha> (failed 6 times with different proxies)
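
For context, a ban policy like the one mentioned above could look roughly like the sketch below. It is not my exact code: the class name `RedirectIsBanPolicy` and the `BAN_STATUSES` attribute are made up for illustration, and treating every download exception as a ban is a simplifying assumption. It relies only on the two methods the middleware expects a policy to provide, `response_is_ban` and `exception_is_ban`.

```python
class RedirectIsBanPolicy:
    """Hypothetical ban policy: a 302 response means the proxy was banned."""

    # Status codes treated as a ban in this sketch (assumption: only 302).
    BAN_STATUSES = {302}

    def response_is_ban(self, request, response):
        # The site answers banned proxies with a redirect to a captcha page.
        return response.status in self.BAN_STATUSES

    def exception_is_ban(self, request, exception):
        # Simplification: any download exception counts as a ban.
        return True
```

Such a class would be enabled through the `ROTATING_PROXY_BAN_POLICY` setting, for example `ROTATING_PROXY_BAN_POLICY = 'myproject.policy.RedirectIsBanPolicy'`, where the module path is a placeholder.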

Shouldn't we add a raise IgnoreRequest() like so:

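    # Needs `from scrapy.exceptions import IgnoreRequest` at the top of rotating_proxies/middlewares.py.
    def _retry(self, request, spider):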
        retries = request.meta.get('proxy_retry_times', 0) + 1
        max_proxies_to_try = request.meta.get('max_proxies_to_try',
                                              self.max_proxies_to_try)

        if retries <= max_proxies_to_try:
            logger.debug("Retrying %(request)s with another proxy "
                         "(failed %(retries)d times, "
                         "max retries: %(max_proxies_to_try)d)",
                         {'request': request, 'retries': retries,
                          'max_proxies_to_try': max_proxies_to_try},
                         extra={'spider': spider})
            retryreq = request.copy()
            retryreq.meta['proxy_retry_times'] = retries
            retryreq.dont_filter = True
            return retryreq
        else:
            logger.debug("Gave up retrying %(request)s (failed %(retries)d "
                         "times with different proxies)",
                         {'request': request, 'retries': retries},
                         extra={'spider': spider})
            # Proposed change: drop the request instead of letting the banned response through.
            raise IgnoreRequest("Max retries reached")
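
To make the intended behaviour easier to reason about, here is a minimal standalone sketch of the same retry-budget logic, runnable without Scrapy. The `IgnoreRequest` class below is only a stand-in for `scrapy.exceptions.IgnoreRequest`, and `retry_or_drop` is a hypothetical helper that works on a plain `meta` dict instead of a real request.

```python
class IgnoreRequest(Exception):
    """Stand-in for scrapy.exceptions.IgnoreRequest (keeps the sketch Scrapy-free)."""


def retry_or_drop(meta, max_proxies_to_try=5):
    """Return updated meta for another attempt, or raise once the budget is spent."""
    retries = meta.get('proxy_retry_times', 0) + 1
    # A per-request override in meta takes precedence over the default.
    limit = meta.get('max_proxies_to_try', max_proxies_to_try)
    if retries <= limit:
        new_meta = dict(meta)  # copy, like request.copy() in the middleware
        new_meta['proxy_retry_times'] = retries
        return new_meta
    # Budget exhausted: signal that the request should be dropped.
    raise IgnoreRequest("Max retries reached")


# Simulate a request that gets banned on every proxy it tries.
meta = {}
attempts = 0
dropped_reason = None
try:
    while True:
        meta = retry_or_drop(meta, max_proxies_to_try=5)
        attempts += 1
except IgnoreRequest as exc:
    dropped_reason = str(exc)
```

With a limit of 5, the loop schedules five retries and the sixth failure raises, which matches the "failed 6 times" message in the log above. In Scrapy itself, an `IgnoreRequest` raised from a downloader middleware's `process_response` sends the request to its errback (if any) instead of passing the response on to the remaining middlewares, so the redirect middleware would never see the 302.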