TeamHG-Memex / scrapy-rotating-proxies

use multiple proxies with Scrapy
MIT License

All proxies unchecked except one #34

Open milicamilivojevic opened 4 years ago

milicamilivojevic commented 4 years ago

    Proxies(good: 1, dead: 0, unchecked: 19999, reanimated: 0, mean backoff time: 0s)
    Proxies(good: 0, dead: 1, unchecked: 19999, reanimated: 0, mean backoff time: 572s)

I have a list of 20,000 proxies in this format:

    username1:password1@host:port
    username2:password2@host:port
    username3:password3@host:port
    username4:password4@host:port
    username5:password5@host:port
    username6:password6@host:port
    username7:password7@host:port

All IPs and ports are the same, but the usernames and passwords are different. With this format only one proxy ever gets used and the others stay unchecked, even though I have 20,000 proxies. Can you please help me?
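
For reference, this is roughly how such a list gets wired up (the setting names and middleware priorities are from the project README; the file name is just an example):

    # settings.py
    ROTATING_PROXY_LIST_PATH = 'proxies.txt'  # one proxy per line, as above
    DOWNLOADER_MIDDLEWARES = {
        'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
        'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    }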

darshanlol commented 4 years ago

Getting the same issue; my guess is it has something to do with how proxy authorization is handled.

tadaoisgod commented 4 years ago

@darshanlol is right, it involves proxy authorization, but not directly.

The problem is in the get_proxy method in expire.py: when the proxy URLs carry credentials, this method looks a proxy up by its host:port alone. Scrapy's HttpProxyMiddleware strips the authorization part from the proxy URL, so all 20,000 entries collapse onto the same key and the good/dead/unchecked marking breaks.
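
A minimal standalone sketch of that collapse (the host/port values are made up; expire.py keys proxies by host:port in a similar way):

    from urllib.parse import urlsplit

    proxies = [
        'http://username1:password1@1.2.3.4:8000',
        'http://username2:password2@1.2.3.4:8000',
    ]
    # With the credentials ignored, every entry reduces to the same
    # host:port key, so the bookkeeping can only ever track one proxy.
    for p in proxies:
        parts = urlsplit(p)
        print(parts.hostname, parts.port)  # prints '1.2.3.4 8000' both times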

I just added another key to request.meta with the original proxy URL and it worked. It's ultimately a workaround for the issue above rather than a proper fix, but it works. These are the changes I made in the middlewares.py file, in the RotatingProxyMiddleware class:

    def process_request(self, request, spider):
        if 'proxy' in request.meta and not request.meta.get('_rotating_proxy'):
            return
        proxy = self.proxies.get_random()
        if not proxy:
            if self.stop_if_no_proxies:
                raise CloseSpider("no_proxies")
            else:
                logger.warning("No proxies available; marking all proxies "
                               "as unchecked")
                self.proxies.reset()
                proxy = self.proxies.get_random()
                if proxy is None:
                    logger.error("No proxies available even after a reset.")
                    raise CloseSpider("no_proxies_after_reset")

        request.meta['proxy'] = proxy
        request.meta['download_slot'] = self.get_proxy_slot(proxy)
        request.meta['_rotating_proxy'] = True
        request.meta['_original_proxy_url'] = proxy    # adding new variable here

...

    def _handle_result(self, request, spider):
        proxy = request.meta.get("_original_proxy_url", None)      # changing proxy variable to grab from request.meta
        if not (proxy and request.meta.get("_rotating_proxy")):
            return
        self.stats.set_value(
            "proxies/unchecked",
            len(self.proxies.unchecked) - len(self.proxies.reanimated),
        )
        self.stats.set_value("proxies/reanimated", len(self.proxies.reanimated))
        self.stats.set_value("proxies/mean_backoff", self.proxies.mean_backoff_time)
        ban = request.meta.get("_ban", None)
        if ban is True:
            self.proxies.mark_dead(proxy)
            self.stats.set_value("proxies/dead", len(self.proxies.dead))
            return self._retry(request, spider)
        elif ban is False:
            self.proxies.mark_good(proxy)
            self.stats.set_value("proxies/good", len(self.proxies.good))

3hhh commented 4 years ago

I can confirm this one.

It seems that requests do initially go out with the different usernames & passwords, even though the reporting doesn't work.

However, retries tend to use the same bad proxy again and again, essentially making them useless. So this is not just a reporting problem. I'm not sure whether the workaround above also helps with that.

Maybe it's because [1] removes the user & password from request.meta['proxy'], which is later returned with the response and hits [2], including the _retry().

[1] https://docs.scrapy.org/en/latest/_modules/scrapy/downloadermiddlewares/httpproxy.html#HttpProxyMiddleware
[2] https://github.com/TeamHG-Memex/scrapy-rotating-proxies/blob/master/rotating_proxies/middlewares.py#L161
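
If so, it's easy to see outside Scrapy. A rough sketch of what [1] effectively does (not the middleware's actual code): the credentials are moved into a Proxy-Authorization header and only the bare URL stays in request.meta['proxy']:

    from base64 import b64encode
    from urllib.parse import urlsplit

    def split_proxy(proxy_url):
        # Roughly what HttpProxyMiddleware does: pull the credentials out
        # for a Proxy-Authorization header and keep only scheme://host:port.
        parts = urlsplit(proxy_url)
        creds = None
        if parts.username:
            creds = b64encode(f'{parts.username}:{parts.password}'.encode())
        return f'{parts.scheme}://{parts.hostname}:{parts.port}', creds

    bare, creds = split_proxy('http://username1:password1@1.2.3.4:8000')
    # bare == 'http://1.2.3.4:8000' -- this is all that comes back with the
    # response, so the middleware in [2] cannot tell the proxies apart.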

3hhh commented 4 years ago

I guess this one is not too uncommon, as rotating proxy providers tend to implement their APIs via the username.