TeamHG-Memex / scrapy-rotating-proxies

use multiple proxies with Scrapy
MIT License

Proxies Stuck in unchecked state #12

Open john-parton opened 7 years ago

john-parton commented 7 years ago

After running the crawler for over a day, I still have a lot of proxies in the "unchecked" state.

[rotating_proxies.middlewares] INFO: Proxies(good: 147, dead: 3226, unchecked: 524, reanimated: 167, mean backoff time: 4254s)

It looks like those 524 unchecked proxies are just timing out, but they're not getting moved to dead, so a lot of time is wasted sending requests to them.

I set my timeout pretty low with DOWNLOAD_TIMEOUT = 15.

Let me know if you need anything from me: parts of my crawler, settings, etc.

Thanks.

Edit: I have the BanDetectionMiddleware installed.

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
peterlupu commented 6 years ago

I have this problem too, lots of unchecked proxies, but I have no dead ones.

[rotating_proxies.middlewares] INFO: Proxies(good: 97, dead: 0, unchecked: 97, reanimated: 6, mean backoff time: 0s)

Edit:

I think the 'problem' is that the proxy is picked at random from both the good and the unchecked ones.

The main issue here is that I have DOWNLOAD_DELAY set to 1000 seconds, and according to the docs it is now applied per proxy. So am I wrong in thinking that, in theory, if I have 100 proxies, each should start with a request and then observe its own 1000-second delay?

If so, picking a new proxy at random from the good and unchecked ones would slow the spider down. In theory, get_random() could keep returning only 10 of the 100 proxies, so you'd be waiting out the 1000-second delay on each of those 10 proxies while the other 90 sit unused.
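
A rough back-of-the-envelope sketch of that concern (my own illustration, not the library's code): if a proxy is drawn uniformly at random from good+unchecked for every request, a sizeable fraction of the pool can go unused over a round of requests.

import random

# Sketch: draw a proxy uniformly at random for each of 100 requests
# from a pool of 100 proxies and count how many distinct proxies get used.
proxies = [f"proxy-{i}" for i in range(100)]
picks = [random.choice(proxies) for _ in range(100)]
print(f"{len(set(picks))} distinct proxies used out of {len(proxies)}")
# Typically around 63 distinct proxies: with a long per-proxy DOWNLOAD_DELAY,
# the repeatedly-picked proxies become the bottleneck while the rest sit idle.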

Thoughts on this?

peterlupu commented 5 years ago

sorry to shamelessly bump, but bump?

pioter83 commented 5 years ago

Have the same problem, bump

ErangaD commented 5 years ago

I've got the same problem, and another question too: does Scrapy wait until the unchecked count reaches 0 before it starts using the good proxies? (screenshots attached) The max retry count is 5, yet it looks like a good proxy was not used in any of the 6 attempts.
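
For reference, the retry count mentioned here is presumably ROTATING_PROXY_PAGE_RETRY_TIMES, the library setting that controls how many different proxies are tried for a page before it is given up on; its documented default is 5.

# settings.py -- how many different proxies to try for a single page
# before giving up on it (the library's documented default is 5)
ROTATING_PROXY_PAGE_RETRY_TIMES = 5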

wittedhaddock commented 5 years ago

shameless bump

mapb1994 commented 5 years ago

I have the same problem

danjdewhurst commented 4 years ago

Same issue. For me, it appears the middleware de-duplicates proxies by host and port, so when several entries share the same host:port, only one of them is ever used and the rest stay unchecked.
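
A minimal sketch of that effect, assuming the pool is keyed by host:port (an assumption about the internals, not verified): two list entries that differ only in credentials collapse into a single key, so one of them is never scheduled and stays unchecked.

from urllib.parse import urlsplit

def hostport(proxy_url):
    # reduce a proxy URL to host:port, dropping scheme and credentials
    parts = urlsplit(proxy_url)
    return f"{parts.hostname}:{parts.port}"

proxy_list = [
    "http://user1:pass1@1.2.3.4:8080",
    "http://user2:pass2@1.2.3.4:8080",  # same host:port, different credentials
]

by_hostport = {hostport(p): p for p in proxy_list}
print(by_hostport)
# {'1.2.3.4:8080': 'http://user2:pass2@1.2.3.4:8080'}
# Only one entry survives per host:port, so the other proxy would never be
# used and would stay "unchecked" forever.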

peterlupu commented 4 years ago

How about something like this?

import random
from rotating_proxies.middlewares import RotatingProxyMiddleware
from rotating_proxies.expire import Proxies

class MyRotatingProxiesMiddleware(RotatingProxyMiddleware):
    def __init__(self, proxy_list, logstats_interval, stop_if_no_proxies, max_proxies_to_try, backoff_base, backoff_cap):
        super().__init__(proxy_list, logstats_interval, stop_if_no_proxies, max_proxies_to_try, backoff_base, backoff_cap)
        self.proxies = MyProxies(self.cleanup_proxy_list(proxy_list), backoff=self.proxies.backoff)

class MyProxies(Proxies):
    def __init__(self, proxy_list, backoff=None):
        super().__init__(proxy_list, backoff)
        self.chosen = []

    def get_random(self):
        available = list(self.unchecked | self.good)

        if not available:
            return None

        # build the list of good+unchecked proxies not yet used in this round
        not_picked_yet = [x for x in available if x not in self.chosen]
        if not not_picked_yet:
            # every good/unchecked proxy has been used once; reset and start a new round
            self.chosen = []
            not_picked_yet = list(available)

        # randomly pick a proxy from the remaining good/unchecked ones
        chosen_proxy = random.choice(not_picked_yet)
        # mark as chosen
        self.chosen.append(chosen_proxy)
        return chosen_proxy

Then use MyRotatingProxiesMiddleware.
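
To wire it in, the stock middleware entry would be swapped for the custom one in settings.py (the 'myproject.middlewares' path below is just a placeholder for wherever the subclass is defined):

# settings.py -- replace the stock RotatingProxyMiddleware with the subclass
# ('myproject.middlewares' is a placeholder module path)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyRotatingProxiesMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}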

rajatshenoy56 commented 4 years ago

[quoting peterlupu's MyRotatingProxiesMiddleware snippet above]

Did this work for you?

timpal0l commented 2 years ago

bump

peterlupu commented 2 years ago

[quoting rajatshenoy56's question above: "Did this work for you?"]

iirc, yes

As for the bump: please check my solution above, it could still be working.