john-parton opened this issue 7 years ago
I have this problem too: lots of unchecked proxies, but no dead ones.
[rotating_proxies.middlewares] INFO: Proxies(good: 97, dead: 0, unchecked: 97, reanimated: 6, mean backoff time: 0s)
Edit:
I think the 'problem' is that the proxy is chosen at random from both the good and the unchecked ones.
The main issue here is that I have DOWNLOAD_DELAY set to 1000 seconds, and according to the docs it is now applied per-proxy. So am I wrong in saying that, in theory, if I have 100 proxies, each of them should start with a request and then observe its own 1000-second delay?
If so, picking a new proxy at random from the good and unchecked ones would slow the spider down. In theory you could end up randomly hitting only 10 of the 100 proxies each time get_random()
is called, so you'd wait 1000 seconds for each of those 10 proxies while leaving 90 proxies unused.
Thoughts on this?
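(A quick back-of-the-envelope check of that intuition, as a toy snippet rather than anything from the library: with uniform random selection, a sizable fraction of the pool sits idle in any window of 100 picks.)

import random

# toy simulation of get_random(): uniform choice over a pool of 100 proxies
pool = list(range(100))
picks = [random.choice(pool) for _ in range(100)]
print("distinct proxies used:", len(set(picks)))    # typically ~63
print("proxies left idle:", 100 - len(set(picks)))  # typically ~37, about 1/e of the pool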
sorry to shamelessly bump, but bump?
Have the same problem, bump
Got the same problem, and I have another question too: does Scrapy wait until the unchecked count drops to 0 before it starts using the good proxies? The max retry count is 5, yet it looks like a good proxy isn't used even once across those 6 attempts.
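(For reference, the retry count mentioned above appears to correspond to the ROTATING_PROXY_PAGE_RETRY_TIMES option from the scrapy-rotating-proxies README, whose documented default is 5:)

# settings.py: how many times to retry a page with a different proxy
ROTATING_PROXY_PAGE_RETRY_TIMES = 5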
shameless bump
I have the same problem
Same issue. For me, it appears that it's de-duplicating proxies based on host and port, so if two entries share the same host and port, they remain unchecked and only one is used.
How about something like this?
import random
from rotating_proxies.middlewares import RotatingProxyMiddleware
from rotating_proxies.expire import Proxies

class MyRotatingProxiesMiddleware(RotatingProxyMiddleware):
    def __init__(self, proxy_list, logstats_interval, stop_if_no_proxies,
                 max_proxies_to_try, backoff_base, backoff_cap):
        super().__init__(proxy_list, logstats_interval, stop_if_no_proxies,
                         max_proxies_to_try, backoff_base, backoff_cap)
        # swap the stock Proxies container for the round-robin variant below
        self.proxies = MyProxies(self.cleanup_proxy_list(proxy_list),
                                 backoff=self.proxies.backoff)

class MyProxies(Proxies):
    def __init__(self, proxy_list, backoff=None):
        super().__init__(proxy_list, backoff)
        self.chosen = []

    def get_random(self):
        available = list(self.unchecked | self.good)
        if not available:
            return None
        # build the list of unchecked+good proxies, excluding already used ones
        not_picked_yet = [x for x in available if x not in self.chosen]
        if not not_picked_yet:
            # empty only once every good+unchecked proxy has been used:
            # reset the chosen list and start a new pass
            self.chosen = []
            not_picked_yet = list(available)
        # randomly pick a proxy that hasn't been used in this pass
        chosen_proxy = random.choice(not_picked_yet)
        # mark it as chosen
        self.chosen.append(chosen_proxy)
        return chosen_proxy
Then use MyRotatingProxiesMiddleware.
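To wire it in, you'd register the subclass in place of the stock middleware in settings.py. The 610/620 priorities are the ones the scrapy-rotating-proxies README suggests; 'myproject.middlewares' is a hypothetical module path.

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # hypothetical path: point this at wherever you defined the subclass
    'myproject.middlewares.MyRotatingProxiesMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}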
Did this work for you?
bump
Did this work for you?
iirc, yes
bump
please check my solution above; it could still work
After running the crawler for over a day, I still have a lot of proxies in the "unchecked" state.
It looks like those 524 unchecked proxies are just timing out, but they're not getting moved to dead, so a lot of time is wasted sending requests to them.
I set my timeout pretty low with DOWNLOAD_TIMEOUT = 15.
Let me know if you need anything from me: parts of my crawler, settings, etc.
Thanks.
Edit: I have the BanDetectionMiddleware installed.
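(One workaround sketch, assuming the BanDetectionPolicy interface documented in the scrapy-rotating-proxies README: treat timeouts as bans so that timing-out proxies back off and eventually go dead instead of lingering as unchecked. TimeoutIsBanPolicy and the module path are hypothetical names, not part of the library.)

from twisted.internet.error import TCPTimedOutError, TimeoutError
from rotating_proxies.policy import BanDetectionPolicy

class TimeoutIsBanPolicy(BanDetectionPolicy):
    # hypothetical policy: count timeouts against the proxy
    def exception_is_ban(self, request, exception):
        if isinstance(exception, (TimeoutError, TCPTimedOutError)):
            return True
        return super().exception_is_ban(request, exception)

Then point the ban policy at it in settings.py, e.g. ROTATING_PROXY_BAN_POLICY = 'myproject.policy.TimeoutIsBanPolicy'.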