Open kadimon opened 7 years ago
Currently there is no built-in way to do this.
A first option would be to use request.meta['splash']['args']['proxy'] instead of request.meta['proxy'] here: https://github.com/TeamHG-Memex/scrapy-rotating-proxies/blob/35bb23c28d95bdea7d0ae6e964954a48ee927aa2/rotating_proxies/middlewares.py#L110, and removing everything related to download_slot.
A second option would be to change the way request.meta['proxy'] is handled in scrapy-splash - instead of using this proxy for requests to Splash, it may pass it as an argument to Splash.
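The first option boils down to where the proxy is written in request.meta. A minimal sketch on a plain dict, assuming a hypothetical helper name set_splash_proxy and a placeholder proxy address (neither comes from scrapy-rotating-proxies):

```python
def set_splash_proxy(meta, proxy):
    """Store `proxy` as a Splash argument instead of as the Scrapy-level proxy."""
    # Create meta['splash']['args'] if the request does not have it yet.
    args = meta.setdefault('splash', {}).setdefault('args', {})
    args['proxy'] = proxy
    # Drop any Scrapy-level proxy so the request to Splash itself goes out directly.
    meta.pop('proxy', None)
    return meta


# Placeholder values for illustration only.
meta = {'proxy': 'http://10.0.0.9:8080', 'splash': {'args': {'wait': 0.5}}}
set_splash_proxy(meta, 'http://10.0.0.1:8080')
print(meta)
```

In a real middleware the same assignment would be applied to request.meta inside process_request, in place of the request.meta['proxy'] line linked above.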
Hi Mikhail, I'm trying to use the rotating proxies middleware with a hosted headless Chrome to render pages. I tweaked middlewares.py so that it replaces request.url with the Chrome instance URL, passing the actual target URL and the proxy generated by the middleware as URL parameters (which are then handled by headless Chrome).
The spider runs, but I get banned as if no proxy were being used. Looking at the logs on the Chrome server, it doesn't seem like the requests are received, so I'm guessing the middleware drops them and they are sent like regular Scrapy requests.
Is there anything else I would need to update apart from removing the line below?
request.meta['download_slot'] = self.get_proxy_slot(proxy)
Here is my code:
def process_request(self, request, spider):
    if 'proxy' in request.meta and not request.meta.get('_rotating_proxy'):
        return
    proxy = self.proxies.get_random()
    if not proxy:
        if self.stop_if_no_proxies:
            raise CloseSpider("no_proxies")
        else:
            logger.warn("No proxies available; marking all proxies "
                        "as unchecked")
            self.proxies.reset()
            proxy = self.proxies.get_random()
            if proxy is None:
                logger.error("No proxies available even after a reset.")
                raise CloseSpider("no_proxies_after_reset")
    if request.meta['type'] == 'browserless':
        request.replace(body=json.dumps({'code': self.BROWSERLESS_EXEC_CODE, 'context': {'url': request.url}}))
        request.replace(url=self.BROWSERLESS_URL + '&--proxy-server=' + proxy)
        request.replace(method='POST')
        request.replace(headers={'Cache-Control': 'no-cache', 'Content-Type': 'application/json'})
    else:
        request.meta['proxy'] = proxy
    request.meta['_rotating_proxy'] = True
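One likely reason the Chrome server never sees the requests: Scrapy's Request.replace() returns a new Request and leaves the original untouched, so the four replace() calls above have no effect unless their result is kept. A minimal sketch, assuming a hypothetical helper browserless_request_kwargs and placeholder URL and code values, that gathers the changes so they can be applied in a single call:

```python
import json


def browserless_request_kwargs(target_url, proxy, base_url, exec_code):
    """Collect the fields that turn a normal request into a browserless call."""
    return {
        # Same URL layout as in the snippet above: proxy passed as a URL parameter.
        'url': base_url + '&--proxy-server=' + proxy,
        'method': 'POST',
        'body': json.dumps({'code': exec_code, 'context': {'url': target_url}}),
        'headers': {'Cache-Control': 'no-cache', 'Content-Type': 'application/json'},
    }


# Placeholder values for illustration only.
kwargs = browserless_request_kwargs(
    target_url='https://example.com/page',
    proxy='http://10.0.0.1:8080',
    base_url='https://chrome.example.com/function?token=TOKEN',
    exec_code='module.exports = async () => {}',
)
print(kwargs['url'])
```

In the middleware one would then build new_request = request.replace(**kwargs) and return it from process_request. Scrapy reschedules a returned Request and runs it through the middleware chain again, so a marker in request.meta is needed to avoid rewriting it a second time.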
Hi, I'm having the same need. I tried both options proposed:
1) sub-class RotatingProxyMiddleware and implement the first option
2) sub-class SplashMiddleware and implement the second option
1) works better and is simpler to implement.
2) didn't seem to work well, since SplashMiddleware re-enters the middleware chain and a new proxy gets assigned every time, so a change in both RotatingProxyMiddleware and SplashMiddleware would be required to make it work.
@kmike Do you think making RotatingProxyMiddleware SplashMiddleware-aware would be something useful in scrapy-rotating-proxies? If so, I could try making a PR for it.
Thanks
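The re-entry problem described for the second option could be avoided by only picking a proxy when the request does not carry one yet. A minimal sketch on plain dicts, assuming a hypothetical helper proxy_for_request and placeholder proxy addresses (this is not how either library behaves today):

```python
def proxy_for_request(meta, choose_proxy):
    """Reuse the proxy already stored in the Splash args, else pick a new one."""
    args = meta.get('splash', {}).get('args', {})
    existing = args.get('proxy')
    if existing:
        # Second pass through the middleware chain: keep the earlier choice.
        return existing
    return choose_proxy()


# Placeholder values for illustration only.
first_pass = {'splash': {'args': {}}}
second_pass = {'splash': {'args': {'proxy': 'http://10.0.0.1:8080'}}}
print(proxy_for_request(first_pass, lambda: 'http://10.0.0.2:8080'))
print(proxy_for_request(second_pass, lambda: 'http://10.0.0.2:8080'))
```

A retry after a ban would still need a fresh proxy, so the stored value would have to be cleared before the request is retried.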
Hi, I can't get it to work, tried this on rotating_proxies/middlewares.py
request.meta['splash']['args']['proxy'] = proxy
#request.meta['download_slot'] = self.get_proxy_slot(proxy)
Am I right to assume we have to change the rest of the process_request function and other functions as well?
@mxdev88 can you post the solution you ended up with? (I guess it was the first option.) I have the same problem and would like to test your code, if possible. Thanks
settings.py
...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'xxxxxx.middlewares.SplashRotatingProxyDownloaderMiddleware': 611,
...
Create SplashRotatingProxyDownloaderMiddleware in middlewares.py:
import logging

logger = logging.getLogger(__name__)


class SplashRotatingProxyDownloaderMiddleware:
    def process_request(self, request, spider):
        try:
            proxy = request.meta.get('proxy')
            if proxy and 'splash' in request.meta:
                # proxy switch :)
                # Move the proxy chosen by RotatingProxyMiddleware into the Splash
                # arguments, so the request to Splash itself is sent without a proxy.
                args = request.meta['splash'].setdefault('args', {})
                args['proxy'] = proxy
                request.meta['proxy'] = None
                logger.debug(f'Serving Splash proxy: {proxy}')
        except Exception as e:
            logger.exception(e)
        return None
Hi! How do I use this proxy rotator and scrapy-splash together?
These settings don't work: