Closed: lee-hodg closed this issue 4 years ago.
Yes, I think it is reasonable to allow controlling whether to re-generate a User-Agent header on retry. Thanks!
The retry function in Scrapy is named `_retry`, which signals that it is private and not meant to be overridden. I tried writing my own retry function that changes the user agent on each retry, but I don't think it's a good idea to simply replace the original `_retry` like that.
Does anyone have an opinion on the best way to do this? It would be great if Scrapy provided a retry function with a hook for modifying the request on each retry.
I have an idea. I found the two methods that call `_retry`: retries happen both on exceptions and on retryable HTTP status codes. We need to change the user agent in both methods before the request is retried. Here is the code:
```python
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
from fake_useragent import UserAgent


class Retry500Middleware(RetryMiddleware):

    def __init__(self, settings):
        super(Retry500Middleware, self).__init__(settings)
        fallback = settings.get('FAKEUSERAGENT_FALLBACK', None)
        self.ua = UserAgent(fallback=fallback)
        self.ua_type = settings.get('RANDOM_UA_TYPE', 'random')

    def get_ua(self):
        """Gets a random UA based on the type setting (random, firefox...)."""
        return getattr(self.ua, self.ua_type)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Rotate the user agent before the request is retried
            request.headers['User-Agent'] = self.get_ua()
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            # Rotate the user agent before the request is retried
            request.headers['User-Agent'] = self.get_ua()
            return self._retry(request, exception, spider)
```
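To use a middleware like this, Scrapy's built-in retry middleware has to be disabled so the subclass replaces it. A minimal sketch of the relevant `settings.py` entries; the module path `myproject.middlewares` is a placeholder for wherever you put the class:

```python
# settings.py (sketch): disable the stock retry middleware and register
# the custom subclass at the same priority slot. 'myproject.middlewares'
# is a hypothetical module path -- adjust it to your project layout.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewares.Retry500Middleware': 550,
}
```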
@vionemc I like that idea. I'd say it would be a good feature to add to scrapy-fake-useragent, controlled by a setting that should probably be on by default. Do you have the time/capacity to send an MR? Thank you so much.
I haven't thought about how to integrate it with the current extension, and I've never made an open-source MR before, so I'm not sure.
https://github.com/alecxe/scrapy-fake-useragent/pull/24 closes this, correct? @vionemc
Thanks a lot for the changes!
@alecxe Yup. We can close this one. Finally, I can make a contribution to open source. :)
Awesome, I'll add a few tests and publish a new version to PyPI before closing.
Okay, added tests, cleaned things up a bit and deployed 1.2.0 to PyPI. Happy Holidays!
Scrapy's headers class inherits from `scrapy.utils.datatypes.CaselessDict`, which defines its own `setdefault` method. Just like a regular Python dictionary, `setdefault` only stores and returns the supplied default when the given key is not already present: if a 'User-Agent' header is already set, calling `setdefault` with "default ua" leaves the existing value untouched, whereas on headers with nothing set, the default value is of course stored and returned.
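The behaviour described above can be reproduced with a plain `dict`, which mirrors what `CaselessDict.setdefault` does here (the header values are made up for illustration):

```python
# Key already present: setdefault returns the existing value and
# does NOT overwrite it with the supplied default.
headers = {'User-Agent': 'existing ua'}
kept = headers.setdefault('User-Agent', 'default ua')

# Key absent: the default is stored and returned.
empty = {}
stored = empty.setdefault('User-Agent', 'default ua')

print(kept)    # 'existing ua'
print(stored)  # 'default ua'
```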
This is fine for fresh requests that have never had a user-agent set in their headers. But when the retry middleware is also in use, it means the user-agent stays the same for every single retry, which may not be the desired behaviour (especially if you want a different proxy/UA combination per request, so that a failing combination gets retried with a fresh one).
If you just assigned the header directly instead of calling `setdefault`, it would solve this problem.
Or if you don't think the desired behaviour should always be to change the User-Agent upon retries, then there could at least be a setting to allow the user to switch this on/off.