alecxe / scrapy-fake-useragent

Random User-Agent middleware based on fake-useragent
MIT License

Usage of setdefault and retries #20

Closed lee-hodg closed 4 years ago

lee-hodg commented 6 years ago

Scrapy's Headers class inherits from scrapy.utils.datatypes.CaselessDict, whose setdefault method is defined as:

def setdefault(self, key, def_val=None):
    return dict.setdefault(self, self.normkey(key), self.normvalue(def_val))

So just like a regular Python dictionary, setdefault only inserts the given value when the key is not already present, and returns the value stored under that key. For example:

d = {'User-Agent': 'some fake ua'}
d.setdefault('User-Agent',  'default ua')
d['User-Agent']
Out[10]: 'some fake ua'    

Note how this wasn't "default ua", because a 'User-Agent' was already set. Compare this to:

d = {}
d.setdefault('User-Agent', 'default ua')
d['User-Agent']
Out[13]: 'default ua'

There was nothing set, so in this case the default value was inserted and returned.

This is fine for fresh requests that never had a user-agent set in the headers. But if you are also using Scrapy's retry middleware, it means the user-agent will be the same for every single retry, which might not be the desired behaviour (especially if you want a different proxy/UA combination per request: when one combination is failing, you want to retry with a fresh combo).
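The problem above can be reproduced without Scrapy at all. Here is a minimal sketch using a plain dict in place of Scrapy's Headers (which behaves the same way for this purpose, modulo case normalization):

```python
# Sketch: a middleware that assigns the UA via setdefault is a no-op
# on a retried request, because the retried request already carries
# the headers from the first attempt.

def assign_ua_with_setdefault(headers, new_ua):
    """Mimics the middleware's setdefault-based assignment."""
    headers.setdefault('User-Agent', new_ua)
    return headers

headers = {}
assign_ua_with_setdefault(headers, 'ua-for-first-attempt')

# On retry, the same headers come back through the middleware,
# so the freshly generated UA is silently discarded:
assign_ua_with_setdefault(headers, 'ua-for-retry')
print(headers['User-Agent'])  # still 'ua-for-first-attempt'
```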

If you just did

request.headers['User-Agent'] = ....

it would solve this problem.

Or, if you don't think the desired behaviour should always be to change the User-Agent upon retries, then there could at least be a setting that lets the user switch this on/off.
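The suggested toggle could look something like the following sketch. The setting name (refresh_on_retry) and the class are hypothetical, purely to illustrate the two assignment strategies:

```python
# Hypothetical sketch of the proposed opt-in behaviour. Nothing here
# is part of scrapy-fake-useragent's actual API.

class FakeUserAgentSketch:
    def __init__(self, refresh_on_retry, ua_source):
        self.refresh_on_retry = refresh_on_retry  # hypothetical toggle
        self.ua_source = ua_source  # callable returning a random UA

    def process_request(self, headers):
        if self.refresh_on_retry:
            # Always overwrite, so retried requests get a fresh UA.
            headers['User-Agent'] = self.ua_source()
        else:
            # Current behaviour: keep whatever UA is already set.
            headers.setdefault('User-Agent', self.ua_source())
        return headers

mw = FakeUserAgentSketch(refresh_on_retry=True, ua_source=lambda: 'fresh ua')
headers = {'User-Agent': 'stale ua'}
mw.process_request(headers)
print(headers['User-Agent'])  # 'fresh ua'
```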

alecxe commented 6 years ago

Yes, I think it is reasonable to let users control whether to re-generate the user-agent header on retry or not. Thanks!

vionemc commented 5 years ago

The retry function in Scrapy is named _retry, the leading underscore signaling that it's private and best left unchanged. I tried writing my own retry function so it would change the user agent on each retry, but I don't think it's right to just replace the original retry like that.

Does anyone have an opinion on the best way to do this? If Scrapy provided a retry function with a hook to change things on the request on each retry, that would be great.

vionemc commented 5 years ago

I have an idea. I found two methods calling _retry, so a retry happens either on an exception or on a retry HTTP status. We need to change the user agent on the request in both methods before it is retried. So this is the code:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

from fake_useragent import UserAgent

class Retry500Middleware(RetryMiddleware):

    def __init__(self, settings):
        super(Retry500Middleware, self).__init__(settings)

        fallback = settings.get('FAKEUSERAGENT_FALLBACK', None)
        self.ua = UserAgent(fallback=fallback)
        self.ua_type = settings.get('RANDOM_UA_TYPE', 'random')

    def get_ua(self):
        '''Gets random UA based on the type setting (random, firefox…)'''
        return getattr(self.ua, self.ua_type)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Assign a fresh UA before handing the request off to _retry
            request.headers['User-Agent'] = self.get_ua()
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            request.headers['User-Agent'] = self.get_ua()
            return self._retry(request, exception, spider)

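To wire a middleware like this into a project, you would presumably register it in settings.py and disable the built-in RetryMiddleware so retries aren't processed twice. A sketch, assuming the class lives in a hypothetical myproject/middlewares.py:

```python
# Hypothetical settings.py fragment; the module path
# 'myproject.middlewares' and the fallback UA string are assumptions.
DOWNLOADER_MIDDLEWARES = {
    # Disable the stock retry middleware (its default priority is 550)...
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    # ...and install the UA-refreshing replacement in its slot.
    'myproject.middlewares.Retry500Middleware': 550,
}

FAKEUSERAGENT_FALLBACK = 'Mozilla/5.0'  # used if fake-useragent's data fetch fails
RANDOM_UA_TYPE = 'random'
```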
alecxe commented 5 years ago

@vionemc I like that idea. I'd say that could be a good feature to add to scrapy-fake-useragent and control with a setting which should probably be turned on by default. Do you have time/capacity to send an MR? Thank you so much.

vionemc commented 5 years ago

I haven't thought about how to integrate it with the current extension, and I've never made an open-source MR before. So I am not sure.

alecxe commented 4 years ago

https://github.com/alecxe/scrapy-fake-useragent/pull/24 closes this, correct? @vionemc

Thanks a lot for the changes!

vionemc commented 4 years ago

@alecxe Yup. We can close this one. Finally, I can make a contribution to open source. :)

alecxe commented 4 years ago

Awesome, I'll add a few tests and publish a new version to PyPI before closing.

alecxe commented 4 years ago

Okay, added tests, cleaned things up a bit and deployed 1.2.0 to PyPI. Happy Holidays!