alecxe / scrapy-fake-useragent

Random User-Agent middleware based on fake-useragent
MIT License

FAKE_USERAGENT_RANDOM_UA_TYPE = 'desktop' >>> KeyError: 'desktop' #31

Closed codekoriko closed 4 years ago

codekoriko commented 4 years ago

Unlike what the docs say, it seems the only possible filters are by browser "type", not platform type, i.e. "firefox", "chrome", "safari", etc.

Any guess about which one is least likely to have mobile User-Agents in its list?

alecxe commented 4 years ago

@psychonaute the fake-useragent provider does not support desktop, from what I understand.

Faker does, though, for sure:

FAKEUSERAGENT_PROVIDERS = ['scrapy_fake_useragent.providers.FakerProvider']
FAKER_RANDOM_UA_TYPE = 'desktop'

Hope that helps.

alecxe commented 4 years ago

I've also updated the README to remove confusion. Thank you for spotting!

codekoriko commented 4 years ago

I used your settings with latest version: scrapy-fake-useragent==1.4.4

FAKEUSERAGENT_PROVIDERS = ['scrapy_fake_useragent.providers.FakerProvider']
FAKER_RANDOM_UA_TYPE = 'desktop'

Still not working for me 🔎

With all the Faker versions I tried (Faker==4.1.1, Faker==4.1.2, Faker==4.1.3), the only available options are: user_agents: ('chrome', 'firefox', 'internet_explorer', 'opera', 'safari')

Am I missing something?

alecxe commented 4 years ago

@psychonaute oh, you are totally right, there is no such option in the Faker UserAgent provider at all. Don't listen to me :) I've updated the docs to remove that option entirely.

But since you can have your own custom provider, you could create one that keeps generating Faker user agent strings until it produces a "desktop" (or "mobile", if you need that) one. Here is an example based on the python-user-agents package (https://github.com/selwin/python-user-agents):

from scrapy_fake_useragent.providers import FakerProvider
from user_agents import parse

class DesktopUserAgentProvider(FakerProvider):
    def get_random_ua(self):
        """Keep generating Faker user agents until one parses as a desktop (PC) browser.

        For mobile user agents, check parse(user_agent).is_mobile instead.
        """
        user_agent = self._ua.user_agent()
        while not parse(user_agent).is_pc:
            user_agent = self._ua.user_agent()
        return user_agent

# Scrapy settings: point at the module where the class above is defined
FAKEUSERAGENT_PROVIDERS = [
    'demo.settings.DesktopUserAgentProvider',
]

Run pip install pyyaml ua-parser user-agents before executing this.
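One caveat with the while loop above: it never terminates if the generator cannot produce a matching string. A minimal bounded variant is sketched below using only the standard library. The names SAMPLE_UAS, MOBILE_TOKENS, looks_like_desktop and pick_desktop_ua are hypothetical, and the token check is a crude stand-in for parse(user_agent).is_pc, not a replacement for it.

```python
import random

# Hypothetical sample pool standing in for a real generator such as Faker.
SAMPLE_UAS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.1 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 10; Pixel 3) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.96 Mobile Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0",
]

# Assumption: these substrings are enough to flag a mobile UA in this sketch.
MOBILE_TOKENS = ("Mobile", "Android", "iPhone", "iPad")


def looks_like_desktop(user_agent):
    """Return True when none of the mobile tokens appear in the string."""
    return not any(token in user_agent for token in MOBILE_TOKENS)


def pick_desktop_ua(generate, max_attempts=50):
    """Call generate() until a desktop-looking UA appears, or give up."""
    for _ in range(max_attempts):
        candidate = generate()
        if looks_like_desktop(candidate):
            return candidate
    raise RuntimeError(f"no desktop user agent found in {max_attempts} attempts")


# Usage: draw from the sample pool until a desktop-looking UA comes up.
print(pick_desktop_ua(lambda: random.choice(SAMPLE_UAS)))
```

Inside a provider, generate would be self._ua.user_agent and the predicate would be the is_pc check from the example above; raising after max_attempts makes a misconfigured filter visible instead of hanging the crawl.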

Hope that helps.

codekoriko commented 4 years ago

I had a similar idea and was looking at this parser: https://github.com/thinkwelltwd/device_detector, but the parsing was very slow. python-user-agents, on the other hand, is very fast and seems reliable enough.

Anyway, I ended up creating my own provider using https://github.com/Luqman-Ud-Din/random_user_agent. Its user agent database is 2 years old (~300k UAs), but it has a "popularity" criterion that is appealing.

In my settings.py:

RANDOMUSERAGENT_RANDOM_UA_TYPE = {
    'hardware_types': 'COMPUTER',
    'popularity': 'POPULAR',
}

In my providers.py:

import logging

from random_user_agent.params import (
    HardwareType, OperatingSystem, Popularity,
    SoftwareEngine, SoftwareName, SoftwareType,
)
from random_user_agent.user_agent import UserAgent
from scrapy_fake_useragent.providers import BaseProvider

logger = logging.getLogger(__name__)


class RandomUserAgentProvider(BaseProvider):
    """
    Provides a random set of UA strings, powered by the random_user_agent library.
    """

    # No filters by default; must be a dict because __init__ iterates over it
    DEFAULT_UA_TYPE = {}

    # mapping Enum class - init params equivalence
    CLASS_MAP = {
        'hardware_types': HardwareType,
        'software_types': SoftwareType,
        'software_names': SoftwareName,
        'software_engines': SoftwareEngine,
        'operating_systems': OperatingSystem,
        'popularity': Popularity,
    }

    def __init__(self, settings):
        BaseProvider.__init__(self, settings)

        self._ua_type = settings.get('RANDOMUSERAGENT_RANDOM_UA_TYPE',
                                     self.DEFAULT_UA_TYPE)

        # loop through our filters to retrieve their init param's value
        params = {}
        for filter_cat, filter_value in self._ua_type.items():
            match = getattr(self.CLASS_MAP[filter_cat], filter_value.upper(), None)
            if match is None:
                logger.error("Couldn't find a matching filter for '%s'", filter_value)
                raise ValueError("Couldn't find a matching filter for: '%s'" % filter_value)
            # UserAgent takes a list of values per filter category
            params[filter_cat] = [match.value]

        # build a list of up to 100 UAs to randomly pick from
        self._ua = UserAgent(limit=100, **params)

    def get_random_ua(self):
        criteria = " | ".join(self._ua_type.values())
        try:
            ua = self._ua.get_random_user_agent()
        except IndexError:
            # An empty pool (no UA matched the filters) ends up here; there is
            # no sensible default to fall back to, so log and re-raise.
            logger.error("Couldn't retrieve a UA matching those criteria: '%s'. "
                         "Beware of conflicting ones like 'ANDROID | COMPUTER'.",
                         criteria)
            raise
        nb_ua = len(self._ua.user_agents)
        if nb_ua < 100:
            logger.warning("Only %s UAs matched those criteria: '%s'. "
                           "Try using less restrictive ones",
                           nb_ua, criteria)
        return ua
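The filter lookup done in __init__ can be tried out on its own. Below is a minimal sketch of that name-to-Enum translation, assuming stand-in Enum classes so that it runs without the library installed. The classes, their members, and the build_params helper are hypothetical and only mirror the shape of random_user_agent.params.

```python
from enum import Enum


class HardwareType(Enum):
    """Hypothetical stand-in for random_user_agent.params.HardwareType."""
    COMPUTER = "computer"
    MOBILE = "mobile"


class Popularity(Enum):
    """Hypothetical stand-in for random_user_agent.params.Popularity."""
    POPULAR = "popular"
    COMMON = "common"


# Maps each filter category to the Enum class listing its valid names.
CLASS_MAP = {
    "hardware_types": HardwareType,
    "popularity": Popularity,
}


def build_params(filters):
    """Translate {'category': 'MEMBER_NAME'} into {'category': [member.value]}."""
    params = {}
    for category, name in filters.items():
        enum_cls = CLASS_MAP.get(category)
        if enum_cls is None:
            raise ValueError(f"unknown filter category: {category!r}")
        # Look the name up case-insensitively among the Enum's members.
        member = enum_cls.__members__.get(name.upper())
        if member is None:
            valid = ", ".join(enum_cls.__members__)
            raise ValueError(
                f"unknown {category} value {name!r}; expected one of: {valid}"
            )
        params[category] = [member.value]
    return params


print(build_params({"hardware_types": "COMPUTER", "popularity": "popular"}))
```

Listing the valid names in the error message makes a typo in settings.py much quicker to track down than a bare "no match" message.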

alecxe commented 4 years ago

This is awesome, thank you for sharing!

fugkco commented 1 year ago

@psychonaute's solution is great, but is missing a few things:

It needs a classmethod to ensure the settings are passed in when the provider is constructed:

class RandomUserAgentProvider(BaseProvider):
    # ...
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        # ...

And to actually have Scrapy use it, you'll need to add it to the downloader middlewares setting:

DOWNLOADER_MIDDLEWARES = {
    # other middlewares
    'scraper_module.providers.RandomUserAgentProvider': 120,
}
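For comparison, earlier in this thread providers are registered through FAKEUSERAGENT_PROVIDERS rather than as middlewares. Below is a settings sketch along those lines. It assumes the hypothetical scraper_module path from above, and takes the middleware paths and the 400 priority from the scrapy-fake-useragent README as I understand it, so check them against the version you have installed.

```python
# settings.py (sketch; module paths and priorities are assumptions to verify)

DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Enable the random User-Agent middleware from scrapy-fake-useragent
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

# Providers are tried in order; the custom one comes first.
FAKEUSERAGENT_PROVIDERS = [
    'scraper_module.providers.RandomUserAgentProvider',
    'scrapy_fake_useragent.providers.FixedUserAgentProvider',  # falls back to USER_AGENT
]

# Filters read by the custom provider shown earlier in the thread
RANDOMUSERAGENT_RANDOM_UA_TYPE = {
    'hardware_types': 'COMPUTER',
    'popularity': 'POPULAR',
}
```

With this layout the provider class itself is not listed in DOWNLOADER_MIDDLEWARES; only the library's middleware is, and it loads the providers named in the list.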

@alecxe any chance we can get something like this solution included in the library? I feel like everyone is potentially creating their own small module to do this, which causes a lot of duplicated effort.