alecxe / scrapy-fake-useragent

Random User-Agent middleware based on fake-useragent
MIT License

[CRITICAL] useragentstring.com not working anymore #27

Closed: 0xfede7c8 closed this issue 4 years ago

0xfede7c8 commented 4 years ago
2020-07-17 16:21:31 [fake_useragent] DEBUG: Error occurred during fetching http://useragentstring.com/pages/useragentstring.php?name=Chrome

It is failing to fetch it because the site seems to be down or not working properly. Consider removing it from the list or replacing it with another list.

This problem renders the tool useless.

nacknime-official commented 4 years ago

There's no response from the contributors. What should we do? This lib is very helpful.

0xfede7c8 commented 4 years ago

The problem is in fact in the underlying lib (fake-useragent). I filed an issue there also.

https://github.com/hellysmile/fake-useragent/issues/99

alecxe commented 4 years ago

Yeah, this has pretty much been a ticking time bomb all along, because the plugin depends on useragentstring.com being available.

One workaround here is to set the FAKEUSERAGENT_FALLBACK setting, which gives the plugin a user agent to fall back to when the lookup fails.
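A minimal sketch of that workaround in a Scrapy project's settings.py, assuming the middleware is enabled as described in the plugin's README. The User-Agent string is only a placeholder; substitute whichever one suits your crawl.

```python
# settings.py (fragment)

# Swap Scrapy's built-in User-Agent middleware for the random one.
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
}

# Used when fake-useragent cannot fetch its User-Agent data.
# Placeholder value, not a recommendation.
FAKEUSERAGENT_FALLBACK = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
)
```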

This scrapy plugin is itself a very thin layer over fake-useragent, and we could try having a more "dynamic" fallback, e.g. we could start using faker and its User-Agent provider: https://hexdocs.pm/faker/Faker.Internet.UserAgent.html

Thoughts?

0xfede7c8 commented 4 years ago

Isn't faker a golang library? Which database is faker using?

alecxe commented 4 years ago

Oh, sorry, wrong link. Here is the python-based one I meant: https://faker.readthedocs.io/en/stable/providers/faker.providers.user_agent.html?highlight=user_agent#faker.providers.user_agent.Provider.user_agent

It generates strings rather than looking them up as fake-useragent does, which could be crucial depending on the use case.

alecxe commented 4 years ago

I am leaning towards supporting both use cases in this plugin. We could have a setting that specifies a list of User-Agent providers, which by default would contain only fake-useragent. One could then specify a second provider to be used if the one above fails, and it could be either fake-useragent or a custom one. E.g.:

FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakeUserAgent',
    'scrapy_fake_useragent.providers.Faker',
    'mypackage.providers.CustomProvider'
]

# default one is this (for backwards-compatibility)
# FAKEUSERAGENT_PROVIDERS = [
#     'scrapy_fake_useragent.providers.FakeUserAgent'
# ]

Something like this and, with proper documentation, we would never have to talk about this again :)
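To make the idea concrete, here is a rough sketch of how such a provider list could be walked until one provider succeeds. Everything in it is hypothetical: the `get_random_ua` method name, the two toy providers, and the `first_available_ua` helper are illustrations only, not the plugin's actual code. A real implementation would also need to resolve the dotted paths from the setting into classes.

```python
import random


class BaseProvider:
    """Hypothetical interface: return a UA string, or None on failure."""

    def get_random_ua(self):
        raise NotImplementedError


class FailingLookupProvider(BaseProvider):
    """Stands in for a lookup-based provider whose remote source is down."""

    def get_random_ua(self):
        return None


class StaticListProvider(BaseProvider):
    """Picks a random UA from an in-memory list."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    def get_random_ua(self):
        return random.choice(self.user_agents) if self.user_agents else None


def first_available_ua(providers):
    """Try each provider in order and return the first usable UA."""
    for provider in providers:
        try:
            ua = provider.get_random_ua()
        except Exception:
            # A broken provider should not stop the chain.
            ua = None
        if ua:
            return ua
    return None


providers = [FailingLookupProvider(), StaticListProvider(["UA-1", "UA-2"])]
print(first_available_ua(providers))
```

The order of the list is the priority order, so the default single-entry list keeps the old behaviour.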

0xfede7c8 commented 4 years ago

I think that is a good design. There could be another provider where you just specify a file with user agents, in a given format, and it picks from those at random. I can work on this if you want, to contribute, as I'm already using this lib in production.
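A file-backed provider along those lines might look roughly like the sketch below. The class name, the `get_random_ua` method, and the file format (one User-Agent per line, with blank lines and `#` comments skipped) are all assumptions made for illustration; the thread does not specify a format.

```python
import random
import tempfile
from pathlib import Path


class FileUserAgentProvider:
    """Hypothetical provider: serves random UAs from a plain-text file."""

    def __init__(self, path):
        lines = Path(path).read_text(encoding="utf-8").splitlines()
        # Assumed format: one UA per line; blank lines and '#' comments skipped.
        self.user_agents = [
            line.strip()
            for line in lines
            if line.strip() and not line.lstrip().startswith("#")
        ]

    def get_random_ua(self):
        # None signals "nothing available" so a caller can try another provider.
        return random.choice(self.user_agents) if self.user_agents else None


# Usage with a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    ua_file = Path(tmp) / "user_agents.txt"
    ua_file.write_text(
        "# my user agents\nExampleBot/1.0\n\nExampleBot/2.0\n", encoding="utf-8"
    )
    provider = FileUserAgentProvider(ua_file)

print(provider.get_random_ua())
```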

0xfede7c8 commented 4 years ago

Question: the FAKEUSERAGENT_FALLBACK option can be deprecated, right?

Edit: OK, I see that the fallback is a fake_useragent parameter. We can have that parameter for every provider.

alecxe commented 4 years ago

> I can work on this if you want, to contribute, as I'm already using this lib in production.

That would be awesome, appreciate it! 👍

0xfede7c8 commented 4 years ago

Almost have it. Will send PR when ready.

Another change I made to support this new structure is to add a RANDOM_UA_TYPE setting for each provider:

This is because each provider can internally have different flags for choosing which UA to use.

I also added a FixedUserAgent provider, which just provides a single UA: the default one, taken from Scrapy's USER_AGENT setting.
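As a rough illustration of the fixed provider idea, the sketch below reads USER_AGENT from a plain dict standing in for Scrapy's settings object. The class body and the `get_random_ua` method name are assumptions; the code in the actual pull request may differ.

```python
class FixedUserAgentProvider:
    """Hypothetical sketch: always returns the configured USER_AGENT."""

    def __init__(self, settings):
        # `settings` is any mapping with a .get() method; a dict is used
        # here in place of Scrapy's settings object.
        self.user_agent = settings.get("USER_AGENT")

    def get_random_ua(self):
        # Not random at all: the same UA every time, or None if unset.
        return self.user_agent or None


provider = FixedUserAgentProvider({"USER_AGENT": "my-crawler/1.0"})
print(provider.get_random_ua())
```

Placed last in a provider list, something like this would act as a final fallback that never depends on a remote site.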

alecxe commented 4 years ago

@0xfede7c8 great work with the pull request. I've left a few comments, but I will take a closer look once there are tests. Thanks a lot for the work and lemme know if you need any help.

alecxe commented 4 years ago

Just uploaded 1.3.0 to PyPI. @0xfede7c8 thanks a lot for all the work you've put into this!