Closed 0xfede7c8 closed 4 years ago
There's no response from contributors. What should we do? This lib is very helpful.
The problem is in fact in the underlying lib (fake-useragent). I filed an issue there also.
Yeah, this has pretty much been a ticking time bomb all this time because the lib depends on useragentstring.com being available.
One workaround here is to set the FAKEUSERAGENT_FALLBACK setting, which gives the plugin a user agent to fall back to.
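For anyone landing here, a minimal sketch of that workaround in a project's settings.py. The user agent string is only an illustrative value (an assumption, not something from this thread); use whichever UA you want requests to carry while the lookup is failing.

```python
# settings.py (sketch)
# FAKEUSERAGENT_FALLBACK is the setting named above; the value is an
# example string chosen for illustration, not a recommended default.
FAKEUSERAGENT_FALLBACK = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) "
    "Gecko/20100101 Firefox/74.0"
)
```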
This scrapy plugin is itself a very thin layer on top of fake-useragent, and we can try having a more "dynamic" fallback - e.g. we can start using faker and its User-Agent provider: https://hexdocs.pm/faker/Faker.Internet.UserAgent.html
Thoughts?
Isn't faker a golang library? Which database is faker using?
Oh, sorry, wrong link. Here is the python-based one I meant: https://faker.readthedocs.io/en/stable/providers/faker.providers.user_agent.html?highlight=user_agent#faker.providers.user_agent.Provider.user_agent
It generates strings rather than looking them up like fake-useragent does, which, depending on the use case, could be crucial.
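To make the lookup-versus-generation distinction concrete, here is a toy sketch using only the standard library. It is not how faker or fake-useragent work internally; the lists and function names are hypothetical and exist only to show that a generator can emit combinations that were never observed in the wild.

```python
import random

# Lookup: choose among strings that were recorded as-is (sample values).
OBSERVED = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0",
]

# Generation: assemble a string from independent parts (sample values).
PLATFORMS = [
    "Windows NT 10.0; Win64; x64",
    "X11; Linux x86_64",
    "Macintosh; Intel Mac OS X 10.15",
]
VERSIONS = ["68.0", "74.0"]


def lookup_ua(rng):
    """Return one of the recorded user agents."""
    return rng.choice(OBSERVED)


def generate_ua(rng):
    """Build a Firefox-style user agent from randomly chosen parts."""
    platform = rng.choice(PLATFORMS)
    version = rng.choice(VERSIONS)
    return f"Mozilla/5.0 ({platform}; rv:{version}) Gecko/20100101 Firefox/{version}"


rng = random.Random()
looked_up = lookup_ua(rng)
generated = generate_ua(rng)
```

With these sample lists the generator can produce six distinct strings while the lookup can only return two, which is why the choice may matter for sites that check for plausible real-world user agents.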
I am leaning towards supporting both use cases in this plugin. We could have a setting that specifies a list of User-Agent providers, which by default would contain only fake-useragent. But one could specify a second one in case the first failed, and it could be either fake-useragent or a custom one. E.g.:
FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakeUserAgent',
    'scrapy_fake_useragent.providers.Faker',
    'mypackage.providers.CustomProvider'
]
# default one is this (for backwards-compatibility)
# FAKEUSERAGENT_PROVIDERS = [
#     'scrapy_fake_useragent.providers.FakeUserAgent'
# ]
Something like this and, with proper documentation, we would never have to talk about this again :)
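A rough sketch of how the middleware could walk such a provider list: try each provider in order and use the first one that returns a user agent. The class names and the get_random_ua method are hypothetical stand-ins, not the plugin's actual API, and the providers are passed as classes here instead of dotted paths to keep the example self-contained.

```python
import random


class AlwaysFailingProvider:
    """Hypothetical provider whose data source is unreachable."""

    def get_random_ua(self):
        raise RuntimeError("user-agent source unavailable")


class StaticListProvider:
    """Hypothetical provider that works offline from a built-in list."""

    USER_AGENTS = [
        "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
    ]

    def get_random_ua(self):
        return random.choice(self.USER_AGENTS)


def first_working_ua(provider_classes):
    """Return a UA from the first provider that succeeds, or None if all fail.

    In a real plugin the dotted paths from the setting would be resolved
    to classes before reaching this point.
    """
    for provider_cls in provider_classes:
        try:
            return provider_cls().get_random_ua()
        except Exception:
            # This provider is unusable right now; move on to the next one.
            continue
    return None


ua = first_working_ua([AlwaysFailingProvider, StaticListProvider])
```

Ordering the list is what gives backwards compatibility: with only the first provider configured the behavior is unchanged, and every extra entry is a fallback.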
I think that is a good design. There could be another provider where you just specify a file of user agents, in a given format, and it picks from them at random. I can work on this if you want, to contribute, as I'm already using this lib in production.
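The file-based provider idea could look roughly like this. The format is an assumption made for the sketch (one user agent per line, blank lines and lines starting with # ignored), and the class and method names are hypothetical.

```python
import random
import tempfile
from pathlib import Path


class FileUserAgentProvider:
    """Hypothetical provider that serves random user agents from a text file."""

    def __init__(self, path):
        lines = Path(path).read_text(encoding="utf-8").splitlines()
        # Keep non-empty lines that are not comments.
        self.user_agents = [
            line.strip()
            for line in lines
            if line.strip() and not line.lstrip().startswith("#")
        ]
        if not self.user_agents:
            raise ValueError(f"no user agents found in {path}")

    def get_random_ua(self):
        return random.choice(self.user_agents)


# Usage: write a small sample file, then draw from it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "user_agents.txt"
    sample.write_text("# my list\nUA-one\n\nUA-two\n", encoding="utf-8")
    provider = FileUserAgentProvider(sample)
    chosen = provider.get_random_ua()
```

Loading the file once in the constructor keeps per-request cost to a single random.choice call.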
Question: the FAKEUSERAGENT_FALLBACK option can be deprecated, right?
Edit: OK, I see that fallback is a fake_useragent parameter. We can have that parameter for every provider.
I can work on this if you want, to contribute, as I'm already using this lib in production.
That would be awesome, appreciate it! 👍
Almost have it. Will send PR when ready.
Another change I made to support this new structure is to add a RANDOM_UA_TYPE setting for each provider.
This is because each provider can internally have different flags for choosing which UA to use.
I also added a FixedUserAgent provider, which just provides one UA, the default one, selected with Scrapy's USER_AGENT setting.
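A sketch of both ideas, assuming a plain dict in place of Scrapy's settings object. The setting-name prefixes, the "random" default, and the helper ua_type_for are hypothetical illustrations, not the names used in the pull request.

```python
class FixedUserAgentProvider:
    """Hypothetical provider that always returns the configured USER_AGENT."""

    def __init__(self, settings):
        self._ua = settings.get("USER_AGENT")

    def get_random_ua(self):
        # Nothing random here: the same UA is returned on every call.
        return self._ua


def ua_type_for(settings, provider_prefix, default="random"):
    """Read a per-provider UA type, e.g. '<PREFIX>_RANDOM_UA_TYPE'.

    The naming scheme is an assumption made for this sketch.
    """
    return settings.get(f"{provider_prefix}_RANDOM_UA_TYPE", default)


# A plain dict stands in for the Scrapy settings object.
settings = {
    "USER_AGENT": "my-crawler/1.0 (+https://example.com)",
    "FAKE_USERAGENT_RANDOM_UA_TYPE": "firefox",
}

fixed_ua = FixedUserAgentProvider(settings).get_random_ua()
fake_useragent_type = ua_type_for(settings, "FAKE_USERAGENT")
faker_type = ua_type_for(settings, "FAKER")  # not set, so the default applies
```

Keeping the type setting per provider means a flag that only makes sense for one provider never has to be interpreted by another, and a fixed provider at the end of the list guarantees that some UA is always available.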
@0xfede7c8 great work with the pull request. I've left a few comments but would take a closer look when there are tests. Thanks a lot for the work and lemme know if you need any help.
Just uploaded 1.3.0 to PyPI. @0xfede7c8 thanks a lot for all the work you've put into this!
It is failing to fetch it because the site seems to be down or not working properly. Consider removing it from the list or replacing it with another list.
This problem renders the tools useless.