vaulstein closed this issue 4 years ago
@vaulstein I actually like the idea and I don't think there is an easy way to do that currently!
What do you think would be the best way to approach it in the most generic fashion?
Simply add something like a RANDOM_UA_DESKTOP_ONLY boolean configuration setting? Or maybe provide a way to filter the "User-Agent" values through, say, a RANDOM_UA_FILTER_FUNCTION custom user-defined function?
Thanks.
@alecxe Adding something like a RANDOM_UA_DESKTOP_ONLY boolean configuration setting seems a good option to keep it generic.
Filtering only the desktop user agents from the fake_useragent module seems like a tough task though, since it fetches only the popular browser names from W3Schools and then, for those browsers, fetches their user agents from User-agent-string.
Since all the device links are listed together, I think regex filtering is a likely option, or querying User-agent-string itself for each user-agent string to get the operating system, which would mean a lot of requests.
I'll see what I can do.
Thanks.
I'm confronted with the same problem: some pages render differently for mobile devices. My thought is that, depending on the site, we can get away with specifying a different browser. Do you have statistics on which parts of the UA sites take into consideration? I think I have the same iPad string; it has "Mobile/" in it, which we can start looking at.
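As a rough starting point for the "Mobile/" idea, a token check could look like the sketch below. The token list and the name looks_like_desktop are assumptions made up for illustration, not anything taken from fake-useragent, and a given site may well key on other parts of the UA.

```python
import re

# Hypothetical token list: substrings that commonly show up in phone/tablet UAs.
_MOBILE_TOKENS = re.compile(
    r"Mobile/|Mobile Safari|Android|iPhone|iPad|iPod", re.IGNORECASE
)


def looks_like_desktop(user_agent: str) -> bool:
    """Return True when none of the mobile/tablet tokens occur in the UA string."""
    return _MOBILE_TOKENS.search(user_agent) is None
```

This is only a heuristic: it cannot tell a desktop browser from a bot or a smart TV, so a real parser is likely the safer choice in the long run.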
I've started with RANDOM_UA_TYPE (defaults to random), which is passed verbatim via:
getattr(self.ua, self.ua_type)
Maybe setting it to, say, "ie" will help, @vaulstein?
I have a patch, it's trivial, and I can maybe even put together a pull request (I'm very new to git).
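To make the getattr idea concrete, here is a minimal sketch of how such a setting could be wired up. UserAgentPicker and StubSource are hypothetical names used only for illustration; the stub stands in for fake_useragent.UserAgent so the snippet runs without network access, and this is not the actual patch.

```python
class UserAgentPicker:
    """Looks up a RANDOM_UA_TYPE-style value as an attribute on a UA source."""

    def __init__(self, ua_source, ua_type="random"):
        self.ua = ua_source     # in the middleware: a fake_useragent.UserAgent()
        self.ua_type = ua_type  # in the middleware: the RANDOM_UA_TYPE setting

    def pick(self):
        # The value is passed verbatim, so "random", "ie", "firefox", ...
        # are all resolved by the same attribute lookup.
        return getattr(self.ua, self.ua_type)


class StubSource:
    """Stand-in for fake_useragent.UserAgent (assumption, for offline use)."""

    random = "stub-random-agent"
    ie = "stub-ie-agent"


picker = UserAgentPicker(StubSource(), ua_type="ie")
```

Because the setting is only an attribute name, any browser type the underlying object understands works without further changes on the middleware side.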
@medse @vaulstein how about we start with the RANDOM_UA_DESKTOP_ONLY setting that will, under the hood, use the python-user-agents User-Agent parser and check the is_pc attribute value?
That will probably introduce some overhead when retrying to get another user agent string from fake-useragent. I'll put this through some tests to have some stats and numbers.
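The retry approach could be sketched roughly as follows. pick_desktop_ua and the max_tries parameter are hypothetical; generate and is_desktop are passed in so the loop can be tried without either library installed. With python-user-agents available, is_desktop could be something like lambda ua: user_agents.parse(ua).is_pc.

```python
def pick_desktop_ua(generate, is_desktop, max_tries=20):
    """Call generate() until is_desktop(ua) is true; give up after max_tries.

    Returns the accepted string together with the number of attempts, which
    makes it easy to collect numbers on the retry overhead.
    """
    for attempt in range(1, max_tries + 1):
        ua = generate()
        if is_desktop(ua):
            return ua, attempt
    raise RuntimeError(f"no desktop user agent found in {max_tries} tries")
```

The attempt counter is there purely to measure how often a retry is needed; the cap keeps a badly skewed source from looping forever.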
@alecxe, RANDOM_UA_TYPE is more general; a 'desktop' value may be added there. It won't interfere with the existing set (browser types), and we'll still have the possibility to set the preferred browser. For example, I was scraping one tough site and they stopped accepting Firefox user agents at some point :)
The overhead is infinitesimal in most use cases.
The idea with python-user-agents is very good, but it should rather be proposed to @hellysmile for inclusion in the underlying module. Architecturally, it belongs there.
@medse @vaulstein @alecxe Hey! I am fine with new attrs like random_desktop or random_mobile, with checking via python-user-agents. PR is welcome!
Is it available now? Restricting it to desktop user agents only?
@cangokalp not yet, I'll get back to the scrapy-user-agent soon and go over the latest feature requests. Thanks!
@cangokalp, I can do it too, but I'm waiting for the scrapy-fake-useragent to be merged into the main fake-useragent package, this feature must reside in the main package.
Any progress on this front? TwitterScraper won't work unless the request is from desktop only user-agent.
hey @medse, @alecxe,
I'm running into the same issue. Being able to target User-Agents more precisely would be a great addition.
How about adding RANDOM_UA_DEVICE_TYPE, which could take a string or an iterable such as ['desktop', 'tablet', 'mobile']? By default it could be set to 'random' or None (which would act the same). This would allow more precision, e.g. "give me only Firefox desktop".
If python-user-agents is brought in as a dependency, it could make use of is_mobile, is_tablet or is_pc. I agree with medse: architecturally it should be in fake-useragent, if @hellysmile is OK with that. It would need to add something like UserAgent.get(browser=['ff', 'ie'], device=['desktop', 'tablet']).
Thoughts?
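A minimal sketch of how the proposed RANDOM_UA_DEVICE_TYPE value could be normalized and checked. The helper names and the DEVICE_FLAGS mapping are assumptions for illustration; the only thing taken from python-user-agents is that a parsed user agent exposes is_pc, is_tablet and is_mobile, and a SimpleNamespace stands in for a real parse result here.

```python
from types import SimpleNamespace

# Hypothetical mapping from setting values to python-user-agents flags.
DEVICE_FLAGS = {"desktop": "is_pc", "tablet": "is_tablet", "mobile": "is_mobile"}


def normalize_device_types(value):
    """Turn the setting into a set of device names.

    None and 'random' both mean "no restriction", i.e. every device type.
    """
    if value is None or value == "random":
        return set(DEVICE_FLAGS)
    names = {value} if isinstance(value, str) else set(value)
    unknown = names - set(DEVICE_FLAGS)
    if unknown:
        raise ValueError(f"unknown device type(s): {sorted(unknown)}")
    return names


def device_matches(parsed_ua, allowed):
    """parsed_ua is any object exposing is_pc / is_tablet / is_mobile booleans."""
    return any(getattr(parsed_ua, DEVICE_FLAGS[name]) for name in allowed)


# Stand-in for user_agents.parse(<some iPad UA string>).
ipad = SimpleNamespace(is_pc=False, is_tablet=True, is_mobile=False)
```

Accepting both a string and an iterable keeps the simple case ('desktop') short while still allowing combinations such as ['desktop', 'tablet'].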
In 1.3.0 we introduced custom providers, and there are now more options to address this issue.
Via Faker:
Set FAKEUSERAGENT_PROVIDERS = ['scrapy_fake_useragent.providers.FakerProvider'] in settings to use Faker as a provider.
Set FAKER_RANDOM_UA_TYPE = "chrome" or FAKER_RANDOM_UA_TYPE = "firefox", etc. (there is no generic desktop option for Faker though; see the reference).
Or, you could also create a custom provider and have the logic there. Then add your custom provider to FAKEUSERAGENT_PROVIDERS.
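For the custom-provider route, a desktop-only provider might look roughly like this. It is a sketch under assumptions: it presumes the middleware instantiates a provider with the settings object and then calls get_random_ua(), which is worth checking against the version you run. DesktopOnlyProvider and the hard-coded list are hypothetical, and the class deliberately does not import anything from scrapy_fake_useragent so that it runs on its own.

```python
import random

# Illustrative desktop strings only; replace with a list you maintain yourself.
DESKTOP_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.5 Safari/605.1.15",
]


class DesktopOnlyProvider:
    """Hypothetical provider that only ever returns desktop user agents.

    Assumed interface: Provider(settings) followed by get_random_ua().
    """

    def __init__(self, settings):
        self.settings = settings

    def get_random_ua(self):
        return random.choice(DESKTOP_USER_AGENTS)
```

The provider would then be listed in FAKEUSERAGENT_PROVIDERS by its dotted path, for example 'myproject.providers.DesktopOnlyProvider' (a made-up module path).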
I was using your middleware to generate fake user agents for every Scrapy request.
The problem is that the user agents are not limited to desktop devices, and for user agents like the one below (an iPad user agent), the XPath extraction fails.
This is probably happening because the XPaths for the fields that I want to extract may differ for an iPad. Is there a way to limit the user agent to desktop devices only?