alecxe / scrapy-fake-useragent

Random User-Agent middleware based on fake-useragent
MIT License

Restrict User-Agent to Desktop Devices #6

Closed vaulstein closed 4 years ago

vaulstein commented 7 years ago

I was using your middleware for generating fake user-agents with every scrapy request.

But the problem is that the user-agents are not limited to desktop devices, and for user-agents like the one below (an iPad user-agent) the XPath extraction fails:

User-agent string: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36 Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10

This is probably happening because the XPaths for the fields that I want to extract differ on an iPad. Is there a way to limit the user-agent to desktop devices only?

alecxe commented 7 years ago

@vaulstein I actually like the idea and I don't think there is an easy way to do that currently!

What do you think would be the best way to approach it in the most generic fashion?

Simply add something like a RANDOM_UA_DESKTOP_ONLY boolean configuration setting? Or maybe provide a way to filter the "User-Agent" through a custom user-defined function, say RANDOM_UA_FILTER_FUNCTION?

Thanks.

vaulstein commented 7 years ago

@alecxe Adding something like RANDOM_UA_DESKTOP_ONLY boolean configuration setting seems a good option to keep it generic.

Filtering only the desktop user-agents from the fake_useragent module seems like a tough task, though: it fetches only the popular browser names from W3Schools and then, for those browsers, fetches their user-agents from User-agent-string. Since the links for all devices are listed together, regex filtering seems the likeliest option. The alternative is querying User-agent-string itself to get the operating system for each string, which would mean a lot of requests.
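The regex idea above could look roughly like the sketch below. The token list and the name `is_desktop_ua` are illustrative assumptions, not part of fake_useragent, and a substring check like this will misclassify unusual strings.

```python
import re

# Tokens that usually mark a phone or tablet User-Agent.
# This list is an illustrative assumption, not an exhaustive one.
MOBILE_TOKENS = re.compile(
    r"Mobile|Android|iPhone|iPad|iPod|Windows Phone|Opera Mini",
    re.IGNORECASE,
)


def is_desktop_ua(user_agent: str) -> bool:
    """Return True when no mobile or tablet token appears in the string."""
    return MOBILE_TOKENS.search(user_agent) is None


samples = [
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/34.0.1847.116 Safari/537.36",
    "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) "
    "AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 "
    "Mobile/7B334b Safari/531.21.10",
]

# Keep only the strings that look like desktop browsers.
desktop_only = [ua for ua in samples if is_desktop_ua(ua)]
```

Here only the Windows/Chrome string survives the filter; the iPad string is dropped because it contains both "iPad" and "Mobile/".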

I'll see what I can do.

Thanks.

medse commented 7 years ago

I'm confronted with the same problem: some pages render differently for mobile devices. My thought is that, depending on the site, we can get away with specifying a different browser. Do you have statistics on which parts of the UA the sites take into consideration? I think I have the same iPad string; it contains "Mobile/", which is something we can start looking at.

I've started with RANDOM_UA_TYPE (defaults to random), which is passed verbatim via getattr(self.ua, self.ua_type). Maybe setting it to, say, "ie" will help, @vaulstein?
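To illustrate the getattr lookup described above, here is a minimal sketch. `StubUserAgent` is a hypothetical stand-in for the real fake_useragent object, and the strings in its pool are made-up examples; only the `getattr(ua, ua_type)` dispatch mirrors what the comment describes.

```python
from random import choice


class StubUserAgent:
    """Hypothetical stand-in for a fake_useragent-style object."""

    _pool = {
        "ie": ["Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)"],
        "firefox": ["Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"],
    }

    @property
    def ie(self) -> str:
        return choice(self._pool["ie"])

    @property
    def firefox(self) -> str:
        return choice(self._pool["firefox"])

    @property
    def random(self) -> str:
        # Draw from every browser group.
        return choice([ua for group in self._pool.values() for ua in group])


def pick(ua: StubUserAgent, ua_type: str = "random") -> str:
    """Resolve the configured type name to an attribute, as RANDOM_UA_TYPE does."""
    return getattr(ua, ua_type)


ie_string = pick(StubUserAgent(), "ie")
```

Because the setting is passed through verbatim, any attribute name the underlying object understands can be used, which is why a browser name such as "ie" narrows the pool.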

I have a patch; it's trivial, and I can maybe even put together a pull request (I'm very new to git).

alecxe commented 7 years ago

@medse @vaulstein how about we start with the RANDOM_UA_DESKTOP_ONLY setting, which would use the python-user-agents User-Agent parser under the hood and check the is_pc attribute value?

That will probably introduce some overhead when retrying to get another user-agent string from fake-useragent. I'll put this through some tests to get some stats and numbers.
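The retry overhead mentioned above can be bounded and measured with a loop like the following sketch. The name `draw_until` and the attempt cap are assumptions; the `accept` callable is a placeholder for whatever check is used (for example an is_pc test from python-user-agents), and `source` stands in for a call into fake-useragent.

```python
from typing import Callable, Optional, Tuple


def draw_until(
    source: Callable[[], str],
    accept: Callable[[str], bool],
    max_attempts: int = 20,
) -> Tuple[Optional[str], int]:
    """Draw candidates until one is accepted or the cap is reached.

    Returns the accepted string (or None) and the number of draws made,
    so the retry overhead can be logged or collected as a statistic.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = source()
        if accept(candidate):
            return candidate, attempt
    return None, max_attempts


# Deterministic demo: two rejected draws, then an accepted one.
pool = iter(["ipad-ua", "android-ua", "windows-desktop-ua"])
ua, attempts = draw_until(lambda: next(pool), lambda s: "desktop" in s)
```

Capping the attempts keeps a request from stalling when the pool holds few or no desktop strings; the caller then decides on a fallback.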

medse commented 7 years ago

@alecxe, RANDOM_UA_TYPE is more general; a 'desktop' value could be added there. It won't interfere with the existing set (browser types), and we'd still be able to set the preferred browser. For example, I was scraping one tough site and they stopped accepting Firefox at some point :)

The overhead is infinitesimal in most of the use cases.

The idea with python-user-agents is very good, but it should rather be proposed to @hellysmile for inclusion in the underlying module. Architecturally, it belongs there.

hellysmile commented 7 years ago

@medse @vaulstein @alecxe Hey! I am fine with new attr like random_desktop or random_mobile with checking via python-user-agents. PR is welcome!

cangokalp commented 7 years ago

Is it available now? Restricting it to desktop user agents only?

alecxe commented 7 years ago

@cangokalp not yet, I'll get back to the scrapy-user-agent soon and go over the latest feature requests. Thanks!

medse commented 7 years ago

@cangokalp, I can do it too, but I'm waiting for the scrapy-fake-useragent to be merged into the main fake-useragent package, this feature must reside in the main package.

kanihal commented 5 years ago

Any progress on this front? TwitterScraper won't work unless the request comes from a desktop-only user-agent.

mxdev88 commented 4 years ago

hey @medse, @alecxe,

I'm running into the same issue. Being able to target User-Agents more precisely would be a great addition.

How about adding RANDOM_UA_DEVICE_TYPE which could take a string or an iterable ['desktop', 'tablet', 'mobile']? By default it could be set to 'random' or None (which would act the same). This would allow more precision. e.g. Give me only firefox desktop.

If python-user-agents is brought in as a dependency, it could make use of is_mobile, is_tablet or is_pc. I agree with medse: architecturally it should be in fake-useragent, if @hellysmile is ok with that. It would need to add something like UserAgent.get(browser=['ff', 'ie'], device=['desktop', 'tablet']).
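A rough sketch of how such a combined browser/device filter might behave is shown below. Everything here is hypothetical: `UAEntry`, `get_user_agent`, and the pre-labelled pool are assumptions for illustration, and no such API exists in fake-useragent as far as this thread shows. In practice the device label would come from a parser rather than being stored by hand.

```python
import random
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Set, Union

Filter = Optional[Union[str, Iterable[str]]]


@dataclass(frozen=True)
class UAEntry:
    string: str
    browser: str  # e.g. "ff", "ie", "chrome"
    device: str   # "desktop", "tablet" or "mobile"


def _as_set(value: Filter) -> Optional[Set[str]]:
    """Normalise a string or iterable; None and 'random' mean no filter."""
    if value is None or value == "random":
        return None
    if isinstance(value, str):
        return {value}
    return set(value)


def get_user_agent(pool: Sequence[UAEntry], browser: Filter = None,
                   device: Filter = None) -> str:
    browsers, devices = _as_set(browser), _as_set(device)
    matches = [
        e.string for e in pool
        if (browsers is None or e.browser in browsers)
        and (devices is None or e.device in devices)
    ]
    if not matches:
        raise LookupError("no User-Agent matches the requested filters")
    return random.choice(matches)


POOL = [
    UAEntry("ff-desktop-ua", "ff", "desktop"),
    UAEntry("ff-mobile-ua", "ff", "mobile"),
    UAEntry("ie-desktop-ua", "ie", "desktop"),
]

firefox_desktop = get_user_agent(POOL, browser="ff", device=["desktop"])
```

Accepting either a string or an iterable keeps the simple case short ("give me only firefox desktop") while still allowing several device types at once.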

thoughts?

alecxe commented 4 years ago

In 1.3.0 we introduced custom providers, and there are now more options to address this issue.

Via faker:

Or, you could also create a custom provider and have the logic there. Then add your custom provider to FAKEUSERAGENT_PROVIDERS.
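As a minimal sketch of the custom-provider route, the class below serves User-Agents from a hand-picked desktop list. The interface is an assumption: it supposes a provider is constructed with the settings object and exposes a `get_random_ua()` method, and that a real provider would subclass the library's base provider class. The `DESKTOP_USER_AGENTS` setting and the dotted path in the comment are hypothetical names; check the project's README for the exact interface.

```python
import random

# Fallback list used when the (hypothetical) DESKTOP_USER_AGENTS setting is absent.
DEFAULT_DESKTOP_UAS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0",
]


class DesktopOnlyProvider:
    """Serve User-Agent strings from a desktop-only list.

    Assumed interface: built with a settings object that supports .get(),
    and asked for a string through get_random_ua().
    """

    def __init__(self, settings):
        self.settings = settings
        configured = settings.get("DESKTOP_USER_AGENTS")
        self.user_agents = list(configured or DEFAULT_DESKTOP_UAS)

    def get_random_ua(self) -> str:
        return random.choice(self.user_agents)


# settings.py (the dotted path is a placeholder for your own module):
# FAKEUSERAGENT_PROVIDERS = ["myproject.providers.DesktopOnlyProvider"]

provider = DesktopOnlyProvider({})
ua = provider.get_random_ua()
```

Keeping the list in a setting means the desktop pool can be curated per project without touching the provider code.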