Closed kkristof200 closed 3 years ago
By default, when you scrape data with the CLI tool, the user-agent is randomized to avoid blocking, so setting a custom user-agent doesn't make any sense.
I've seen the randomUa variable, but it only randomizes part of the Chrome version, which is not 'random enough' for my use case. I'm using the lib via the CLI from Python and already have a random UA in use in my Python env, so I've been looking for a way to inject that value into the lib.
I've seen that ua is a parameter in the constructor too, so it should only need to be added to the exported args list for the CLI.
What do you mean by "not 'random enough'"?
Randomizing the version is enough to avoid blocking.
The thing with the lib's random user-agent is that it only changes the Chrome version. More specifically, it randomizes the Chrome major version between 65 and 79 and appends the same minor version after it.
Problem nr. 1: It only changes the Chrome version, so the 'randomness' is limited to the Chrome major version (15 cases).
Problem nr. 2: The minor version is the same in each case, which can look suspicious.
If I understand the code correctly, these are all the possible outcomes of a random UA:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.4044.113 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.4044.113 Safari/537.36
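The list above can be reproduced with a few lines of JavaScript. This is only a sketch of the behavior as described in this thread, not the lib's actual code; the function and constant names (buildUserAgent, randomUserAgent, allPossibleUserAgents, FIXED_SUFFIX) are made up for illustration.

```javascript
// Hypothetical reconstruction of the behavior described in this thread.
// None of these names come from the library itself.
const MIN_MAJOR = 65;
const MAX_MAJOR = 79;
const FIXED_SUFFIX = '0.4044.113'; // the part that is identical in every outcome

// Build the user-agent string for a given Chrome major version.
function buildUserAgent(major) {
  return 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 ' +
    `(KHTML, like Gecko) Chrome/${major}.${FIXED_SUFFIX} Safari/537.36`;
}

// Pick one major version uniformly from the inclusive range 65..79.
// The rng parameter is only there so the choice can be made deterministic.
function randomUserAgent(rng = Math.random) {
  const span = MAX_MAJOR - MIN_MAJOR + 1; // 15 possible values
  const major = MIN_MAJOR + Math.floor(rng() * span);
  return buildUserAgent(major);
}

// Enumerate every string the randomizer can produce.
function allPossibleUserAgents() {
  const result = [];
  for (let major = MIN_MAJOR; major <= MAX_MAJOR; major++) {
    result.push(buildUserAgent(major));
  }
  return result;
}
```

Only the major version varies, so a site that compares the full version string against real Chrome releases could still single these requests out.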
I've never had a problem with these settings, and the scraper is used heavily every day by a lot of people. If you make a lot of requests and still get blocked, use a proxy.
OK, I will add this to the to-do list.
Thanks. As for the --ua param, it would only mean adding this to the bin/cli.js file, right?
ua: {
    default: null,
    type: 'string',
    describe: 'Pass a custom user-agent to use. This helps to prevent request blocking from the amazon side',
},
If that is the case, I can make a fork/PR so you only have to approve/publish it.
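To illustrate how such an option could sit next to the existing randomization, here is a minimal sketch. It assumes the parsed CLI arguments arrive as a plain object with a ua key that defaults to null, as in the snippet above; resolveUserAgent and the fallback callback are hypothetical names and are not part of the lib.

```javascript
// Hypothetical helper: prefer a user-agent passed on the command line,
// otherwise fall back to whatever randomizer is already in place.
// argv is assumed to be the parsed CLI arguments, e.g. { ua: null }.
function resolveUserAgent(argv, fallback) {
  // Treat null, non-string, and blank values as "no custom user-agent given".
  const custom = typeof argv.ua === 'string' ? argv.ua.trim() : '';
  return custom !== '' ? custom : fallback();
}

// Example: a custom value wins, a missing value triggers the fallback.
const fromCli = resolveUserAgent({ ua: 'MyCustomAgent/1.0' }, () => 'random-ua');
const fromDefault = resolveUserAgent({ ua: null }, () => 'random-ua');
```

With a default of null, existing users who never pass the option would keep the randomized behavior.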
--user-agent is available in the latest version
I've seen that a custom ua can be passed to the constructor, but I can't pass it via cli.