alecxe / scrapy-fake-useragent

Random User-Agent middleware based on fake-useragent
MIT License

docs clarification #23

Open · Ostapp opened 5 years ago

Ostapp commented 5 years ago

Does it assign a different UA to each request? Does it assign a different UA to each request retry?

vortexkd commented 4 years ago

For anyone who still wants the answer to this: yes, it assigns a new user agent to each request. You can see exactly how here: https://pypi.org/project/fake-useragent/

TL;DR: you can use the RANDOM_UA_TYPE setting (which defaults to random), and the middleware will generate a new user agent string for each request based on that setting.
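
For reference, a minimal settings.py sketch (assuming the middleware path documented in the project README; the priority number 400 is illustrative):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # disable Scrapy's built-in UA middleware so it doesn't overwrite ours
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

# 'random' (the default) picks any browser family for each request;
# values like 'chrome' or 'firefox' pin the browser family instead
RANDOM_UA_TYPE = 'random'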

i-chaochen commented 4 years ago

Thanks @alecxe for providing this great project.

For scrapy-proxies, I wonder what you mean by setting RANDOM_UA_PER_PROXY to true?

Usage with scrapy-proxies

To use this with a random-proxy middleware such as scrapy-proxies, you need to:

set RANDOM_UA_PER_PROXY to True to allow switching the UA per proxy
set the priority of RandomUserAgentMiddleware to be greater than scrapy-proxies, so that the proxy is set before the UA is handled

Do I need to first pip install scrapy_proxies and then add RANDOM_UA_PER_PROXY = True in my settings.py? Or is it already included, so that I can add RANDOM_UA_PER_PROXY = True directly?

Also, for the scrapy_proxies priority, do I need to add another DOWNLOADER_MIDDLEWARES entry for scrapy-proxies? I mean, there would be two entries in DOWNLOADER_MIDDLEWARES, and I would then just set the fake-useragent priorities larger than the scrapy-proxies ones, so I can have proxy + fake user agent together?

Because you mentioned that fake-useragent needs the built-in UserAgentMiddleware and RetryMiddleware turned off, while scrapy-proxies uses RetryMiddleware, I am confused about whether I should keep RetryMiddleware in DOWNLOADER_MIDDLEWARES or not. Thanks in advance!

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'

# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Use a custom proxy set in the settings
PROXY_MODE = 0

# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"
alecxe commented 4 years ago

@i-chaochen thank you for the kind words and the questions. Docs could definitely be better for this project, I agree.

Do I need to first pip install scrapy_proxies and then add RANDOM_UA_PER_PROXY = True in my settings.py? Or is it already included, so that I can add RANDOM_UA_PER_PROXY = True directly?

Yeah, scrapy_proxies is not listed in the project requirements, so you would need to install it separately.

Also, for the scrapy_proxies priority, do I need to add another DOWNLOADER_MIDDLEWARES entry for scrapy-proxies? I mean, there would be two entries in DOWNLOADER_MIDDLEWARES, and I would then just set the fake-useragent priorities larger than the scrapy-proxies ones, so I can have proxy + fake user agent together?

It seems so, though I have not used this combination of scrapy-fake-useragent and scrapy-proxies myself. I'd say do some experimentation with the middlewares setup while logging the proxies and headers.
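
If it helps, a starting point could look something like the sketch below. This is untested on my side and the priority numbers are illustrative; the point is only that RandomUserAgentMiddleware gets a higher number than scrapy_proxies.RandomProxy, so the proxy is set first:

# settings.py -- untested sketch; install both packages first:
#   pip install scrapy_proxies scrapy-fake-useragent
DOWNLOADER_MIDDLEWARES = {
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # built-in UA middleware off; the random-UA middleware gets a higher
    # number so its process_request runs after the proxy is already set
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

RANDOM_UA_PER_PROXY = True
PROXY_LIST = '/path/to/proxy/list.txt'
PROXY_MODE = 0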

Hope that helps.

i-chaochen commented 4 years ago

@alecxe Thanks. After reading your code and a couple of tries, I think I figured it out and tested it OK:

  1. Set RANDOM_UA_PER_PROXY = True in settings.py.
  2. In the spider file, pass the proxy on each request: scrapy.Request(url, meta={'proxy': your_proxy_address})

But just remember: if we set RANDOM_UA_PER_PROXY = True, the UA is fixed per proxy address rather than per request; it is only re-randomized when the proxy changes.
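
For reference, a minimal sketch of the spider side (the URL and proxy address are placeholders):

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        # pin the proxy explicitly; with RANDOM_UA_PER_PROXY = True the
        # UA stays the same for all requests that go through this proxy
        yield scrapy.Request(
            'https://example.com',
            meta={'proxy': 'http://host1:port'},
        )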