alecxe / scrapy-fake-useragent

Random User-Agent middleware based on fake-useragent
MIT License
686 stars 98 forks

scrapy-fake-useragent and cfscrape Cloudflare anti-bot library #9

Closed reyman closed 7 years ago

reyman commented 7 years ago

Hi, this is more a question than an issue, I suppose, but perhaps you can help me. I'm trying to build a scraper that combines your extension with cfscrape, Privoxy, and scrapy_fake_useragent. I'm using the cfscrape Python library to bypass Cloudflare protection with Scrapy.

To collect the cookie needed by cfscrape, I have to redefine the start_requests method in my spider class, like this:

    def start_requests(self):
        cf_requests = []
        for url in self.start_urls:
            token, agent = cfscrape.get_tokens(url)
            self.logger.info("agent = %s", agent)
            cf_requests.append(scrapy.Request(url=url,
                                              cookies=token,
                                              headers={'User-Agent': agent}))
        return cf_requests

My problem is that the user agent collected by start_requests is not the same as the user agent randomly selected by scrapy_fake_useragent, as you can see:

2017-01-11 11:52:55 [airports] INFO: agent = Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
2017-01-11 11:52:55 [scrapy.core.engine] INFO: Spider opened
2017-01-11 11:52:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-11 11:52:55 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-11 11:52:55 [scrapy_fake_useragent.middleware] DEBUG: Assign User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36 to Proxy None

I defined my downloader middlewares in this order:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'flight_project.middlewares.ProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

I need the same user agent everywhere, so how can I pass the user agent generated by scrapy_fake_useragent into the start_requests method?

alecxe commented 7 years ago

@reyman Hi! Good question!

The problem is that the middleware will not re-assign a user agent if one is set explicitly, as in your example.

If you want to let the middleware set a random user agent, just don't set the User-Agent header:

def start_requests(self):
    cf_requests = []
    for url in self.start_urls:
        token, agent = cfscrape.get_tokens(url)
        self.logger.info("agent = %s", agent)
        cf_requests.append(scrapy.Request(url=url, cookies=token))
    return cf_requests

Hope I'm understanding the problem correctly. Thanks.
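
To illustrate why the explicit header wins, here is a minimal sketch, assuming the middleware uses setdefault-style assignment (plain dicts stand in for Scrapy's request headers, and assign_user_agent is a hypothetical helper, not the middleware's real API):

```python
# Minimal sketch of setdefault-style User-Agent assignment. Plain dicts
# stand in for Scrapy's request headers; this is not the middleware's code.
def assign_user_agent(headers, random_ua):
    # Only fill in the header when it is absent, so an explicitly set
    # User-Agent (e.g. the one returned by cfscrape) is left untouched.
    headers.setdefault('User-Agent', random_ua)
    return headers

explicit = assign_user_agent({'User-Agent': 'cf-agent'}, 'random-agent')
missing = assign_user_agent({}, 'random-agent')
print(explicit['User-Agent'])  # cf-agent: the explicit value survives
print(missing['User-Agent'])   # random-agent: the random value fills the gap
```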

reyman commented 7 years ago

@alecxe Thanks for your answer. I tried that, but the User-Agent values still differ; I think my question was not as clear as I thought.

My problem is that cfscrape picks a random user agent from a limited list hard-coded in its source (see here) if no user agent is provided when I run cfscrape.get_tokens(url).

So the only way I see is to get the random User-Agent generated by your middleware and inject it into cfscrape.get_tokens(). That method makes a first request to the URL to solve the Cloudflare anti-bot challenge, and then returns a cookie that authorizes the subsequent requests.

But I suppose it is not possible to get the User-Agent (say, ua) generated by your middleware before start_requests(self) runs cfscrape.get_tokens(url, ua)?

alecxe commented 7 years ago

@reyman Gotcha. Well, not the most beautiful solution, but as a workaround you can generate a random User-Agent directly with fake-useragent:

from fake_useragent import UserAgent

ua = UserAgent()
user_agent = ua.random

Please let me know if this is good enough. Thanks.

reyman commented 7 years ago

Yeah, it works like that, thanks :+1:

    from fake_useragent import UserAgent
    ua = UserAgent()
    ...
    def start_requests(self):
        cf_requests = []
        user_agent = self.ua.random
        self.logger.info("RANDOM user_agent = %s", user_agent)
        for url in self.start_urls:
            token, agent = cfscrape.get_tokens(url, user_agent)
            self.logger.info("token = %s", token)
            self.logger.info("agent = %s", agent)

            cf_requests.append(scrapy.Request(url=url,
                                              cookies=token,
                                              headers={'User-Agent': agent}))
        return cf_requests