Anorov / cloudflare-scrape

A Python module to bypass Cloudflare's anti-bot page.
MIT License

cloudflare #371

Open · rnrnstar2 opened this issue 4 years ago

rnrnstar2 commented 4 years ago

Before creating an issue, first upgrade cfscrape with pip install -U cfscrape and see if you're still experiencing the problem. Please also confirm your Node version (node --version or nodejs --version) is version 10 or higher.
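
For reference, the same Node version check can be scripted from Python. This is only a convenience sketch, not part of the template, and it assumes the node binary is on your PATH:

# Sketch: confirm the Node.js that cfscrape shells out to is version 10 or higher.
# Assumes "node" is on PATH; use "nodejs" instead on Debian-style systems.
import subprocess

out = subprocess.run(["node", "--version"], capture_output=True, text=True, check=True)
major = int(out.stdout.strip().lstrip("v").split(".")[0])
print(out.stdout.strip(), "is OK" if major >= 10 else "is too old for cfscrape")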

Make sure the website you're having issues with is actually using anti-bot protection by Cloudflare and not a competitor like Imperva Incapsula or Sucuri. And if you're using an anonymizing proxy, a VPN, or Tor, Cloudflare often flags those IPs and may block you or present you with a captcha as a result.
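
One quick way to confirm that (a suggestion, not part of the template) is to inspect the response headers of the blocked page: Cloudflare-fronted sites normally answer with a Server: cloudflare header and a CF-RAY header, whereas Incapsula and Sucuri identify themselves differently.

# Sketch: check whether the anti-bot page is actually served by Cloudflare.
# The URL is a placeholder; replace it with the page you are being blocked on.
import requests

resp = requests.get("https://www.example.com/", timeout=30)
print(resp.status_code, resp.headers.get("Server", ""), "CF-RAY" in resp.headers)
# A 503 with Server: cloudflare and a CF-RAY header points to the Cloudflare IUAM challenge.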

Please confirm the following statements and check the boxes before creating an issue:

Python version number

Run python --version and paste the output below:

cfscrape version number

Run pip show cfscrape and paste the output below:

Code snippet involved with the issue

2020-06-16 18:42:03 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scraping)
2020-06-16 18:42:03 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.7 (default, May  6 2020, 04:59:01) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Darwin-19.5.0-x86_64-i386-64bit
2020-06-16 18:42:03 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'scraping', 'CONCURRENT_REQUESTS': 32, 'CONCURRENT_REQUESTS_PER_DOMAIN': 32, 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 2, 'DOWNLOAD_TIMEOUT': 600, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'FEED_FORMAT': 'csv', 'FEED_URI': 'results/%(name)s_%(time)s.csv', 'HTTPCACHE_ENABLED': True, 'HTTPCACHE_EXPIRATION_SECS': 43200, 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'NEWSPIDER_MODULE': 'scraping.spiders', 'SPIDER_MODULES': ['scraping.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}
2020-06-16 18:42:03 [scrapy.extensions.telnet] INFO: Telnet Password: e179fe629b29425b
2020-06-16 18:42:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
>>>>>>>>>>>>>>>>>__init__(MODES)<<<<<<<<<<<<<<<<<
2020-06-16 18:42:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy_crawlera.CrawleraMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2020-06-16 18:42:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-06-16 18:42:03 [scrapy.middleware] INFO: Enabled item pipelines:
['scraping.pipelines.ScrapingPipeline']
2020-06-16 18:42:03 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.modes.com:443
2020-06-16 18:42:04 [urllib3.connectionpool] DEBUG: https://www.modes.com:443 "GET /jp/shopping/woman HTTP/1.1" 503 None
Unhandled error in Deferred:
2020-06-16 18:42:04 [twisted] CRITICAL: Unhandled error in Deferred:

Traceback (most recent call last):
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/scrapy/crawler.py", line 172, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/scrapy/crawler.py", line 176, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/scrapy/crawler.py", line 81, in crawl
    start_requests = iter(self.spider.start_requests())
  File "/Users/rnrnstar/github/Spiders/scraping/spiders/modes.py", line 41, in start_requests
    data = scraper.get("https://www.modes.com/jp/shopping/woman").content
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/cfscrape/__init__.py", line 129, in request
    resp = self.solve_cf_challenge(resp, **kwargs)
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/cfscrape/__init__.py", line 207, in solve_cf_challenge
    answer, delay = self.solve_challenge(body, domain)
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/cfscrape/__init__.py", line 299, in solve_challenge
    % BUG_REPORT
builtins.ValueError: Unable to identify Cloudflare IUAM Javascript on website. Cloudflare may have changed their technique, or there may be a bug in the script.

Please read https://github.com/Anorov/cloudflare-scrape#updates, then file a bug report at https://github.com/Anorov/cloudflare-scrape/issues."

2020-06-16 18:42:04 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/cfscrape/__init__.py", line 259, in solve_challenge
    javascript, flags=re.S
AttributeError: 'NoneType' object has no attribute 'groups'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/scrapy/crawler.py", line 81, in crawl
    start_requests = iter(self.spider.start_requests())
  File "/Users/rnrnstar/github/Spiders/scraping/spiders/modes.py", line 41, in start_requests
    data = scraper.get("https://www.modes.com/jp/shopping/woman").content
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/cfscrape/__init__.py", line 129, in request
    resp = self.solve_cf_challenge(resp, **kwargs)
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/cfscrape/__init__.py", line 207, in solve_cf_challenge
    answer, delay = self.solve_challenge(body, domain)
  File "/Users/rnrnstar/opt/anaconda3/envs/python_modules/lib/python3.7/site-packages/cfscrape/__init__.py", line 299, in solve_challenge
    % BUG_REPORT
ValueError: Unable to identify Cloudflare IUAM Javascript on website. Cloudflare may have changed their technique, or there may be a bug in the script.

Please read https://github.com/Anorov/cloudflare-scrape#updates, then file a bug report at https://github.com/Anorov/cloudflare-scrape/issues."
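
The spider code itself is not included above, but the traceback shows the call that triggers the error: the scraper.get(...) at modes.py line 41 inside start_requests. A minimal reconstruction is sketched below; only the scraper.get("https://www.modes.com/jp/shopping/woman") line and its location come from the traceback, while the create_scraper() call and the spider skeleton are assumptions about how the code is probably structured:

# Hypothetical reconstruction of scraping/spiders/modes.py around line 41.
import cfscrape
import scrapy

class ModesSpider(scrapy.Spider):
    name = "modes"

    def start_requests(self):
        # cfscrape wraps a requests.Session, so this GET runs outside Scrapy's
        # downloader; it receives the 503 challenge page, and solve_challenge()
        # then fails to locate the IUAM JavaScript, raising the ValueError shown above.
        scraper = cfscrape.create_scraper()
        data = scraper.get("https://www.modes.com/jp/shopping/woman").content
        # ... presumably parse `data` or reuse the solved cookies for later requests
        yield scrapy.Request("https://www.modes.com/jp/shopping/woman", callback=self.parse)

    def parse(self, response):
        pass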

Complete exception and traceback

(If the problem doesn't involve an exception being raised, leave this blank)

URL of the Cloudflare-protected page

[LINK GOES HERE]

URL of Pastebin/Gist with HTML source of protected page

[LINK GOES HERE]

Sraq-Zit commented 4 years ago

Try #373. I tested it with your link and it worked.