Ehsan-U / scrapy-nodriver

Nodriver integration for Scrapy

Mixed Status Codes in Request and Response Objects #2

Closed ThinksFast closed 2 months ago

ThinksFast commented 2 months ago

I made a test spider to see how Nodriver renders JavaScript content, and I'm seeing a strange issue: the download handler logs a 403 status code for the original response, but the response object passed to my callback has a 200 status code, and the HTML is raw, not rendered by Chrome.

Here is the test scraper I wrote:

import scrapy
from scrapy.http import Response

USER_AGENT: str = "Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/128.0.6613.18 Mobile/15E148 Safari/604.1"

class RenderSpider(scrapy.Spider):
    name = "render_nodriver_test"

    def __init__(self, *args, **kwargs):
        super(RenderSpider, self).__init__(*args, **kwargs)

    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_nodriver.handler.ScrapyNodriverDownloadHandler",
            "https": "scrapy_nodriver.handler.ScrapyNodriverDownloadHandler",
        },
        "LOG_LEVEL": "DEBUG",
        "REACTOR_THREADPOOL_MAXSIZE": 20,
        "RETRY_EXCEPTIONS": [
            "twisted.internet.error.TimeoutError",
            "twisted.internet.defer.TimeoutError",
        ],
        "SCHEDULER_PRIORITY_QUEUE": "scrapy.pqueues.DownloaderAwarePriorityQueue",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "USER_AGENT": USER_AGENT,
    }

    def start_requests(self):
        start_urls = ["https://www.nodeposit365.com/casinos/bc-game-casino/"]

        for url in start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                # errback is a Request argument, not a meta key
                errback=self.errback,
                meta={"nodriver": True},
            )

    async def parse(self, resp: Response):
        self.logger.info(f"Request URL: {resp.request.url}")
        self.logger.info(f"Final URL: {resp.url}")
        self.logger.info(f"Status Code: {resp.status}")
        self.logger.info(f"HTML: {resp.css('*').get()}")

        yield {"url": resp.url}

    async def errback(self, failure):
        # logger.exception() expects an active exception context; log the Failure directly
        self.logger.error(f"⛔️ Errback Exception: {failure!r}")

And here is the log output from running scrapy crawl render_nodriver_test in my terminal:

2024-08-23 12:08:40 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scrapy_crawler)
2024-08-23 12:08:40 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:37:07) [Clang 15.0.7 ], pyOpenSSL 24.2.1 (OpenSSL 3.3.1 4 Jun 2024), cryptography 43.0.0, Platform macOS-14.6.1-arm64-arm-64bit
2024-08-23 12:08:40 [scrapy.addons] INFO: Enabled addons:
[]
2024-08-23 12:08:40 [asyncio] DEBUG: Using selector: KqueueSelector
2024-08-23 12:08:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-08-23 12:08:40 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-08-23 12:08:40 [scrapy.extensions.telnet] INFO: Telnet Password: 12fc74e4bb37fac4
2024-08-23 12:08:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2024-08-23 12:08:40 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_crawler',
 'NEWSPIDER_MODULE': 'src.app.lib.scraper.bots',
 'REACTOR_THREADPOOL_MAXSIZE': 20,
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'RETRY_EXCEPTIONS': ['twisted.internet.error.TimeoutError',
                      'twisted.internet.defer.TimeoutError'],
 'SCHEDULER_PRIORITY_QUEUE': 'scrapy.pqueues.DownloaderAwarePriorityQueue',
 'SPIDER_MODULES': ['src.app.lib.scraper.bots'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
 'USER_AGENT': 'Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) '
               'AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/128.0.6613.18 '
               'Mobile/15E148 Safari/604.1'}
2024-08-23 12:08:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-08-23 12:08:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-08-23 12:08:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-08-23 12:08:40 [scrapy.core.engine] INFO: Spider opened
2024-08-23 12:08:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-08-23 12:08:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-08-23 12:08:40 [scrapy-nodriver] INFO: Starting download handler
2024-08-23 12:08:40 [scrapy-nodriver] INFO: Starting download handler
2024-08-23 12:08:46 [scrapy-nodriver] DEBUG: New page created, page count is 1
2024-08-23 12:08:46 [scrapy-nodriver] DEBUG: Request: <GET https://www.nodeposit365.com/casinos/bc-game-casino/> (resource type: Document)
2024-08-23 12:08:47 [scrapy-nodriver] DEBUG: Response: <403 https://www.nodeposit365.com/casinos/bc-game-casino/>
2024-08-23 12:08:47 [scrapy-nodriver] DEBUG: Request: <GET https://ct.captcha-delivery.com/c.js> (resource type: Script, referrer: https://www.nodeposit365.com/)
2024-08-23 12:08:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nodeposit365.com/casinos/bc-game-casino/> (referer: None) ['nodriver']
2024-08-23 12:08:47 [render_nodriver_test] INFO: Request URL: https://www.nodeposit365.com/casinos/bc-game-casino/
2024-08-23 12:08:47 [render_nodriver_test] INFO: Final URL: https://www.nodeposit365.com/casinos/bc-game-casino/
2024-08-23 12:08:47 [render_nodriver_test] INFO: Status Code: 200
2024-08-23 12:08:47 [render_nodriver_test] INFO: HTML: <html><head><title>nodeposit365.com</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script data-cfasync="false">var dd={'rt':'c','cid':'AHrlqAAAAAMA1iakIM8IMDQAZibHBw==','hsh':'47FD5E27C3F4545A3AEA18602AAD93','t':'fe','s':16943,'e':'3083ae772e46b9639b244b3ea22836e3978b10719051504724e066a171ff8f78','host':'geo.captcha-delivery.com'}</script><script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script></body></html>
2024-08-23 12:08:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.nodeposit365.com/casinos/bc-game-casino/>
{'url': 'https://www.nodeposit365.com/casinos/bc-game-casino/'}
2024-08-23 12:08:47 [scrapy.core.engine] INFO: Closing spider (finished)
2024-08-23 12:08:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 351,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 903,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 6.944,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 8, 23, 16, 8, 47, 777989, tzinfo=datetime.timezone.utc),
 'item_scraped_count': 1,
 'log_count/DEBUG': 9,
 'log_count/INFO': 16,
 'memusage/max': 159973376,
 'memusage/startup': 159973376,
 'nodriver/page_count': 1,
 'nodriver/page_count/closed': 1,
 'nodriver/page_count/max_concurrent': 1,
 'nodriver/request_count': 2,
 'nodriver/request_count/resource_type/Document': 1,
 'nodriver/request_count/resource_type/Script': 1,
 'nodriver/response_count': 1,
 'nodriver/response_count/resource_type/Document': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2024, 8, 23, 16, 8, 40, 833989, tzinfo=datetime.timezone.utc)}
2024-08-23 12:08:47 [scrapy.core.engine] INFO: Spider closed (finished)
2024-08-23 12:08:47 [scrapy-nodriver] INFO: Closing download handler
2024-08-23 12:08:47 [scrapy-nodriver] INFO: Closing browser
2024-08-23 12:08:47 [scrapy-nodriver] INFO: Closing download handler
2024-08-23 12:08:47 [scrapy-nodriver] INFO: Closing browser

You can see that the download handler received a 403 status code, but a 200 is reported on the Response object. I'm also printing the HTML, which you can see is full of captcha references. The HTML is also not JavaScript-rendered.

I'm using a VPN for this request, which is probably why a captcha is getting triggered, but the mix of status codes and the un-rendered HTML seems like separate issues.

Did I configure the spider and no-driver settings correctly?
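
Because the status code alone was not reliable here, a callback can also inspect the body for the captcha interstitial. The following is a minimal sketch, not part of scrapy-nodriver: the function names are hypothetical, and the markers are simply the strings visible in the HTML printed in the log above, so other anti-bot pages would need different ones.

```python
# Assumption: markers copied from the HTML in the log output above.
CAPTCHA_MARKERS = (
    "captcha-delivery.com",
    "Please enable JS and disable any ad blocker",
)


def looks_like_captcha_page(html: str) -> bool:
    """Return True if the HTML contains any known captcha-interstitial marker."""
    return any(marker in html for marker in CAPTCHA_MARKERS)


def classify_response(status: int, html: str) -> str:
    """Label a response 'blocked' or 'ok' without trusting the status alone."""
    if status == 403 or looks_like_captcha_page(html):
        return "blocked"
    return "ok"
```

In the spider above, this could be called as `classify_response(resp.status, resp.text)` at the top of `parse`, so a captcha page reported as 200 is not yielded as a scraped item.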

Ehsan-U commented 2 months ago

@ThinksFast Thanks for bringing this to my attention! The issue has been fixed. #2bbe23c

ThinksFast commented 2 months ago

@Ehsan-U Thanks for the quick fixes 🙂. I can confirm that the status code is getting correctly passed to the final response object.

But it looks like the JavaScript on the page is not getting rendered, and for the URL in the example above, the request is still getting blocked, even when I am not on a VPN.

Is there anything I can change in the configuration to render the JS, or improve my chances of not getting captcha checks?
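
One side effect of the corrected status code, as far as Scrapy's defaults go: `HttpErrorMiddleware` (listed among the enabled spider middlewares in the log above) filters out non-2xx responses, so a real 403 would normally never reach `parse`. A sketch of a settings fragment that lets the 403 through for inspection, assuming it is merged into the spider's existing `custom_settings`:

```python
# Hypothetical fragment; merge these keys into the spider's custom_settings.
custom_settings = {
    # Let 403 responses reach the spider callback instead of being
    # dropped by HttpErrorMiddleware.
    "HTTPERROR_ALLOWED_CODES": [403],
}
```

The spider attribute `handle_httpstatus_list = [403]` should have the same effect for a single spider. Neither option changes whether the site serves a captcha; it only makes the blocked response visible to the callback.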

Ehsan-U commented 2 months ago

@ThinksFast JS is being rendered properly. The 403 (captcha) appears even when I manually open the website in Chrome, so it seems to be a consistent captcha implementation by the site, even for real users. [image]

Ehsan-U commented 2 months ago

@ThinksFast This package is essentially an integration of Nodriver with Scrapy. If you're experiencing issues related to bypassing certain kinds of captchas, please report them in the upstream Nodriver repository.

ThinksFast commented 2 months ago

@Ehsan-U Agreed on bypassing captchas, I'm sure no-driver won't get by all systems, but I'm hoping it's a lot better than Playwright.

But regarding the rendering of the HTML, when I print the HTML of the response object in the example code above, I get this in the logs:

<html>

<head>
    <title>nodeposit365.com</title>
    <style>
        #cmsg {
            animation: A 1.5s;
        }

        @keyframes A {
            0% {
                opacity: 0;
            }

            99% {
                opacity: 0;
            }

            100% {
                opacity: 1;
            }
        }
    </style>
</head>

<body style="margin:0">
    <p id="cmsg">Please enable JS and disable any ad blocker</p>
    <script
        data-cfasync="false">var dd = { 'rt': 'c', 'cid': 'AHrlqAAAAAMAGY6sUjlIluYA8DZhVQ==', 'hsh': '47FD5E27C3F4545A3AEA18602AAD93', 't': 'fe', 's': 16943, 'e': 'efdf718c9ab36b366ff60d84d929d6fc0e4986b45dc54b4c4ecba820da96d792', 'host': 'geo.captcha-delivery.com' }</script>
    <script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script>
</body>

</html>

But when I open the page in the Chrome browser and copy the HTML from the dev tools inspector, I get this:

<html>

<head>
    <title>nodeposit365.com</title>
    <style>
        #cmsg {
            animation: A 1.5s;
        }

        @keyframes A {
            0% {
                opacity: 0;
            }

            99% {
                opacity: 0;
            }

            100% {
                opacity: 1;
            }
        }
    </style>
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>

<body style="margin:0">
    <script
        data-cfasync="false">var dd = { 'rt': 'c', 'cid': 'AHrlqAAAAAMAizR_Iog0P0EAXoleBQ==', 'hsh': '47FD5E27C3F4545A3AEA18602AAD93', 't': 'fe', 's': 16943, 'e': '7c5c2cfb864f1a67075173025b3e05ff909bbf16202b7fb1e4284388bf5e45d7', 'host': 'geo.captcha-delivery.com' }</script>
    <script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script><iframe
        src="https://geo.captcha-delivery.com/captcha/?initialCid=AHrlqAAAAAMAizR_Iog0P0EAXoleBQ%3D%3D&amp;hash=47FD5E27C3F4545A3AEA18602AAD93&amp;cid=ThsqpJsRaY51M56Oa42aD9E4vVymnDcGm2ZR1DRQ764hJ5olzh4u_aC_SpvF7eJCyqElz77wdjvD44JJKacWpuEaXjLMzSPNERDo96zgywCGaxfMbab3EYL0LvdLDf0~&amp;t=fe&amp;referer=https%3A%2F%2Fwww.nodeposit365.com%2Fcasinos%2Fbc-game-casino%2F&amp;s=16943&amp;e=7c5c2cfb864f1a67075173025b3e05ff909bbf16202b7fb1e4284388bf5e45d7&amp;dm=cd"
        sandbox="allow-scripts allow-same-origin allow-forms" width="100%" height="100%" style="height:100vh;"
        frameborder="0" border="0" scrolling="yes"></iframe>
</body>

</html>

The code is different, but notably, you can see the code printed in the logs says Please enable JS and disable any ad blocker, while the code rendered in my real Chrome browser does not have that text. So I don't think I'm getting the rendered HTML, just the initial response.
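
The comparison above can be automated. This is a rough sketch using only the standard library, under the assumption that the two signals seen in these snapshots are representative: the `<p id="cmsg">` placeholder is present before the challenge script runs, and an `<iframe>` pointing at `captcha-delivery.com` is present after. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ChallengeProbe(HTMLParser):
    """Collect the two signals that differ between the snapshots above."""

    def __init__(self):
        super().__init__()
        self.has_placeholder = False  # <p id="cmsg"> is still in the DOM
        self.iframe_sources = []      # src of every <iframe> found

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p" and attrs.get("id") == "cmsg":
            self.has_placeholder = True
        elif tag == "iframe" and attrs.get("src"):
            self.iframe_sources.append(attrs["src"])


def challenge_script_ran(html: str) -> bool:
    """Return True if the HTML looks like the post-JavaScript snapshot."""
    probe = ChallengeProbe()
    probe.feed(html)
    injected = any("captcha-delivery.com" in src for src in probe.iframe_sources)
    return injected and not probe.has_placeholder
```

Calling `challenge_script_ran(resp.text)` in the callback would show whether the handler returned the DOM before or after `c.js` executed, which separates the rendering question from the captcha question.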