fabienvauchelles / scrapoxy

Scrapoxy is a super proxy aggregator, allowing you to manage all proxies in one place 🎯 rather than spreading them across multiple scrapers 🕸️. It also smartly handles traffic routing 🔀 to minimize bans and increase success rates 🚀.
http://scrapoxy.io
MIT License

407 error on DO instances #219

Closed · mntolia closed this issue 9 months ago

mntolia commented 9 months ago

Current Behavior

When I use DigitalOcean (DO) instances with my project, I get a 407 error. I do not get the same error when using IPRoyal proxies with Scrapoxy.

I use curl_cffi to emulate a browser's TLS fingerprint. It works fine with IPRoyal.

Expected Behavior

I should get a 200 response code

Steps to Reproduce

  1. Install & set up the DO (DigitalOcean) connector
  2. Send a request through Scrapoxy (a minimal failing sketch is shown below)
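
To illustrate, the failing request looks roughly like this (a minimal sketch; the Scrapoxy master URL and credentials are placeholders):

from curl_cffi import requests

# Placeholders for the Scrapoxy master endpoint and project credentials
PROXY = "http://USERNAME:PASSWORD@localhost:8888"

# Raises RequestsError:
# "Received HTTP code 407 from proxy after CONNECT"
response = requests.get(
    "https://www.ah.nl/sitemaps/entities/products/detail.xml",
    impersonate="chrome110",  # emulate a browser TLS fingerprint
    proxies={"http": PROXY, "https": PROXY},
)
print(response.status_code)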

Failure Logs

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/curl_cffi/requests/session.py", line 929, in request
    await task
curl_cffi.curl.CurlError: Failed to perform, ErrCode: 56, Reason: 'Received HTTP code 407 from proxy after CONNECT'. This may be a libcurl error, See https://curl.se/libcurl/c/libcurl-errors.html first for more details.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/defer.py", line 1693, in _inlineCallbacks
    result = context.run(
  File "/usr/local/lib/python3.10/dist-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.10/dist-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/defer.py", line 1065, in adapt
    extracted = result.result()
  File "/usr/local/lib/python3.10/dist-packages/scrapy_impersonate/handler.py", line 44, in _download_request
    response = await self.client.request(**RequestParser(request).as_dict())  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/curl_cffi/requests/session.py", line 934, in request
    raise RequestsError(str(e), e.code, rsp) from e
curl_cffi.requests.errors.RequestsError: Failed to perform, ErrCode: 56, Reason: 'Received HTTP code 407 from proxy after CONNECT'. This may be a libcurl error, See https://curl.se/libcurl/c/libcurl-errors.html first for more details.
2024-02-07 01:42:01 [scrapy.core.engine] INFO: Closing spider (finished)

Scrapoxy Version

latest

Additional Information

EDIT: I also tried without the curl_cffi library. I still get the same response.

2024-02-06 23:33:43 [scrapy.core.engine] INFO: Spider opened
2024-02-06 23:33:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-02-06 23:33:43 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-02-06 23:33:50 [scrapy.core.engine] DEBUG: Crawled (407) <GET https://www.ah.nl/sitemaps/entities/products/detail.xml> (referer: None)
2024-02-06 23:33:50 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <407 https://www.ah.nl/sitemaps/entities/products/detail.xml>: HTTP status code is not handled or not allowed
2024-02-06 23:33:50 [scrapy.core.engine] INFO: Closing spider (finished)
fabienvauchelles commented 9 months ago

Ok, perfect. I will try to reproduce. Thanks!

fabienvauchelles commented 9 months ago

Hello @mntolia ,

I use scrapy-impersonate with Scrapoxy:

Here is the spider:

from typing import Iterable

import scrapy
from scrapy import Request

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["www.ah.nl"]

    def start_requests(self) -> Iterable[Request]:
        yield Request(
            url="https://www.ah.nl/sitemaps/entities/products/detail.xml",
            dont_filter=True,
            meta={
                "impersonate": "chrome110",
                "impersonate_args": {
                    "verify": False,
                },
            },
            callback=self.parse
        )

    def parse(self, response):
        pass

And settings.py:

BOT_NAME = "testscrapy"
SPIDER_MODULES = ["testscrapy.spiders"]
NEWSPIDER_MODULE = "testscrapy.spiders"
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'scrapoxy.ProxyDownloaderMiddleware': 100,
}
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
SCRAPOXY_MASTER = "http://localhost:8888"
SCRAPOXY_API = "http://localhost:8890/api"
SCRAPOXY_USERNAME = "<USERNAME>"
SCRAPOXY_PASSWORD = "<PASSWORD>"
DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}

Scrapoxy has 1 droplet running on DigitalOcean.

Requests correctly go through Scrapoxy (I get a 403, but that comes from the site's antibot, not the proxy).

Did you set "verify": False on the request?
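
For reference, the equivalent standalone curl_cffi call would look something like this (a sketch; the master URL and credentials are placeholders). Scrapoxy's master intercepts HTTPS traffic with its own certificate, so certificate verification has to be disabled:

from curl_cffi import requests

# Placeholders for the Scrapoxy master endpoint and project credentials
PROXY = "http://USERNAME:PASSWORD@localhost:8888"

response = requests.get(
    "https://www.ah.nl/sitemaps/entities/products/detail.xml",
    impersonate="chrome110",
    proxies={"http": PROXY, "https": PROXY},
    verify=False,  # accept Scrapoxy's self-signed MITM certificate
)
print(response.status_code)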

mntolia commented 9 months ago

It worked, thanks @fabienvauchelles!

It was indeed the issue of me not setting verify. I appreciate you taking the time to test!

fabienvauchelles commented 9 months ago

You're welcome. Thanks for using Scrapoxy!