apify / proxy-chain

Node.js implementation of a proxy server (think Squid) with support for SSL, authentication and upstream proxy chaining.
https://www.npmjs.com/package/proxy-chain
Apache License 2.0

Inconsistent HTTP Header Formatting Across Environments #552

Closed (fduman closed this issue 2 months ago)

fduman commented 2 months ago

We are using Node.js 20.9.0 and proxy-chain 2.3.0 within crawlee 3.11.2.

Our proxy uses basic authentication, and we pass the proxy URL along with credentials to PlaywrightCrawler from crawlee.

While proxy-chain works without any issues on my local machine, it fails in our staging and production environments. All environments use Linux x64 containers.

To troubleshoot, I patched proxy-chain on the container to enable verbose mode and added extra log lines to ensure that the correct credentials were being sent.

Here’s a sample log output:

INFO  PlaywrightCrawler: Starting the crawler.
ProxyServer[32941]: Listening...
ProxyServer[32941]: !!! Handling CONNECT example.com:443 HTTP/1.1
ProxyServer[32941]: Using upstream proxy http://<redacted>:<redacted>@51.X.X.X:8888/
ProxyServer[32941]: Using chain() => example.com:443
ProxyServer[32941]: Failed to authenticate upstream proxy: 407 host,example.com:443,proxy-authorization,Basic Ym4..<redacted>
ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Detected a session error, rotating session...
page.goto: net::ERR_TUNNEL_CONNECTION_FAILED at https://example.com/

I confirmed that the correct credentials were sent. We used tcpdump for packet inspection on the proxy machine, and here’s what we found:

CONNECT www.example.com:443 HTTP/1.1
0: host
1: www.example.com:443
2: proxy-authorization
3: Basic xxx=
Host: 127.0.0.1:8888
Connection: keep-alive

As you can see, the HTTP headers are malformed: the array indices (0 to 3) appear on the wire as header names, and each header name and value is sent as a separate header value. I reviewed the arguments of Node.js's http.request function, and as I read the documentation, headers should be passed as a dictionary (a plain object), but in this case they were passed as an array. I am unsure why proxy-chain behaves inconsistently across environments.
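One possible way to end up with the header lines seen in the tcpdump capture is sketched below. This is only an illustration under an assumption: some layer between proxy-chain and the socket treats the flat header array as if it were a plain object. The function name copyHeadersAsObject is hypothetical and is not taken from proxy-chain, Node.js, or any other library.

```javascript
// Flat header list in the [name, value, name, value] layout used by the original options.
const rawHeaders = ['host', 'www.example.com:443', 'proxy-authorization', 'Basic xxx='];

// Hypothetical wrapper that ASSUMES headers is a plain object and shallow-copies it.
// Spreading an array into an object literal turns its indices into string keys.
function copyHeadersAsObject(headers) {
    return { ...headers };
}

const copied = copyHeadersAsObject(rawHeaders);

// Serialize the copy the way header lines appear on the wire.
const lines = Object.entries(copied).map(([name, value]) => `${name}: ${value}`);
console.log(lines.join('\n'));
```

In this sketch the array indices become the header names, which matches the shape of the captured request. It does not show which component actually performs such a copy.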

To resolve the issue, I patched proxy-chain on the container. Below are the changes:

Original code:

const options = {
    method: 'CONNECT',
    path: request.url,
    headers: [
        'host',
        request.url,
    ],
    localAddress: handlerOpts.localAddress,
    family: handlerOpts.ipFamily,
    lookup: handlerOpts.dnsLookup,
};
if (proxy.username || proxy.password) {
    options.headers.push('proxy-authorization', (0, get_basic_1.getBasicAuthorizationHeader)(proxy));
}

Modified code:

const options = {
    method: 'CONNECT',
    path: request.url,
    headers: { 'host': request.url },
    localAddress: handlerOpts.localAddress,
    family: handlerOpts.ipFamily,
    lookup: handlerOpts.dnsLookup,
};
if (proxy.username || proxy.password) {
    options.headers['proxy-authorization'] = get_basic_1.getBasicAuthorizationHeader(proxy);
}

After applying this modification, the issue was resolved. Any comments?
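For anyone who needs the same workaround without editing each call site by hand, here is a minimal sketch of converting a flat header list into the object form used in the modified code. The helper name flatHeadersToObject is hypothetical and not part of proxy-chain. It also assumes that no header name occurs twice, because a repeated name would overwrite the earlier value.

```javascript
// Hypothetical helper: converts a flat [name, value, name, value] list
// into a plain object keyed by lower-cased header name.
function flatHeadersToObject(flat) {
    if (flat.length % 2 !== 0) {
        throw new TypeError('Expected an even number of entries (name/value pairs)');
    }
    const headers = {};
    for (let i = 0; i < flat.length; i += 2) {
        // A repeated header name overwrites the earlier value (assumption: no duplicates).
        headers[String(flat[i]).toLowerCase()] = flat[i + 1];
    }
    return headers;
}

const headers = flatHeadersToObject([
    'host', 'example.com:443',
    'proxy-authorization', 'Basic xxx=',
]);
console.log(headers);
```

The resulting object has the same shape as the headers value in the modified options above.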

jirimoravcik commented 2 months ago

Hello, this has already been discussed in the past, e.g. see https://github.com/apify/proxy-chain/pull/528

fduman commented 2 months ago

This is probably another Sentry issue as mentioned in #547.

fduman commented 2 months ago

I can confirm that the main issue was Sentry.