fabienvauchelles / scrapoxy

Scrapoxy is a super proxy aggregator, allowing you to manage all proxies in one place 🎯, rather than spreading them across multiple scrapers 🕸️. It also smartly handles traffic routing 🔀 to minimize bans and increase success rates 🚀.
http://scrapoxy.io
MIT License
1.99k stars 235 forks

Proxy CONNECT aborted on HTTPS redirection #3

Closed Florian95 closed 8 years ago

Florian95 commented 8 years ago

Hello,

Excellent work, but it does not seem to work with HTTP-to-HTTPS redirection. For example: http://github.com/fabienvauchelles/scrapoxy

HTTP:

$ curl -i --proxy http://127.0.0.1:8888 http://github.com/fabienvauchelles/scrapoxy
HTTP/1.1 301 Moved Permanently
content-length: 0
location: https://github.com/fabienvauchelles/scrapoxy
connection: close
x-cache-proxyname: i-ae00b817
Date: Tue, 08 Dec 2015 23:33:56 GMT

HTTPS:

$ curl --proxy http://127.0.0.1:8888 https://github.com/fabienvauchelles/scrapoxy
curl: (56) Proxy CONNECT aborted

Regards,

fabienvauchelles commented 8 years ago

Hello,

Thanks for your feedback :)

That is normal behavior.

For HTTP/HTTPS, there are 2 methods:

Method REQUEST for HTTP

  1. A client makes an HTTP request to a proxy, in REQUEST mode;
  2. The proxy replays the same HTTP request to the target.

Method CONNECT for HTTPS

  1. A client makes an HTTP request to a proxy, in CONNECT mode;
  2. The proxy creates a TCP tunnel between the client and the target.

=> The proxy only sees a TCP tunnel. The content is encrypted, so the proxy cannot read or modify it.
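To make the two methods above concrete, here is a minimal sketch of the first bytes a client would send to the proxy in each mode. The helper names `request_mode_head` and `connect_mode_head` are hypothetical and are not part of Scrapoxy or cURL; they only illustrate the shape of the request line.

```python
from urllib.parse import urlsplit


def request_mode_head(url: str) -> bytes:
    """REQUEST mode: the full URL goes in the request line, so the proxy
    sees (and can rewrite) the headers before replaying the request."""
    host = urlsplit(url).netloc
    return f"GET {url} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")


def connect_mode_head(url: str) -> bytes:
    """CONNECT mode: the client only names host:port; everything sent
    afterwards travels through an opaque TCP tunnel."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    authority = f"{parts.hostname}:{port}"
    return f"CONNECT {authority} HTTP/1.1\r\nHost: {authority}\r\n\r\n".encode("ascii")


if __name__ == "__main__":
    print(request_mode_head("http://github.com/fabienvauchelles/scrapoxy"))
    print(connect_mode_head("https://github.com/fabienvauchelles/scrapoxy"))
```

In the CONNECT case the proxy never receives a User-Agent header in clear text, which is why it has nothing to rewrite.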

This is a huge problem because Scrapoxy changes HTTP headers on the fly (User-Agent).

To work around the problem, Scrapoxy accepts only plain HTTP proxy requests. For an HTTPS request, you must send the HTTPS URL in REQUEST mode. See the Tutorial.

Why doesn't cURL work? cURL cannot use HTTPS in REQUEST mode. If cURL detects an HTTPS URL, it asks the proxy for CONNECT mode. More information here and here.

The Scrapy framework can send HTTPS URLs in REQUEST mode.

You must specify 'noconnect' in your proxy URL:

PROXY = 'http://127.0.0.1:8888/?noconnect'
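As a rough sketch of how that value could reach Scrapy, a small downloader middleware can copy it into `request.meta['proxy']`, which is where Scrapy looks up the proxy for a request. The class name `NoConnectProxyMiddleware` is hypothetical, the middleware would still have to be enabled in `DOWNLOADER_MIDDLEWARES`, and support for the `?noconnect` suffix may depend on the Scrapy version in use.

```python
# Proxy URL quoted in this thread; the '?noconnect' suffix asks Scrapy
# to send HTTPS URLs in REQUEST mode instead of opening a CONNECT tunnel.
PROXY = "http://127.0.0.1:8888/?noconnect"


class NoConnectProxyMiddleware:
    """Hypothetical downloader middleware that routes every request
    through the Scrapoxy endpoint defined in PROXY."""

    def process_request(self, request, spider):
        # Scrapy reads the proxy for a request from request.meta['proxy'].
        request.meta["proxy"] = PROXY
        # Returning None lets Scrapy continue handling the request normally.
        return None
```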

Hope this helps, Fabien.