apify / got-scraping

HTTP client made for scraping based on got.

HTTP traffic over HTTPS proxy #126

Closed (barjin closed this issue 3 months ago)

barjin commented 6 months ago

After looking into this Crawlee issue, it seems that we incorrectly handle HTTP requests routed through an HTTPS proxy (i.e. the connection to the proxy server itself goes over TLS).

The logic picking the correct Agent for the proxy / request combination is here (reader - open this, check it out for 10 seconds and come back, please).

This tells us that different requests are routed through different Agents.


Now, can I get another pair of eyes to confirm what is happening with the different Agents in this file?

From what it seems to me:

It seems we have two attributes to care for:

The former we can solve simply by checking the proxy and target URL protocols. The latter seems to depend on the proxy used. Is there any way to find out whether a proxy server supports only the pathname approach or only the CONNECT method? Is one of them (much) more prevalent? Can HTTPS traffic go only over CONNECT? (I would assume so; otherwise it sounds like a really bad MITM opportunity.) But even in that case, what do we pick for the (now broken) HTTP request over an HTTPS proxy?
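To make the two decisions concrete, here is a minimal sketch of how they could be separated. It is not got-scraping's actual logic: `chooseAgent`, `AgentChoice`, and the `proxyIsConnectOnly` flag are hypothetical names, and the sketch assumes that "pathname" means sending the full target URL in the request line so the proxy forwards it. The default of forwarding plain HTTP is also an assumption, since which mode a given proxy supports is exactly the open question above.

```typescript
// Hypothetical names throughout; this is not got-scraping's actual API.
type ProxyTransport = 'tcp' | 'tls';
type ProxyStrategy = 'forward' | 'connect';

interface AgentChoice {
    // How we talk to the proxy itself: plain TCP or TLS.
    proxyTransport: ProxyTransport;
    // How the proxy is asked to reach the target.
    strategy: ProxyStrategy;
}

function chooseAgent(
    proxyUrl: string,
    targetUrl: string,
    // Assumption: the caller knows (e.g. from configuration) that the proxy only tunnels.
    proxyIsConnectOnly = false,
): AgentChoice {
    const proxyProtocol = new URL(proxyUrl).protocol;
    const targetProtocol = new URL(targetUrl).protocol;

    for (const protocol of [proxyProtocol, targetProtocol]) {
        if (protocol !== 'http:' && protocol !== 'https:') {
            throw new Error(`Unsupported protocol: ${protocol}`);
        }
    }

    // First attribute: derived purely from the proxy URL protocol.
    const proxyTransport: ProxyTransport = proxyProtocol === 'https:' ? 'tls' : 'tcp';

    // Second attribute: HTTPS targets are tunneled with CONNECT so TLS stays end-to-end;
    // plain HTTP targets are forwarded unless the proxy is known to be CONNECT-only.
    const strategy: ProxyStrategy =
        targetProtocol === 'https:' || proxyIsConnectOnly ? 'connect' : 'forward';

    return { proxyTransport, strategy };
}
```

With this split, the broken case from this issue (HTTP target, HTTPS proxy) maps to a TLS connection to the proxy plus either forwarding or CONNECT, depending on what the proxy accepts.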