dotnet / runtime

HttpClient without initial proxy auth limits web scraping capability #100515

Closed ksmib closed 6 months ago

ksmib commented 7 months ago

Description

When using HttpClient with a proxy, it omits proxy authentication from its initial CONNECT request by default, sending the credentials only after receiving a 407 HTTP status code from the proxy.

Unlike server authentication, where SocketsHttpHandler.PreAuthenticate allows credentials to be sent preemptively, there is no equivalent option for proxy authentication.
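
For illustration, here is roughly how this looks today; the proxy address and credentials are placeholders:

```csharp
using System.Net;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // Server auth: PreAuthenticate sends the Authorization header
    // preemptively once a server is known to require it.
    PreAuthenticate = true,
    Credentials = new NetworkCredential("user", "pass"),

    // Proxy auth: credentials are only sent after the proxy answers
    // with 407 -- there is no preemptive option here.
    Proxy = new WebProxy("http://proxy.example.com:8080")
    {
        Credentials = new NetworkCredential("proxyUser", "proxyPass"),
    },
};

using var client = new HttpClient(handler);
```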

This behavior limits its suitability for web scraping tasks. Every major proxy provider uses backconnect proxies, which means clients connect to the same address even when using different proxies.

These proxy providers will temporarily deny a client's new connections after it triggers dozens of "407 Proxy Authentication Required" errors in a short time.

For those unfamiliar with the underlying issue, it will appear as though the majority of their requests are failing, each resulting in an HttpRequestException.

The lack of a preemptive proxy authentication feature makes .NET unsuitable for web scraping.

In contrast, curl and Python's urllib send proxy authentication credentials on the initial CONNECT request by default.

I tried manually adding a "Proxy-Authorization" header, but it only helps with plain HTTP requests, not with HTTPS requests (which are the majority of cases), since the CONNECT request's headers are set independently, as sketched below.
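
A minimal repro of the workaround I tried (proxy address and credentials are placeholders):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

var handler = new SocketsHttpHandler
{
    Proxy = new WebProxy("http://proxy.example.com:8080"),
};
using var client = new HttpClient(handler);

// Attach Proxy-Authorization to every request this client sends.
var token = Convert.ToBase64String(Encoding.UTF8.GetBytes("proxyUser:proxyPass"));
client.DefaultRequestHeaders.ProxyAuthorization =
    new AuthenticationHeaderValue("Basic", token);

// Plain HTTP: the request itself goes to the proxy, header included -- works.
await client.GetAsync("http://example.com/");

// HTTPS: the tunnel is set up by a CONNECT request with its own headers,
// so the header above never reaches the proxy and the tunnel request
// fails (HttpRequestException after the 407).
await client.GetAsync("https://example.com/");
```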

The only workaround I can think of is, similar to SocketsHttpHandler.PreAuthenticate for preemptive server authentication, adding a PreAuthenticateProxy option that adds the "Proxy-Authorization" header to the initial CONNECT request.
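
A sketch of the proposed shape; PreAuthenticateProxy is hypothetical and does not exist today:

```csharp
var handler = new SocketsHttpHandler
{
    Proxy = new WebProxy("http://proxy.example.com:8080")
    {
        Credentials = new NetworkCredential("proxyUser", "proxyPass"),
    },
    // Hypothetical property: send Proxy-Authorization on the initial
    // CONNECT request instead of waiting for the 407 challenge.
    PreAuthenticateProxy = true,
};
```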

dotnet-policy-service[bot] commented 7 months ago

Tagging subscribers to this area: @dotnet/ncl. See info in area-owners.md if you want to be subscribed.

ManickaP commented 7 months ago

@wfurt could you help here? This is beyond my expertise.

wfurt commented 7 months ago

In essence this looks like a dup of #66244. Do you expect basic auth for the proxy, @ksmib, or do you expect something else?

Besides just sending the credentials, we could have a separate property as mentioned in #66244. Also, @blowdart and @GrabYourPitchforks may have some take on this.

Since the proxy is generally a single site, I feel it would be reasonable to remember whether it required authentication and send the credentials on subsequent requests. Even if it's basic auth, the damage has already been done. That would at least eliminate the issue with many 407s.

ksmib commented 7 months ago

Yes, basic auth is what these proxy providers use. It would seem unreasonable for them to require less common auth schemes while also banning clients for too many 407s.

The author of issue #66244 talks about the extra round trip, which doesn't bother me much. The critical issue, as mentioned by another user, matteocontrini, is the blacklisting situation.

I think remembering proxy auth is a good approach as well, besides PreAuthenticateProxy. However, since backconnect proxies use the same address but different ports for different proxies, we need separate HttpClient instances for different proxies (see the sketch below). If the remember feature works per HttpClient, it might still lead to bans due to too many initial 407s in larger-scale scraping. In my test case, bans occurred after 50-100 407s were triggered within a few seconds.
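
For example, with a hypothetical provider host and port range:

```csharp
using System.Linq;
using System.Net;
using System.Net.Http;

// Backconnect providers expose each proxy as the same host with a
// different port, so every proxy URI gets its own handler/client
// and therefore its own connection pool and auth cache.
HttpClient CreateClient(int port) => new(new SocketsHttpHandler
{
    Proxy = new WebProxy($"http://gate.provider.example:{port}")
    {
        Credentials = new NetworkCredential("user", "pass"),
    },
});

// 150 proxies => 150 clients => up to 150 initial 407s in a burst.
var clients = Enumerable.Range(10001, 150).Select(CreateClient).ToList();
```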

ManickaP commented 7 months ago

remember feature works per HttpClient

Yes, it's even per connection pool, i.e. different proxy URIs will lead to different caches even on the same client. So if you're connecting via a different port to the same proxy, pre-auth would not help here. @wfurt please correct me if I'm reading the code incorrectly here.

rzikm commented 7 months ago

Triage: may be related to https://github.com/dotnet/runtime/issues/93340 which we aim to do something about in 9.0, so let's try to include this one as well.

ksmib commented 6 months ago

Triage: may be related to #93340 which we aim to do something about in 9.0, so let's try to include this one as well.

Not sure if CredentialCache would solve this issue, since @ManickaP pointed out that the cache works per connection pool. The proxies share the same host but use different ports, which requires multiple HttpClient instances.

wfurt commented 6 months ago

That may be what the code does today. Even for servers, HttpClient keeps a cache of URL paths that need pre-authentication. We could do something similar for proxies across the pools. In my mind, the first connection would get a 407 and authenticate, but all subsequent connections would send the Proxy-Authorization header when connecting to the same proxy. E.g. for each HttpClient/HttpHandler, you would see one failed attempt for any given proxy (plus any race conditions on start).
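
Roughly what I have in mind, as a sketch only, not the actual SocketsHttpHandler internals:

```csharp
using System;
using System.Collections.Concurrent;

// Handler-wide cache shared across connection pools: remember which
// proxies have challenged us, keyed by proxy authority (host:port).
// Keying by host alone would also cover same-host/different-port setups.
var proxiesRequiringAuth = new ConcurrentDictionary<string, bool>();

// Called when a proxy answers the first CONNECT with 407.
void RememberProxyChallenge(Uri proxyUri) =>
    proxiesRequiringAuth[proxyUri.Authority] = true;

// Consulted before each new CONNECT: if true, attach Proxy-Authorization
// preemptively so only the very first connection pays the 407.
bool ShouldPreAuthenticate(Uri proxyUri) =>
    proxiesRequiringAuth.ContainsKey(proxyUri.Authority);
```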

That should IMHO avoid triggering dozens of "407 Proxy Authentication Required" errors in a short time. But are we talking about a few different proxies? Hundreds? Thousands? And are they cooperating, e.g. would one 407 for a given proxy account also count for the other proxies?

While I feel we can make some improvements, I also think this is somewhat of a corner case, e.g. low priority compared to some of the other issues.

ksmib commented 6 months ago

But are we talking about a few different proxies? Hundreds? Thousands?

Got it. When web scraping, we mostly use proxy providers that charge by bandwidth rather than by number of proxies, so it's common to use hundreds of proxies. In my scenario, using about 150 proxies at once quickly led to blacklisting. The ban affects the connecting IP, blocking new connections for the remaining proxies as well.

With the proposed implementation, the blacklisting issue might not disappear entirely, but it would likely only happen initially and then stabilize, which would be a significant improvement over the current state.