Closed ksmib closed 6 months ago
Tagging subscribers to this area: @dotnet/ncl. See info in area-owners.md if you want to be subscribed.
@wfurt could you help here? This is beyond my expertise.
In essence this looks like a dup of #66244. Do you expect basic auth for the proxy, @ksmib, or do you expect something else?
Besides just sending the credentials, we could have a separate property as mentioned in #66244. Also, @blowdart and @GrabYourPitchforks may have some take on this.
Since the proxy is generally a single site, I feel it would be reasonable to remember that it required authentication and send the credentials on subsequent requests. Even with basic auth, the damage has already been done. That would at least eliminate the issue with many 407s.
Yes, basic auth is what these proxy providers use. It would seem unreasonable to me for them to require less common auth schemes while also banning clients for too many 407s.
Issue #66244 author talks about the extra round trip, which doesn't bother me much. The critical issue, as mentioned by another user matteocontrini, is the blacklist situation.
I think remembering proxy auth is a good approach as well, in addition to PreAuthenticateProxy. However, backconnecting proxies use the same address but different ports for different proxies, so we need separate HttpClient instances for different proxies. If the remember feature works per HttpClient, it might still lead to bans due to too many initial 407s in larger-scale scraping. In my test case, bans occurred after 50-100 407 responses within a few seconds.
Regarding "remember feature works per HttpClient": yes, it's even per connection pool, i.e. different proxy URIs will lead to different caches even on the same client. So if you're connecting via a different port to the same proxy, pre-auth would not help here. @wfurt please correct me if I'm reading the code incorrectly here.
Triage: may be related to https://github.com/dotnet/runtime/issues/93340 which we aim to do something about in 9.0, so let's try to include this one as well.
I'm not sure CredentialCache would solve this issue, since @ManickaP pointed out that the cache works per connection pool. The proxies share the same host but use different ports, which requires multiple HttpClient instances.
That may be what the code does today. Even for servers, HttpClient keeps a cache of URL paths that need pre-authentication. We can do something similar for proxies across the pools. In my mind, the first connection would get a 407 and authenticate, but all subsequent connections would carry the Proxy-Authorization header if connecting to the same proxy. E.g. for each HttpClient/HttpHandler, you would see one failed attempt for any given proxy (plus any race conditions on start).
That should IMHO avoid "triggering dozens of '407 Proxy Authentication Required' errors in a short time". But are we talking about a few different proxies? Hundreds? Thousands? And are they cooperating, e.g. would one 407 for a given proxy also count for different proxies?
While I feel we can make some improvements, I also think this is somewhat of a corner case, i.e. low priority compared to some of the other issues.
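To make the idea concrete, here is a minimal sketch of such a "remember the 407" cache shared across pools. It is not what HttpClient does today; `proxiesNeedingAuth`, `ShouldPreAuthenticate` and `Record407` are hypothetical names, and keying by host only (ignoring the port) is an assumption made to cover the same-host, different-port case discussed above.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical shared cache: proxy host -> "has already answered 407 once".
// Keyed by host only (assumption), so every port of a backconnect proxy shares one entry.
var proxiesNeedingAuth = new ConcurrentDictionary<string, bool>(StringComparer.OrdinalIgnoreCase);

// True when Proxy-Authorization should be attached to the first request or CONNECT.
bool ShouldPreAuthenticate(Uri proxyUri) => proxiesNeedingAuth.ContainsKey(proxyUri.Host);

// Called after a 407 so that later connections to the same host skip the failed attempt.
void Record407(Uri proxyUri) => proxiesNeedingAuth[proxyUri.Host] = true;

// Hypothetical proxy endpoints: same host, different ports.
var first = new Uri("http://proxy.example.com:10001");
var second = new Uri("http://proxy.example.com:10002");

Console.WriteLine(ShouldPreAuthenticate(first));   // False: nothing recorded yet
Record407(first);                                  // the first connection received a 407
Console.WriteLine(ShouldPreAuthenticate(second));  // True: same host, different port
```

With a cache like this, only the first connection per proxy host (plus any that race with it at startup) would see a 407.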
Got it. When web scraping, we mostly use proxy providers that charge by bandwidth rather than by number of proxies, so it's common to use hundreds of proxies. In my scenario, using about 150 proxies at once quickly led to a blacklist situation. The ban applies to the connecting IP, which blocks new connections through the remaining proxies as well.
With the new implementation, while the blacklist issue might not disappear entirely, it's likely to only happen initially and then stabilize, which is a significant improvement over the current state.
Description
When using HttpClient with a proxy, it defaults to omitting proxy authentication in its initial CONNECT request, sending the credentials only after receiving a 407 HTTP status code from the proxy.
Unlike with server authentication, where SocketsHttpHandler.PreAuthenticate allows for sending credentials preemptively, there is no equivalent option for proxy authentication.
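For reference, a minimal sketch of the setup in question. The proxy address and credentials are placeholders, not a real provider; the point is that PreAuthenticate on the handler covers server credentials only and has no proxy counterpart.

```csharp
using System;
using System.Net;
using System.Net.Http;

// Placeholder proxy endpoint and credentials, for illustration only.
var proxy = new WebProxy("http://proxy.example.com:8080")
{
    Credentials = new NetworkCredential("proxy-user", "proxy-pass")
};

var handler = new SocketsHttpHandler
{
    Proxy = proxy,
    UseProxy = true,
    // Affects server (Authorization) credentials only.
    // There is no equivalent switch for the proxy credentials set above.
    PreAuthenticate = true
};

using var client = new HttpClient(handler);
Console.WriteLine($"Proxy: {proxy.Address}, PreAuthenticate: {handler.PreAuthenticate}");
```

As described above, the proxy credentials configured this way are sent only after the proxy has answered with a 407.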
This behavior limits its suitability for web scraping tasks. Every major proxy provider uses backconnecting proxies, which means clients connect to the same address even for different proxies.
These proxy providers will temporarily refuse new connections from a client after it triggers dozens of "407 Proxy Authentication Required" errors in a short time.
To those unfamiliar with the underlying issue, it will appear as though the majority of their requests are failing with an HttpRequestException.
The lack of a preemptive proxy authentication feature makes .NET unsuitable for web scraping.
In contrast, curl and Python's urllib send proxy authentication credentials on the initial CONNECT request by default.
I tried manually adding a "Proxy-Authorization" header, but that helps only with HTTP requests, not with HTTPS requests (which are most of the cases), since the CONNECT request's headers are independent of the request's own headers.
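The manual attempt looked roughly like the sketch below. `BuildBasicProxyAuth` is a hypothetical helper and the credentials are placeholders.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

// Hypothetical helper: builds a Basic "Proxy-Authorization" value from user:password.
static AuthenticationHeaderValue BuildBasicProxyAuth(string user, string password)
{
    string token = Convert.ToBase64String(Encoding.UTF8.GetBytes($"{user}:{password}"));
    return new AuthenticationHeaderValue("Basic", token);
}

using var client = new HttpClient();

// Attached to the client's own requests. Per the observation above, this reaches the proxy
// for plain HTTP, but it is not copied onto the CONNECT request used to tunnel HTTPS.
client.DefaultRequestHeaders.ProxyAuthorization = BuildBasicProxyAuth("proxy-user", "proxy-pass");

Console.WriteLine(client.DefaultRequestHeaders.ProxyAuthorization);
```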
The only fix I can think of is an option analogous to SocketsHttpHandler.PreAuthenticate (which enables preemptive server authentication): a PreAuthenticateProxy setting that adds the "Proxy-Authorization" header to the initial CONNECT request.