dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Per-request proxies in HttpClient #35992

Open LeaFrock opened 5 years ago

LeaFrock commented 5 years ago

I'm developing a web crawler framework. As usual, the web crawler needs a pool of proxies to send HTTP requests through. For example:

        public async Task<string> GetRspTextByProxyAsync(IWebProxy proxy)
        {
            var handler = new SocketsHttpHandler()
            {
                UseProxy = true,
                Proxy = proxy
            };
            using (var client = new HttpClient(handler))
            {
                return await client.GetStringAsync("https://github.com");//just for example
            }
        }

As we all know, 'new HttpClient()' is not a recommended way to create clients, and it causes exceptions related to socket exhaustion when too many clients are created. Therefore, I want to use IHttpClientFactory instead. But there are still some problems. As far as I know, DefaultHttpClientFactory reuses HttpMessageHandler instances. But one handler must be created per proxy. If I have thousands of proxies, I have to create thousands of handlers. Also, I don't know how to use DI in this situation. I've searched Stack Overflow for solutions like the one below:

                    services.AddHttpClient("proxy1").ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler()
                    {
                        UseProxy = true,
                        Proxy = new WebProxy("http://localhost", 8888)
                    });
                    services.AddHttpClient("proxy2").ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler()
                    {
                        UseProxy = true,
                        Proxy = new WebProxy("http://localhost", 8889)
                    });
                    services.AddHttpClient("proxy3").ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler()
                    {
                        UseProxy = true,
                        Proxy = new WebProxy("http://localhost", 8890)
                    });
                    services.AddHostedService<MyHostedService>();

That's too ugly, and I need to create clients dynamically because the number of proxies is unknown before the app runs, so I can't hard-code them like this. What can we do to enjoy the benefits of HttpClient (with thousands of web proxies) and avoid the side effects? I'm looking forward to suggestions and guidance. Thank you in advance.

myFirstway commented 5 years ago

What should I do about this?

huoshan12345 commented 5 years ago

Same issue here. Think about this case: I have a web proxy pool and a thread that continuously adds new proxies to the pool. If I want to use those proxies, I have to create HttpClients dynamically (get a proxy from the pool and create an HttpClient with that proxy). How can I achieve this with HttpClientFactory?

myFirstway commented 4 years ago

Same issue here. Think about this case: I have a web proxy pool and a thread that continuously adds new proxies to the pool. If I want to use those proxies, I have to create HttpClients dynamically (get a proxy from the pool and create an HttpClient with that proxy). How can I achieve this with HttpClientFactory?

Now how do you solve this problem?

rynowak commented 4 years ago

See discussion here: https://github.com/aspnet/Extensions/issues/521#issuecomment-557943667

This is a fundamental limitation of HttpClient's design. We're going to look at a solution for this during 5.0

agertenbach commented 4 years ago

I ran into a similar problem related to contextual application of primary handler properties at runtime. I had a robust typed HttpClient with Polly and delegating handlers chained, but some property of the primary handler needed to be different based on a runtime attribute (i.e. different certificate based on URL, endpoint, user, etc.) and trying to register a half dozen versions of the same typed client pipeline did not make sense, even if I did want to set up all the handler configurations and certs within the composition root.

I wrote a library to extend the DefaultHttpClientFactory to resolve this: https://github.com/agertenbach/Ringleader https://www.nuget.org/packages/Ringleader/

It takes advantage of the fact that the primary handler management and pooling uses the name of the upstream named/typed client. By wrapping the options and tacking in an IHttpMessageHandlerBuilderFilter, we can parse out the original client name to resolve the client's pipeline and options, but then create/use a more granular primary handler in the handler pool based on some provided context when the typed client is requested. The library has two interfaces that have to be implemented, one to define the typed client and the context that you'll pass in (and translate it to a string), and then a second that can use that translated string to return a well-formed primary handler when a new one is requested by the DefaultHttpClientFactory.

Hope this helps -- feedback or comments welcome.

ghost commented 4 years ago

Tagging subscribers to this area: @dotnet/ncl Notify danmosemsft if you want to be subscribed.

ericstj commented 4 years ago

I still think this is important but it doesn't seem like we have the time to address this in 5.0. What do you think @dotnet/ncl? cc @davidfowl

scalablecory commented 4 years ago

You can implement a custom IWebProxy today that does round-robin proxy selection. Would this work for you, or do you need very explicit control over which proxy each request gets?

LeaFrock commented 4 years ago

You can implement a custom IWebProxy today that does round-robin proxy selection. Would this work for you, or do you need very explicit control over which proxy each request gets?

Using IWebProxy is far from enough. For example, I want to remove unavailable proxies and to promote better proxies into a spare proxy pool. That requires explicit control of each proxy and leaves the strategy flexible for developers.

LeaFrock commented 4 years ago

I still think this is important but it doesn't seem like we have the time to address this in 5.0. What do you think @dotnet/ncl? cc @davidfowl

Sorry to see that, but 'later' usually brings 'better'. The design does need careful thought.

See discussion here: dotnet/extensions#521 (comment)

This is a fundamental limitation of HttpClient's design. We're going to look at a solution for this during 5.0

scalablecory commented 4 years ago

You can implement a custom IWebProxy today that does round-robin proxy selection. Would this work for you, or do you need very explicit control over which proxy each request gets?

Using IWebProxy is far from enough. For example, I want to remove unavailable proxies and to promote better proxies into a spare proxy pool. That requires explicit control of each proxy and leaves the strategy flexible for developers.

Thanks for the example, that makes sense. We won't have anything for this in .NET 5, but can consider it for .NET 6. I'd love to see some more upvotes on the top comment to see interest.

GF-Huang commented 3 years ago

This is why more people use Python instead of C# to write crawlers, because .NET's HttpClient is really lame.

Aleksej-Shherbak commented 2 years ago

@scalablecory please, please add this feature! I remember building a website parser. It was very confusing to realize that there was no simple, straightforward way to do that. One of the requirements was to use a new proxy for each request. There was also a demand for adding proxies dynamically, which is an obvious need when we are talking about web parsers. If the proxies stop working, does it mean that I have to update the application config (which contains the list of proxies) and redeploy the app? My customer just asked me: "Give me an admin panel where I can add or remove proxies".

Dotnet is perfect, but web parsing with proxy is still convoluted.

wfurt commented 2 years ago

Doing anything per-request is going to be expensive. This may force the creation of a new connection, which is basically similar to creating a new HttpClient. Going back to the initial comment from @LeaFrock: "If I have thousands of proxies" -> that is going to create thousands of connections regardless.

Would people mostly use multiple proxies for load balancing or is it functional e.g. different subdomains/destinations need different exit points?

LeaFrock commented 2 years ago

that is going to create thousands of connections regardless.

Excuse me, that's not the point. The point is that when you have many proxies, you want only one connection per proxy, which means the connection must be reusable.

For now, if I have a proxy pool, I need to create an HttpClient instance before sending a request through some proxy. Then, if the proxy fails, I have to dispose of the instance and remove all references to it, right?

If the proxy works, I hope that the next time my app uses it, it can reuse the same connection as far as possible. So now the problem becomes that I have to maintain an HttpClient pool, which may waste even more resources.

If I just create a transient instance for a proxy every time, will HttpClient try to use the same connection? As far as I know, no, so connections may be created multiple times for requests through the same proxy. As I said above, this may cause socket exhaustion exceptions after the app has run for a while. Fortunately, we can now use SocketsHttpHandler and set PooledConnectionIdleTimeout and PooledConnectionLifetime to control the lifetime of pooled connections, which reduces the frequency of the problem.
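The HttpClient pool described above could look roughly like the following. This is only a minimal sketch under stated assumptions: `ProxyClientPool`, `GetClient`, and `Remove` are hypothetical names that do not exist in .NET, and the timeout values are arbitrary placeholders. It keeps one long-lived client (and therefore one connection pool) per proxy URI, and disposes the client when a proxy is evicted.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;

// Hypothetical helper, not a .NET type: one long-lived HttpClient per proxy.
public sealed class ProxyClientPool : IDisposable
{
    // Lazy<T> ensures only one client is actually created per proxy,
    // even if several threads ask for the same proxy at the same time.
    private readonly ConcurrentDictionary<Uri, Lazy<HttpClient>> _clients = new();

    // Returns the cached client for this proxy, creating it on first use.
    public HttpClient GetClient(Uri proxyUri) =>
        _clients.GetOrAdd(proxyUri, uri => new Lazy<HttpClient>(() => Create(uri))).Value;

    // Call when a proxy is considered dead; disposes its client and connections.
    // Requests still in flight on that client may fail once it is disposed.
    public void Remove(Uri proxyUri)
    {
        if (_clients.TryRemove(proxyUri, out Lazy<HttpClient>? lazy) && lazy.IsValueCreated)
        {
            lazy.Value.Dispose();
        }
    }

    private static HttpClient Create(Uri proxyUri)
    {
        var handler = new SocketsHttpHandler
        {
            UseProxy = true,
            Proxy = new WebProxy(proxyUri),
            // Placeholder values; tune them for the actual workload.
            PooledConnectionIdleTimeout = TimeSpan.FromMinutes(1),
            PooledConnectionLifetime = TimeSpan.FromMinutes(5),
        };
        return new HttpClient(handler, disposeHandler: true);
    }

    public void Dispose()
    {
        foreach (Uri key in _clients.Keys)
        {
            Remove(key);
        }
    }
}
```

With this shape, the crawler would ask the pool for a client each time it picks a proxy and call `Remove` when the proxy is judged unusable. The cost is still one handler per live proxy, which is exactly the overhead the comment is worried about.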

In the past, I didn't know a lot about HttpClient and thought IHttpClientFactory could help. I'm still wondering whether there is a better solution than what I described above.

Suppose HttpClient (or something else) provided a built-in way of handling proxies. Since it's closer to the underlying connections, developers would get a simpler way to build apps with better performance.

Would people mostly use multiple proxies for...

.NET works well in almost all areas, and web crawling is a relatively niche one. What I reported is the most common development scenario for web crawlers, but I think .NET does not provide enough best-practice guidance for it.

scalablecory commented 2 years ago

I appreciate the problem, @LeaFrock.

One possibility is that we update HttpRequestMessage with a Proxy member. As is common with abstractions, any existing handler implementations would silently ignore it. Not horrible, but also not ideal.
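Until something like that exists, a per-request proxy can be approximated today by carrying the proxy in `HttpRequestMessage.Options` (available since .NET 5) and dispatching to a handler bound to that proxy. The sketch below is not the API being proposed here; `ProxyRoutingHandler`, `ProxyKey`, and the `"ProxyUri"` key string are hypothetical names invented for illustration. It also shows the drawback mentioned above: any other handler simply ignores the option.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical type for illustration; nothing like it ships with .NET.
public sealed class ProxyRoutingHandler : HttpMessageHandler
{
    // Hypothetical option key; the string "ProxyUri" is an arbitrary choice.
    public static readonly HttpRequestOptionsKey<Uri> ProxyKey = new("ProxyUri");

    // One inner handler (and connection pool) per proxy, created lazily.
    private readonly ConcurrentDictionary<Uri, Lazy<HttpMessageInvoker>> _byProxy = new();

    // Used when a request carries no proxy option.
    private readonly HttpMessageInvoker _direct =
        new(new SocketsHttpHandler { UseProxy = false });

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        HttpMessageInvoker invoker = request.Options.TryGetValue(ProxyKey, out Uri? proxyUri)
            ? _byProxy.GetOrAdd(proxyUri, uri => new Lazy<HttpMessageInvoker>(() => Create(uri))).Value
            : _direct;
        return invoker.SendAsync(request, cancellationToken);
    }

    private static HttpMessageInvoker Create(Uri proxyUri) =>
        new(new SocketsHttpHandler { UseProxy = true, Proxy = new WebProxy(proxyUri) });

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            _direct.Dispose();
            foreach (Lazy<HttpMessageInvoker> lazy in _byProxy.Values)
            {
                if (lazy.IsValueCreated) lazy.Value.Dispose();
            }
        }
        base.Dispose(disposing);
    }
}

public static class ProxyRoutingExample
{
    // Builds a GET request tagged with the proxy it should be sent through.
    public static HttpRequestMessage CreateRequest(string url, Uri proxyUri)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Options.Set(ProxyRoutingHandler.ProxyKey, proxyUri);
        return request;
    }
}
```

A single `new HttpClient(new ProxyRoutingHandler())` could then send requests built by `CreateRequest` with `SendAsync`. The sketch never evicts inner handlers, so a real implementation would need a removal policy for dead proxies.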

Thomas-GH-CA commented 2 years ago

A potential workaround for my use case is to make new named clients, each with different credentials set. Does anyone know what a reasonable limit for named clients would be? My use case could require around 50, and I am worried that would cause problems.

wfurt commented 2 years ago

I don't think 50 clients would create a problem. This is far below common OS limits and GC capabilities. It should be fairly easy IMHO to create a test setup and try it, @Thomas-GH-CA.

Thomas-GH-CA commented 2 years ago

Thanks for the response, @wfurt. Looking back, my comment was vague and short on details, so apologies, but I just wanted to get an idea of whether 50 seemed wacky or not. It seems it isn't, so I will go ahead and try it out.

davidfowl commented 2 years ago

Generally, setting the proxy, cookie container or client certs requires managing handler instances on your own. You can no longer follow the guidance of "use a single handler for your application" as you're now storing "per user" state on the handler instance itself.

LeaFrock commented 2 years ago

You can no longer follow the guidance of "use a single handler for your application" as you're now storing "per user" state on the handler instance itself.

This is true in essence. But I hope there will be some improvements here, to offload upper-level developers and help them move in the correct direction of best practices.

bugproof commented 2 years ago

Does any workaround for this issue exist? It seems the options are either to re-create the client before making a new request or to change the WebProxy Address property before the next request. Is there anything wrong with the second approach?

I think if I add a client per proxy to IHttpClientFactory and switch the HttpClient instance in use when needed, it will be fine.

Vijay-Nirmal commented 2 years ago

@bugproof I did a POC using a single HttpClient instance with multiple proxy configurations in a round-robin fashion, and it works perfectly. But I haven't tested it in my actual application yet.

bugproof commented 2 years ago

@Vijay-Nirmal By changing Address property or implementing IWebProxy?

Vijay-Nirmal commented 2 years ago

@bugproof Implementing IWebProxy

The below code is just a quick POC, there might be issues in the code. Let me know if anyone knows how to improve the below code.

using System;
using System.Collections.Generic;
using System.Net;

// Not shown in the original POC; assumed shape inferred from how it is used.
public interface IProxyProvider
{
    Uri? GetProxy();
}

public class ManagedProxy : IWebProxy
{
    private readonly IProxyProvider _proxyProvider;

    public ManagedProxy(IProxyProvider proxyProvider)
    {
        _proxyProvider = proxyProvider;
    }

    public ICredentials? Credentials { get; set; }

    // Ignores the destination and hands out the next proxy in the rotation.
    public Uri? GetProxy(Uri destination)
    {
        return _proxyProvider.GetProxy();
    }

    public bool IsBypassed(Uri host)
    {
        return false;
    }
}

public class ProxyProvider : IProxyProvider
{
    private readonly object _proxyLock = new object();
    private int _currentProxyIndex = 0;
    private readonly List<Uri?> _proxies = new List<Uri?>();

    public Uri? GetProxy()
    {
        lock (_proxyLock) // May affect the performance
        {
            if (_proxies.Count == 0)
            {
                return null;
            }

            // Wrap first so the index stays valid even after proxies were removed.
            _currentProxyIndex %= _proxies.Count;
            return _proxies[_currentProxyIndex++];
        }
    }

    // Codes to add proxies to the list. I had a code to web scrape proxies from a website and add it to the list
}

davidfowl commented 2 years ago

@Vijay-Nirmal what does this accomplish? How does one set the proxy?

Vijay-Nirmal commented 2 years ago

@davidfowl I had a background service that periodically added new proxies to the list, checked the list for bad proxies, and removed them.

what does this accomplish? How does one set the proxy?

I need to scrape a website that has an IP-based rate limiter. Using this method, I could overcome that :)
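To make the "how does one set the proxy?" part concrete, here is one possible variant of the same round-robin idea. It is a sketch, not the code used above: `RotatingProxy`, `Add`, `Remove`, and `RotatingProxyExample` are hypothetical names. Readers take a snapshot of a copy-on-write array, so only the background service that adds or removes proxies ever takes a lock, and the proxy is attached to the handler once through `SocketsHttpHandler.Proxy`.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;

// Hypothetical name; a round-robin IWebProxy whose list can change at runtime.
public sealed class RotatingProxy : IWebProxy
{
    private readonly object _writeLock = new();
    private Uri[] _proxies = Array.Empty<Uri>(); // replaced wholesale on every change
    private int _counter = -1;

    public ICredentials? Credentials { get; set; }

    // Adds a proxy unless it is already in the list.
    public void Add(Uri proxy)
    {
        lock (_writeLock)
        {
            if (Array.IndexOf(_proxies, proxy) >= 0) return;
            var copy = new Uri[_proxies.Length + 1];
            _proxies.CopyTo(copy, 0);
            copy[^1] = proxy;
            Volatile.Write(ref _proxies, copy);
        }
    }

    // Removes a proxy, for example after a health check marks it as bad.
    public void Remove(Uri proxy)
    {
        lock (_writeLock)
        {
            Volatile.Write(ref _proxies, Array.FindAll(_proxies, p => !p.Equals(proxy)));
        }
    }

    // Called by the handler for each request; the destination is ignored.
    public Uri? GetProxy(Uri destination)
    {
        Uri[] snapshot = Volatile.Read(ref _proxies);
        if (snapshot.Length == 0) return null;
        // Interlocked.Increment wraps on overflow; the uint cast keeps the index non-negative.
        uint next = unchecked((uint)Interlocked.Increment(ref _counter));
        return snapshot[(int)(next % (uint)snapshot.Length)];
    }

    public bool IsBypassed(Uri host) => false;
}

public static class RotatingProxyExample
{
    // The proxy object is set once; later Add/Remove calls affect subsequent requests.
    public static HttpClient CreateClient(RotatingProxy proxy) =>
        new(new SocketsHttpHandler { UseProxy = true, Proxy = proxy });
}
```

This still gives no control over which proxy a particular request uses, which is the limitation raised earlier in the thread.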

gitlsl commented 2 years ago

I appreciate the problem. Can we add proxy and cookie settings to HttpRequestMessage? Python's network request libraries (requests, httpx) are very convenient. Maybe we can learn from them.

davidfowl commented 2 years ago

@gitlsl Do you have a concrete API proposal we can work through?

Python's network request library (requests httpx) is very convenient. Maybe we can learn from it

Do you have a direct comparison for this scenario?

gitlsl commented 2 years ago

Doing anything per-request is going to be expensive. This may force creation of new connection and that is basically similar to creating new HttpClient. Going back to initial comment from @LeaFrock: "If I have thousands of proxies" -> that is going to create thousands of connections regardless.

@davidfowl As others mentioned above, achieving this effect would cost a lot because it generates a lot of connections. In fact, I have not tested it. When I have this requirement, I usually use Python directly.

Here is an interesting phenomenon. You can observe that many people in this thread are from the same eastern country (including me). If I remember correctly, the people who contributed the SOCKS5 proxy support PR are also from this country.

For me, I wish the API looked like https://requests.readthedocs.io/en/latest/api/

  1. Independent headers (including cookies) and a proxy that can be set for each request. Headers and proxy are properties of HttpRequestMessage; HttpClient just sends the request and constructs the response.
  2. Create a session from HttpClient to maintain cookies across multiple requests.

Maybe we can maintain the compatibility of HttpClient and add the new feature to https://github.com/dotnet/runtimelab/tree/feature/LLHTTP2

pebezo commented 1 year ago

@gitlsl Do you have a concrete API proposal we can work through?

Python's network request library (requests httpx) is very convenient. Maybe we can learn from it

Do you have a direct comparison for this scenario?

With Python one can set/configure a proxy per-request at the time of request, for example:

import requests
url = "https://example.com"
proxy = "http://some-proxy.example.com"
proxies = {"http": proxy, "https": proxy}
response = requests.get(url, proxies=proxies, verify=False)

The issue I'm hitting is that the proxy information is NOT available at app startup. It may come from the database or some other way or it should be dynamically selected. As a general rule, if something cannot be configured dynamically (i.e. via a database query later) then it is too restrictive (i.e. not usable for a class of solutions).

The other issue is an architectural one: If I have to configure the proxy in one project while the actual usage is in another one then I'm spreading the logic into multiple places. It would be a lot cleaner (for some use cases) to isolate that logic to a single place.

NCLnclNCL commented 11 months ago

yes

vinaghost commented 1 month ago

Hi, .NET is about to reach version 9. Any update on this issue?

https://github.com/dotnet/extensions/issues/521#issuecomment-557943667

wfurt commented 1 month ago

this is not going to happen in 9.