Open LeaFrock opened 5 years ago
What should I do about this?
Same issue here. Consider this case: I have a web proxy pool and a thread that continuously adds new proxies to it. To use those proxies, I have to create HttpClient instances dynamically (take a proxy from the pool and create an HttpClient with it). How can I achieve this with HttpClientFactory?
Now how do you solve this problem?
See discussion here: https://github.com/aspnet/Extensions/issues/521#issuecomment-557943667
This is a fundamental limitation of HttpClient's design. We're going to look at a solution for this during 5.0
I ran into a similar problem related to contextual application of primary handler properties at runtime. I had a robust typed HttpClient with Polly and delegating handlers chained, but some property of the primary handler needed to be different based on a runtime attribute (i.e. different certificate based on URL, endpoint, user, etc.) and trying to register a half dozen versions of the same typed client pipeline did not make sense, even if I did want to set up all the handler configurations and certs within the composition root.
I wrote a library to extend the DefaultHttpClientFactory to resolve this: https://github.com/agertenbach/Ringleader https://www.nuget.org/packages/Ringleader/
It takes advantage of the fact that the primary handler management and pooling uses the name of the upstream named/typed client. By wrapping the options and tacking in an IHttpMessageHandlerBuilderFilter, we can parse out the original client name to resolve the client's pipeline and options, but then create/use a more granular primary handler in the handler pool based on some provided context when the typed client is requested. The library has two interfaces that have to be implemented, one to define the typed client and the context that you'll pass in (and translate it to a string), and then a second that can use that translated string to return a well-formed primary handler when a new one is requested by the DefaultHttpClientFactory.
Hope this helps -- feedback or comments welcome.
Tagging subscribers to this area: @dotnet/ncl Notify danmosemsft if you want to be subscribed.
I still think this is important but it doesn't seem like we have the time to address this in 5.0. What do you think @dotnet/ncl? cc @davidfowl
You can implement a custom IWebProxy today that does round-robin proxy selection. Would this work for you, or do you need very explicit control over which proxy each request gets?
Using IWebProxy is far from enough. For example, I want to remove unavailable proxies and to select better proxies as a spare proxy pool. That requires explicit control over each proxy and flexibility in the strategy developers choose.
Sorry to see that, but 'later' usually brings 'better'. The design does need careful thought.
Thanks for the example, that makes sense. We won't have anything for this in .NET 5, but can consider it for .NET 6. I'd love to see some more upvotes on the top comment to see interest.
This is why more people use python instead of c# to write crawlers, because .net httpclient is really lame.
@scalablecory please, please add this feature! I remember building a website parser. It was very confusing to find that there was no simple, straightforward way to do this. One of the requirements was to use a new proxy for each request. There was also a demand for adding proxies dynamically, which is an obvious need for web parsers. If the proxies stop working, does that mean I have to update the application config (which contains the list of proxies) and redeploy the app? My customer just asked me: "Give me an admin panel where I could add or remove proxies".
Dotnet is perfect, but web parsing with proxy is still convoluted.
Doing anything per-request is going to be expensive. This may force creation of new connection and that is basically similar to creating new HttpClient. Going back to initial comment from @LeaFrock: "If I have thousands of proxies" -> that is going to create thousands of connections regardless.
Would people mostly use multiple proxies for load balancing or is it functional e.g. different subdomains/destinations need different exit points?
that is going to create thousands of connections regardless.
Excuse me, that's not the point. The point is that, when you have many proxies, you must prefer only one connection to one proxy, which means reusable.
For now, if I have a proxy pool, I need to create an HttpClient instance before sending a request through a given proxy. Then, if the proxy fails, I have to dispose the instance and remove all references to it, right?
If the proxy works, I hope that the next time my app uses it, the previous connection can be reused where possible. So the problem becomes that I have to maintain an HttpClient pool, which may waste more resources.
If I just create a transient instance for a proxy every time, will HttpClient try to use the same connection? As far as I know, it will not, so connections may be created multiple times for requests through the same proxy. As I said above, this may lead to a socket exhaustion exception after the app has run for a while. Fortunately, we can now use SocketsHttpHandler and set PooledConnectionIdleTimeout and PooledConnectionLifetime to control the lifetime of pooled connections, which reduces the frequency of the problem.
In the past, I didn't know much about HttpClient and thought IHttpClientFactory could help. For now, I'm still wondering whether there is a better solution than what I describe above.
Suppose HttpClient (or something else) provided a built-in way of handling proxies. Since it is closer to the underlying connections, developers would get a simpler solution for building apps with better performance.
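A minimal sketch of the HttpClient pool described above, assuming one long-lived client per proxy URI. `ProxyClientCache` is a hypothetical name and the two timeout values are placeholders, not recommendations; only `SocketsHttpHandler`, `WebProxy`, `PooledConnectionIdleTimeout` and `PooledConnectionLifetime` come from .NET itself.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;

// Hypothetical helper, not a .NET API: keeps one long-lived HttpClient per proxy URI
// so that connections through a working proxy can be reused.
public sealed class ProxyClientCache : IDisposable
{
    // Lazy<T> ensures a client is built at most once per proxy, even under contention.
    private readonly ConcurrentDictionary<Uri, Lazy<HttpClient>> _clients =
        new ConcurrentDictionary<Uri, Lazy<HttpClient>>();

    // Returns the cached client for this proxy, creating it on first use.
    public HttpClient GetClient(Uri proxyUri) =>
        _clients.GetOrAdd(proxyUri, uri => new Lazy<HttpClient>(() => Create(uri))).Value;

    // Drops a failed proxy and disposes its client (and with it the pooled connections).
    public bool Remove(Uri proxyUri)
    {
        if (!_clients.TryRemove(proxyUri, out var lazy)) return false;
        if (lazy.IsValueCreated) lazy.Value.Dispose();
        return true;
    }

    private static HttpClient Create(Uri proxyUri)
    {
        var handler = new SocketsHttpHandler
        {
            Proxy = new WebProxy(proxyUri),
            UseProxy = true,
            PooledConnectionIdleTimeout = TimeSpan.FromMinutes(1), // assumed value
            PooledConnectionLifetime = TimeSpan.FromMinutes(5),    // assumed value
        };
        return new HttpClient(handler, disposeHandler: true);
    }

    public void Dispose()
    {
        foreach (var uri in _clients.Keys) Remove(uri);
    }
}
```

Calling `cache.GetClient(proxyUri)` twice with the same URI returns the same instance. Removing a proxy while requests are still in flight on its client will make those requests fail, so a real pool would need to coordinate removal with callers.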
Would people mostly use multiple proxies for...
.NET works well in almost all areas, and web crawling is a relatively niche one. What I reported is the most common development scenario for web crawlers, but I think .NET does not provide enough best-practice guidance for it.
I appreciate the problem, @LeaFrock.
One possibility is that we update HttpRequestMessage with a Proxy member. As is common with abstractions, this would mean any existing implementations would silently ignore it. Not horrible but also not ideal.
A workaround for my use case is potentially to make a new named client with different credentials set. Does anyone know what a reasonable limit for named clients would be? My use case could require around 50 and I am worried that would cause problems.
I don't think 50 clients would create a problem. This is far below common OS limits and GC capabilities. It should be fairly easy IMHO to create a test setup and try it @Thomas-GH-CA.
Thanks @wfurt for the response. Looking back, my comment was vague and short on detail, so apologies, but I just wanted to get an idea of whether 50 seemed wacky or not. It seems it isn't, so I will go ahead and try it out.
Generally, setting the proxy, cookie container or client certs requires managing handler instances on your own. You can no longer follow the guidance of "use a single handler for your application" as you're now storing "per user" state on the handler instance itself.
You can no longer follow the guidance of "use a single handler for your application" as you're now storing "per user" state on the handler instance itself.
This is true in essence. But I hope there will be some improvements here, to offload upper-level developers and help them move in the correct direction of best practices.
Does any workaround for this issue exist? It seems the options are either to re-create the client before making a new request or to change the WebProxy Address property before the next request. Is there anything wrong with the second approach?
I think if I add a client per proxy to IHttpClientFactory and switch the HttpClient instance used when needed, it will be fine.
@bugproof I did a POC on using a single HttpClient instance with multiple proxy configurations in a round-robin fashion, and it works perfectly. But I haven't tested it in my actual application yet.
@Vijay-Nirmal By changing the Address property or implementing IWebProxy?
@bugproof Implementing IWebProxy
The below code is just a quick POC, there might be issues in the code. Let me know if anyone knows how to improve the below code.
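    using System;
    using System.Collections.Generic;
    using System.Net;

    // Supplies the proxy for the next request. The interface was not shown in the
    // original post; this shape is assumed from how it is used below.
    public interface IProxyProvider
    {
        Uri? GetProxy();
    }

    // IWebProxy that asks the provider on every lookup, so a single handler rotates proxies.
    public class ManagedProxy : IWebProxy
    {
        private readonly IProxyProvider _proxyProvider;

        public ManagedProxy(IProxyProvider proxyProvider)
        {
            _proxyProvider = proxyProvider;
        }

        public ICredentials? Credentials { get; set; }

        public Uri? GetProxy(Uri destination)
        {
            return _proxyProvider.GetProxy();
        }

        // Never bypass: every destination goes through whatever the provider returns.
        public bool IsBypassed(Uri host)
        {
            return false;
        }
    }

    public class ProxyProvider : IProxyProvider
    {
        private readonly object _proxyLock = new object();
        private int _currentProxyIndex = 0;
        private readonly List<Uri?> _proxies = new List<Uri?>();

        public Uri? GetProxy()
        {
            lock (_proxyLock) // May affect the performance
            {
                if (_proxies.Count == 0) return null;
                // Clamp first: the list may have shrunk since the last call.
                _currentProxyIndex %= _proxies.Count;
                var proxy = _proxies[_currentProxyIndex];
                _currentProxyIndex = (_currentProxyIndex + 1) % _proxies.Count;
                return proxy;
            }
        }

        // Code to add proxies to the list. I had code to web scrape proxies from a website and add them to the list
    }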
@Vijay-Nirmal what does this accomplish? How does one set the proxy?
@davidfowl I had a background service that periodically added new proxies to the list, checked the list for bad proxies, and removed them.
what does this accomplish? How does one set the proxy?
I need to web scrape a website which has IP based rate limiter. So, using this method, I could overcome that :)
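A hedged sketch of how the background add/remove behaviour described above might look when folded into a single round-robin IWebProxy. `RotatingProxy`, its `Add`/`Remove` methods and `ClientFactory` are hypothetical names for illustration, not part of the POC above or of .NET; only `IWebProxy`, `SocketsHttpHandler` and `HttpClient` are real APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;

// Hypothetical type: a round-robin IWebProxy whose pool can change while the app runs.
public sealed class RotatingProxy : IWebProxy
{
    private readonly object _gate = new object();
    private readonly List<Uri> _proxies = new List<Uri>();
    private int _next;

    public ICredentials? Credentials { get; set; }

    // Called by a background job when a new proxy is discovered.
    public void Add(Uri proxy)
    {
        lock (_gate) { if (!_proxies.Contains(proxy)) _proxies.Add(proxy); }
    }

    // Called by a background job when a proxy is found to be bad.
    public bool Remove(Uri proxy)
    {
        lock (_gate) { return _proxies.Remove(proxy); }
    }

    // Returns the next proxy in rotation, or null when the pool is empty.
    public Uri? GetProxy(Uri destination)
    {
        lock (_gate)
        {
            if (_proxies.Count == 0) return null;
            _next %= _proxies.Count; // the list may have shrunk since the last call
            return _proxies[_next++];
        }
    }

    public bool IsBypassed(Uri host) => false;
}

public static class ClientFactory
{
    // One shared HttpClient; the proxy is chosen by the RotatingProxy on each lookup.
    public static HttpClient Create(RotatingProxy proxy) =>
        new HttpClient(new SocketsHttpHandler { Proxy = proxy, UseProxy = true });
}
```

This keeps the selection strategy in one place, but it still gives the caller no say over which proxy a particular request uses, which is the limitation raised earlier in the thread.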
I appreciate the problem. Can we add proxy and cookie to HttpRequestMessage?
Python's network request libraries (requests, httpx) are very convenient. Maybe we can learn from them.
@gitlsl Do you have a concrete API proposal we can work through?
Do you have a direct comparison for this scenario?
@davidfowl As others mentioned above, achieving this would be costly because it generates a lot of connections. In fact, I have not tested it. For this requirement, I usually use Python directly.
Here is an interesting phenomenon: many people in this thread are from the same eastern country (including me). If I remember correctly, the people who submitted the PR for SOCKS5 proxy support are also from this country.
For me, I wish the API looked like https://requests.readthedocs.io/en/latest/api/
header and proxy are properties of HttpRequestMessage; HttpClient just sends the request and constructs the response
Create a session from HttpClient to maintain cookies across multiple requests
Maybe we can maintain the compatibility of httpclient and add new feature into https://github.com/dotnet/runtimelab/tree/feature/LLHTTP2
With Python one can set/configure a proxy per-request at the time of request, for example:
import requests
url = "https://example.com"
proxy = "http://some-proxy.example.com"
proxies = {"http": proxy, "https": proxy}
response = requests.get(url, proxies=proxies, verify=False)
The issue I'm hitting is that the proxy information is NOT available at app startup. It may come from the database or some other way or it should be dynamically selected. As a general rule, if something cannot be configured dynamically (i.e. via a database query later) then it is too restrictive (i.e. not usable for a class of solutions).
The other issue is an architectural one: If I have to configure the proxy in one project while the actual usage is in another one then I'm spreading the logic into multiple places. It would be a lot cleaner (for some use cases) to isolate that logic to a single place.
yes
Hi, .NET is about to reach version 9. Any update on this issue?
https://github.com/dotnet/extensions/issues/521#issuecomment-557943667
this is not going to happen in 9.
I'm developing a web crawler framework. As usual, the web crawler needs a pool of proxies to send HTTP messages. For example.
As we all know, 'new HttpClient()' is not the recommended way to create clients, and creating too many clients causes socket exhaustion exceptions. Therefore, I want to use IHttpClientFactory instead. But there are still some problems. As far as I know, DefaultHttpClientFactory reuses HttpMessageHandler instances. But one handler must be created per proxy. If I have thousands of proxies, I have to create thousands of handlers. Also, I don't know how to use DI in this situation. I've searched Stack Overflow for solutions, like the one below,
That's too ugly, and I need to create clients dynamically because the number of proxies is unknown before running. I can't hard-code them. What should we do to enjoy the benefits of HttpClient (with thousands of web proxies) and avoid the side effects? I'm looking forward to suggestions and guidance. Thank you in advance.