Open MaceWindu opened 2 years ago
Also filed an issue for CH (https://github.com/ClickHouse/ClickHouse/issues/39374), as I don't see anything wrong in the provider implementation
Hi @MaceWindu and @DarkWanderer, I ran into the same issue a while ago when running integration tests against containerized ClickHouse. It might be related to the CI environment, as both server and client were running on a single-core VM. The workaround I took was to retry until a timeout (along with an option to turn it off), which seems to have fixed the random test failures.
ClickHouse only supports one query at a time within a session. My understanding is that the server may not release resources immediately for various reasons (e.g. the client didn't close the response stream in time, or the server is simply too slow to fulfil the previous client request). Hope this can be of use :)
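For illustration, the one-query-per-session rule is easy to hit directly against the HTTP interface, bypassing the provider entirely. A minimal C# sketch (host, port, and session name are assumptions, not taken from this thread):

```csharp
using System.Net.Http;
using System.Threading.Tasks;

// Fire two concurrent queries that share one session_id. The server
// processes only one of them at a time; while the first sleep(1) is
// running, the second request is rejected because the session is busy.
var http = new HttpClient();
var url = "http://localhost:8123/?session_id=demo-session&query=SELECT%20sleep(1)";

var first = http.GetAsync(url);
var second = http.GetAsync(url);
var responses = await Task.WhenAll(first, second);
// Typically one response is 200 OK and the other is an error body
// mentioning the locked session.
```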
ClickHouse inserts are synchronous by default. As far as I can see from the client code, the bulk copy implementation doesn't check the operation status (which is done for regular commands).
Just to give an update. I've figured out the cause of the error - ClickHouse doesn't like it when calls from the same session come from different HTTP connections, even if the queries never run concurrently. The workaround is "simple" - just pass an `HttpClient` with a single-connection `HttpClientHandler`:

```csharp
var handler = new HttpClientHandler
{
    AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
    MaxConnectionsPerServer = 1,
};
```
Fixing that within the library is a more complex story, however - I'm thinking about how to do it in a fashion that is both reasonably backward-compatible and actually resolves this issue
@DarkWanderer the problem with that approach is that requests will be sent strictly sequentially, since only one connection per server will be used. That means no concurrent requests at all - one long-running query will block every other request until it finishes.
It would be great if .NET allowed selecting an HTTP connection from the connection pool by some condition (for example, by a hash of the CH session id), but it seems there is no way to do that.
Shouldn’t it be fixed on CH side? Maybe it’s just a bug in CH.
@nike61, note that we are talking about session-aware connections here, which already cannot be parallelized
@MaceWindu yes, I just wanted to point out that the suggested `MaxConnectionsPerServer = 1` setting will make all queries issued through the same HttpClient sequential. Queries will not be sent in parallel, only one by one.
So if you register HttpClient as a singleton in your app (and this is a best-practice approach), all queries from different sessions will run sequentially across all sessions, which may decrease the throughput of your system a lot. The ClickHouse client will use the same HttpClient for all sessions if it's registered as a singleton.
Having a singleton `HttpClient` stopped being a recommendation after .NET Core 2.1. The currently recommended approach - `IHttpClientFactory` - in fact creates a new `HttpClient` for each request. The pooling is done in `HttpClientHandler` - and this is how `ClickHouseClient` currently works by default: there is a static instance of `HttpClientHandler` which is shared between instances of `ClickHouseConnection`
If you create a separate `HttpClient` for each connection, with its own instance of `HttpClientHandler`:

```csharp
var handler = new HttpClientHandler { MaxConnectionsPerServer = 1 };
var client = new HttpClient(handler, /*disposeHandler*/ true);
var connection = new ClickHouseConnection("UseSessions=true", client);
```

you will be able to execute parallel queries in different connections, as expected.
This is the approach I'm taking in #222 - if sessions are enabled, use an "owned"/dedicated `HttpClientHandler` and `HttpClient` per connection
Thank you for your answer!
Official docs state that a singleton HttpClient is one of the best-practice approaches - https://learn.microsoft.com/en-us/dotnet/fundamentals/networking/http/httpclient-guidelines So I can imagine someone registering HttpClient as a singleton and passing it to the library. Anyway, even if it's registered using the factory, all HttpClient instances will share the same connection pool, right? So if I set MaxConnectionsPerServer = 1 using the standard ASP.NET approach with the factory, it will make all queries across all sessions sequential.
(The .NET team fixed all DNS-related issues in HttpClient, which is why a static HttpClient is now one of the best-practice approaches. The only reason to use the factory approach is to easily use features such as named clients or Polly.)
Will your solution with a separate HttpClient reuse TCP connections for different sessions? I can imagine a system that generates many queries. In that case it will create a new handler and a new connection pool for each query, meaning the system will not use connection pooling at all and will open a new connection for every request. Or maybe I'm missing something.
@DarkWanderer Sorry, I have one more question. How did you know that ClickHouse doesn't work well with the same session coming from different TCP connections? Did you find an issue in the ClickHouse repository? We are working with Altinity, so maybe we can ask them to help us, but I need more details.
> Thank you for your answer!
>
> Official docs state that a singleton HttpClient is one of the best-practice approaches - https://learn.microsoft.com/en-us/dotnet/fundamentals/networking/http/httpclient-guidelines
Fair enough. "One of" seems to be the critical part here. The docs seem a bit self-contradictory on this, given how `IHttpClientFactory` works. Personally I don't think it is best practice - the issue discussed here being one of the reasons
> So I can imagine someone registering HttpClient as a singleton and passing it to the library. Anyway, even if it's registered using the factory, all HttpClient instances will share the same connection pool, right? So if I set MaxConnectionsPerServer = 1 using the standard ASP.NET approach with the factory, it will make all queries across all sessions sequential.
That was not my advice, however. Of course someone can do that - but I cannot reasonably prevent every possible misuse of advanced capabilities
> (The .NET team fixed all DNS-related issues in HttpClient, which is why a static HttpClient is now one of the best-practice approaches. The only reason to use the factory approach is to easily use features such as named clients or Polly.)
>
> Will your solution with a separate HttpClient reuse TCP connections for different sessions? I can imagine a system that generates many queries. In that case it will create a new handler and a new connection pool for each query, meaning the system will not use connection pooling at all and will open a new connection for every request. Or maybe I'm missing something.
To be honest, what you're describing sounds like just another example of misuse. If one needs multiple, parallel, transient connections, why would sessions be needed? Sessions are the exact opposite of transient - they are a server-side "single-thread" limitation for a long-lived connection. Can you provide a valid use case for a transient connection that uses the sessions feature?
> @DarkWanderer Sorry, I have one more question. How did you know that ClickHouse doesn't work well with the same session coming from different TCP connections? Did you find an issue in the ClickHouse repository? We are working with Altinity, so maybe we can ask them to help us, but I need more details.
It's a very simple fact - limiting the number of connections to 1 per server fixes the issue. I'm sure it can be reproduced using `curl` or another tool.
@DarkWanderer thanks, I should have explained our case in more detail.
We have an ASP.NET application that handles many concurrent users. Each user sends requests to our API, and a single API request typically generates 2-4 sequential queries to CH. These queries share the same session, and some of them create temporary tables that should be deleted automatically when the session ends.
Our HttpClient is registered using factory.
The main question is: how do we properly configure the library so that our HttpClient doesn't suffer from socket exhaustion or DNS-change problems?
As another use case for sessions, I can imagine using a transaction inside a session, so that all my 2-4 queries share the same consistent view of the data. Transactions for SELECT are not ready yet, but as I understand it they will use sessions.
I've found this closed issue on GitHub, but it seems it wasn't actually fixed: https://github.com/ClickHouse/ClickHouse/issues/4003
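To make the use case above concrete, here is a sketch of what those session-scoped queries look like through the library (the connection string values and table name are assumptions for illustration):

```csharp
using ClickHouse.Client.ADO;

// Several sequential queries share one session, so the temporary table
// created by the first command is visible to later commands on the same
// connection and is dropped by the server when the session ends.
using var connection = new ClickHouseConnection("Host=localhost;UseSessions=true");
connection.Open();

using var create = connection.CreateCommand();
create.CommandText = "CREATE TEMPORARY TABLE tmp_ids (id UInt64)";
create.ExecuteNonQuery();

using var select = connection.CreateCommand();
select.CommandText = "SELECT count() FROM tmp_ids";
var count = select.ExecuteScalar();
```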
In 6.2.0, session mode no longer uses the shared connection pool and instead uses a separate `HttpClientHandler` and `HttpClient` per connection
@nike61 this is a valid use case, and I understand your concern about socket exhaustion. The only proper solution I can see at the moment is to introduce library-level connection pooling, which is a significant undertaking I wanted to avoid. Alternative ideas are very much welcome
Recently I looked into the HttpClient and handler docs and recommendations. One option that is only briefly mentioned, and that the Azure client uses, is `SocketsHttpHandler`, which `HttpClientHandler` uses internally. You wouldn't rely on the consumer to pass in an `HttpClient` and/or handler, as you would have an internal one that manages sockets for you based on the endpoint name. You would need a list of all the connections to pool them, and `SocketsHttpHandler` can also drop a connection after a configured time to handle DNS changes. .NET 5 has more extensibility for the sockets handler as well; 3.1 doesn't, but offers similar functionality. I'll look around in the connection code, though I don't know of an easy way to test any of those changes without a complicated container setup. The SqlConnection pool classes have a lot of history in them, which makes me think there are edge cases that only time will reveal in any connection-pool class.
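A minimal sketch of such a library-owned handler (all timeout values are assumptions; `SocketsHttpHandler` is available from .NET Core 2.1):

```csharp
using System;
using System.Net.Http;

// PooledConnectionLifetime recycles pooled connections periodically, so DNS
// changes are eventually picked up; the long-lived shared pool avoids the
// socket exhaustion that comes from creating a handler per request.
var handler = new SocketsHttpHandler
{
    PooledConnectionLifetime = TimeSpan.FromMinutes(5),    // re-resolve DNS over time
    PooledConnectionIdleTimeout = TimeSpan.FromMinutes(2), // drop idle sockets
    MaxConnectionsPerServer = 1, // only needed for session-bound endpoints
};
var client = new HttpClient(handler);
```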
I'm not sure if it is an upstream issue in ClickHouse or some races in the provider code, so I will file it here for now.

The following code fails with a `Session is locked by a concurrent client` error for me after ~200 iterations, which is quite annoying as it fails the test run at random locations...

CH version: 22.6.3.35
Sessions: enabled