ClickHouse / clickhouse-java

ClickHouse Java Clients & JDBC Driver
https://clickhouse.com
Apache License 2.0
1.43k stars 526 forks source link

When spawning more than 16 requests with Java 11 HTTP client: com.clickhouse.client.ClickHouseException: Code: 159. Execution timed out #1741

Open devurandom opened 1 month ago

devurandom commented 1 month ago

Describe the bug

When spawning more than 16 requests in parallel to a ClickHouse server using the Java HTTP client, all requests will hang and eventually time out (unless I also set MAX_THREADS_PER_CLIENT=1 or greater):

#error {
 :cause nil
 :via
 [{:type java.util.concurrent.ExecutionException
   :message com.clickhouse.client.ClickHouseException: Code: 159. Execution timed out
   :at [java.util.concurrent.FutureTask report FutureTask.java 122]}
  {:type com.clickhouse.client.ClickHouseException
   :message Code: 159. Execution timed out
   :at [com.clickhouse.client.ClickHouseException of ClickHouseException.java 147]}
  {:type java.util.concurrent.TimeoutException
   :message nil
   :at [java.util.concurrent.CompletableFuture timedGet CompletableFuture.java 1960]}]
 :trace
 [[java.util.concurrent.CompletableFuture timedGet CompletableFuture.java 1960]
  [java.util.concurrent.CompletableFuture get CompletableFuture.java 2095]
  [com.clickhouse.client.ClickHouseClient executeAndWait ClickHouseClient.java 878]
  [ch_timeout_mwe$execute_BANG_ invokeStatic ch_timeout_mwe.clj 34]
  [ch_timeout_mwe$execute_BANG_ invoke ch_timeout_mwe.clj 31]
  [ch_timeout_mwe$ping invokeStatic ch_timeout_mwe.clj 40]
  [ch_timeout_mwe$ping invoke ch_timeout_mwe.clj 36]
  [ch_timeout_mwe$_main$iter__158__162$fn__163$fn__164$fn__165 invoke ch_timeout_mwe.clj 56]
  [clojure.core$binding_conveyor_fn$fn__5823 invoke core.clj 2047]
  [clojure.lang.AFn call AFn.java 18]
  [java.util.concurrent.FutureTask run FutureTask.java 317]
  [java.util.concurrent.ThreadPoolExecutor runWorker ThreadPoolExecutor.java 1144]
  [java.util.concurrent.ThreadPoolExecutor$Worker run ThreadPoolExecutor.java 642]
  [java.lang.Thread run Thread.java 1583]]}

Afterwards the application is broken; the only way to make requests to ClickHouse possible again is by restarting the JVM.

Repeatedly sending 16 or fewer requests does not exhibit the hanging requests.

Steps to reproduce

  1. Follow the instructions in https://github.com/devurandom/clickhouse-timeout-mwe to run the sample application.
  2. Observe that everything works fine when spawning 16 or less requests, or when passing a value greater than 0 as MAX_THREADS_PER_CLIENT.
  3. Observe that the application hangs and all requests to ClickHouse throw java.util.concurrent.TimeoutException when spawning 17 or more requests while passing MAX_THREADS_PER_CLIENT as 0.

Expected behaviour

ClickHouse Java HTTP client works, regardless of how many requests I tell it to make at the same time.

Code example

See https://github.com/devurandom/clickhouse-timeout-mwe.

Configuration

Environment

ClickHouse server

CameronMukherjee commented 1 month ago

I believe I am facing the same issue, using Spring Boot Webflux (Reactor).

When creating the ClickHouseNode and ClickHouseClient as a bean:

@Configuration
public class DatabaseDriverConfig {
 @Bean
    public ClickHouseNode nodeFactory() {
        return ClickHouseNode.builder()
          .host(clickhouse.getHost())
          .port(ClickHouseProtocol.HTTP, clickhouse.getPort())
          .database(clickhouse.getDatabase())
          .credentials(ClickHouseCredentials.fromUserAndPassword(
            clickhouse.getUsername(),
            clickhouse.getPassword()
          ))
          .build();
    }

    @Bean
    public ClickHouseClient clientFactory(ClickHouseNode node) {
        return ClickHouseClient.builder()
          .option(ClickHouseClientOption.CLIENT_NAME, "query-engine-api")
          // the below response has async enabled - I have omitted it from this builder.
          .nodeSelector(ClickHouseNodeSelector.of(node.getProtocol()))
          .build();
    }
}

And then using it to make calls to Clickhouse:

public Mono<PageInfo> fetchPageInfo(UUID cid, Integer pageNo) {
        final String q = """
          SELECT COALESCE(COUNT(*), 0) as total_elements
          FROM seer.event
          """;
        return Mono.fromFuture(client.read(node)
            .format(ClickHouseFormat.RowBinaryWithNamesAndTypes)
            .query(q)
            .execute())
          .map(ClickHouseResponse::firstRecord)
          .map(x -> {
              int totalElements = x.getValue("total_elements").asInteger();
              return PageInfo.builder()
                .pageNo(pageNo)
                .totalPages(totalElements / pageNo)
                .totalElements(totalElements)
                .build();
          })
          .onErrorResume(NoSuchElementException.class, e -> Mono.just(PageInfo.builder()
            .pageNo(pageNo)
            .totalPages(0)
            .totalElements(0)
            .build()));
    }

The first 10 requests work, the moment I make the 11th request the connection times-out and is no longer usable. The only way to get the client to work again is to restart the API.

The error looks like the following:

2024-07-21T15:04:01.580+01:00 ERROR 71784 --- [query-engine-api] [kHouseWorker-21] a.w.r.e.AbstractErrorWebExceptionHandler : [47bd0242-11]  500 Server Error for HTTP GET "/event?start=2024-01-01T13:13:13&end=2024-05-01T13:13:13&pageNo=1&limit=500"

com.clickhouse.client.ClickHouseException: Timeout deadline: 180000 MILLISECONDS, actual: 180007 MILLISECONDS, server ClickHouseNode [uri=http://localhost:8123/seer]@-737830262
    at com.clickhouse.client.ClickHouseException.of(ClickHouseException.java:168) ~[clickhouse-client-0.6.0.jar:clickhouse-client 0.6.0 (revision: cb8e198)]
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Error has been observed at the following site(s):
    *__checkpoint ⇢ Handler com.hexploits.seer.api.queryengine.controller.EventController#fetchEvents(UUID, UUID, String, ServerWebExchange) [DispatcherHandler]
    *__checkpoint ⇢ HTTP GET "/event?start=2024-01-01T13:13:13&end=2024-05-01T13:13:13&pageNo=1&limit=500" [ExceptionHandlingWebHandler]
Original Stack Trace:
        at com.clickhouse.client.ClickHouseException.of(ClickHouseException.java:168) ~[clickhouse-client-0.6.0.jar:clickhouse-client 0.6.0 (revision: cb8e198)]
        at com.clickhouse.client.AbstractClient.lambda$execute$0(AbstractClient.java:275) ~[clickhouse-client-0.6.0.jar:clickhouse-client 0.6.0 (revision: cb8e198)]
        at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768) ~[na:na]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
        at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]
Caused by: org.apache.hc.core5.http.ConnectionRequestTimeoutException: Timeout deadline: 180000 MILLISECONDS, actual: 180007 MILLISECONDS
    at org.apache.hc.client5.http.impl.classic.InternalExecRuntime.acquireEndpoint(InternalExecRuntime.java:120) ~[httpclient5-5.2.3.jar:5.2.3]
    at org.apache.hc.client5.http.impl.classic.ConnectExec.execute(ConnectExec.java:125) ~[httpclient5-5.2.3.jar:5.2.3]
    at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51) ~[httpclient5-5.2.3.jar:5.2.3]
    at org.apache.hc.client5.http.impl.classic.ProtocolExec.execute(ProtocolExec.java:192) ~[httpclient5-5.2.3.jar:5.2.3]
    at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51) ~[httpclient5-5.2.3.jar:5.2.3]
    at org.apache.hc.client5.http.impl.classic.HttpRequestRetryExec.execute(HttpRequestRetryExec.java:96) ~[httpclient5-5.2.3.jar:5.2.3]
    at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51) ~[httpclient5-5.2.3.jar:5.2.3]
    at org.apache.hc.client5.http.impl.classic.RedirectExec.execute(RedirectExec.java:115) ~[httpclient5-5.2.3.jar:5.2.3]
    at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51) ~[httpclient5-5.2.3.jar:5.2.3]
    at org.apache.hc.client5.http.impl.classic.InternalHttpClient.doExecute(InternalHttpClient.java:170) ~[httpclient5-5.2.3.jar:5.2.3]
    at org.apache.hc.client5.http.impl.classic.CloseableHttpClient.execute(CloseableHttpClient.java:123) ~[httpclient5-5.2.3.jar:5.2.3]
    at com.clickhouse.client.http.ApacheHttpConnectionImpl.post(ApacheHttpConnectionImpl.java:241) ~[clickhouse-http-client-0.6.0.jar:clickhouse-http-client 0.6.0 (revision: cb8e198)]
    at com.clickhouse.client.http.ClickHouseHttpClient.send(ClickHouseHttpClient.java:118) ~[clickhouse-http-client-0.6.0.jar:clickhouse-http-client 0.6.0 (revision: cb8e198)]
    at com.clickhouse.client.AbstractClient.sendAsync(AbstractClient.java:161) ~[clickhouse-client-0.6.0.jar:clickhouse-client 0.6.0 (revision: cb8e198)]
    at com.clickhouse.client.AbstractClient.lambda$execute$0(AbstractClient.java:273) ~[clickhouse-client-0.6.0.jar:clickhouse-client 0.6.0 (revision: cb8e198)]
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
    at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]
CameronMukherjee commented 1 month ago

When spawning more than 16 requests in parallel to a ClickHouse server using the Java HTTP client, all requests will hang and eventually time out (unless I also set MAX_THREADS_PER_CLIENT=1 or greater, the default being 0):

As a heads up - the following does not work for me:

 @Bean
    public ClickHouseClient clientFactory(ClickHouseNode node) {
        return ClickHouseClient.builder()
          .option(ClickHouseClientOption.CLIENT_NAME, "query-engine-api")
          .nodeSelector(ClickHouseNodeSelector.of(node.getProtocol()))
          .option(ClickHouseClientOption.ASYNC, true)
          .option(ClickHouseClientOption.MAX_THREADS_PER_CLIENT, 1)
          .build();
    }

I've also tried setting the max threads per client to 5, 10 and 15 - with none of them working as intended. The moment the 11th request comes through the request times-outs.

CameronMukherjee commented 1 month ago

Removing the implementation 'org.apache.httpcomponents.client5:httpclient5:5.2.3' dependency fixes this issue as you fall back to HTTP_URL_CONNECTION which doesn't have this issue.

Hope this helps someone else!

devurandom commented 1 month ago

Removing the implementation 'org.apache.httpcomponents.client5:httpclient5:5.2.3' dependency fixes this issue as you fall back to HTTP_URL_CONNECTION which doesn't have this issue.

I am using the Java 11 HTTP client, cf. https://github.com/devurandom/clickhouse-timeout-mwe/blob/88a48525a6ef68b21cb9dbea0fe0d5019f3ac36b/src/ch_timeout_mwe.clj#L20-L23.

chernser commented 1 month ago

Good day, @devurandom @CameronMukherjee ! The problem looks like when response object is not closed. It will keep connection open and would not let it return to a connection pool (that is in apache HC built-in). With HttpURLConnection it will create unlimited number of connection and may eventually fail by open file descriptors.

devurandom commented 1 month ago

Update: Setting ClickHouseClientOption/MAX_THREADS_PER_CLIENT 2 (instead of the default of 0), which makes the problem disappear in my minimal example, does not seem to workaround the issue in our larger application. It is still unclear to me why.

Removing the implementation 'org.apache.httpcomponents.client5:httpclient5:5.2.3' dependency fixes this issue as you fall back to HTTP_URL_CONNECTION which doesn't have this issue.

I can confirm that changing https://github.com/devurandom/clickhouse-timeout-mwe/blob/88a48525a6ef68b21cb9dbea0fe0d5019f3ac36b/src/ch_timeout_mwe.clj#L20-L23 from {ClickHouseHttpOption/CONNECTION_PROVIDER HttpConnectionProvider/HTTP_CLIENT} to {ClickHouseHttpOption/CONNECTION_PROVIDER HttpConnectionProvider/HTTP_URL_CONNECTION} also works around the issue in my minimal example. (I did not try it in our service for the reason given by @chernser.)

The problem looks like when response object is not closed.

So this is related to https://github.com/ClickHouse/clickhouse-java/issues/1712 ?

It will keep connection open and would not let it return to a connection pool (that is in apache HC built-in).

My minimal example reproduces this without the Apache HTTP client. It uses the Java 11 HTTP client instead.

With HttpURLConnection it will create unlimited number of connection and may eventually fail by open file descriptors.

So setting {ClickHouseHttpOption/CONNECTION_PROVIDER HttpConnectionProvider/HTTP_URL_CONNECTION} is not a workaround that would work in a long running application or service. Thanks for this clarification!

devurandom commented 1 month ago

The machine I reproduced https://github.com/devurandom/clickhouse-timeout-mwe on has 8 as "processors" reported by (.availableProcessors (Runtime/getRuntime)), while our actual application runs on a machine / in a container with 2 "processors" (cf. https://opentelemetry.io/docs/specs/semconv/runtime/jvm-metrics/#metric-jvmcpucount).

This seems relevant in the context of com.clickhouse.client.http.ClickHouseHttpClient#newConnection which uses the executor created by com.clickhouse.client.AbstractClient#init via com.clickhouse.client.ClickHouseClient#getExecutorService and com.clickhouse.data.ClickHouseDataStreamFactory.DefaultExecutors with (+ 1 (* 2 (.availableProcessors (Runtime/getRuntime)))) threads if MAX_THREADS_PER_CLIENT is unset (i.e. 0).

Setting -XX:ActiveProcessorCount=2 indeed changes the number of requests ClickHouse can handle without hanging:

(.availableProcessors (Runtime/getRuntime))
=> 2

(-main "${ENDPOINT}" "${USERNAME}" "${PASSWORD}" 2 0)

Spawning threads...
Wait for all threads:
Waiting for thread 0
Waiting for thread 1
Collected all threads!
=> nil

(-main "${ENDPOINT}" "${USERNAME}" "${PASSWORD}" 4 0)

Spawning threads...
Wait for all threads:
Waiting for thread 0
Waiting for thread 1
Waiting for thread 2
Waiting for thread 3
Collected all threads!
=> nil

(-main "${ENDPOINT}" "${USERNAME}" "${PASSWORD}" 5 0)

Spawning threads...
Wait for all threads:
Waiting for thread 0
ERROR: #error {
 :cause nil
 :via
 [{:type java.util.concurrent.ExecutionException
   :message com.clickhouse.client.ClickHouseException: Code: 159. Execution timed out
   :at [java.util.concurrent.FutureTask report FutureTask.java 122]}
  {:type com.clickhouse.client.ClickHouseException
   :message Code: 159. Execution timed out
   :at [com.clickhouse.client.ClickHouseException of ClickHouseException.java 147]}
  {:type java.util.concurrent.TimeoutException
   :message nil
   :at [java.util.concurrent.CompletableFuture timedGet CompletableFuture.java 1960]}]
 :trace
 [[java.util.concurrent.CompletableFuture timedGet CompletableFuture.java 1960]
  [java.util.concurrent.CompletableFuture get CompletableFuture.java 2095]
  [com.clickhouse.client.ClickHouseClient executeAndWait ClickHouseClient.java 878]
  [ch_timeout_mwe$execute_BANG_ invokeStatic ch_timeout_mwe.clj 34]
  [ch_timeout_mwe$execute_BANG_ invoke ch_timeout_mwe.clj 31]
  [ch_timeout_mwe$ping invokeStatic ch_timeout_mwe.clj 40]
  [ch_timeout_mwe$ping invoke ch_timeout_mwe.clj 36]
  [ch_timeout_mwe$_main$iter__1943__1947$fn__1948$fn__1949$fn__1950 invoke ch_timeout_mwe.clj 56]
  [clojure.core$binding_conveyor_fn$fn__5823 invoke core.clj 2047]
  [clojure.lang.AFn call AFn.java 18]
  [java.util.concurrent.FutureTask run FutureTask.java 264]
  [java.util.concurrent.ThreadPoolExecutor runWorker ThreadPoolExecutor.java 1136]
  [java.util.concurrent.ThreadPoolExecutor$Worker run ThreadPoolExecutor.java 635]
  [java.lang.Thread run Thread.java 840]]}
...

The -main function is the one in https://github.com/devurandom/clickhouse-timeout-mwe.

devurandom commented 1 month ago

We are now using java.util.concurrent.Semaphore to limit the number of request we submit to the ClickHouse client to 2 times the number of processors ((* 2 (.availableProcessors (Runtime/getRuntime)))) and our service finally sends out requests, receives responses and runs stably.

chernser commented 1 month ago

@devurandom Thank you for the information!

devurandom commented 1 month ago

I used the word "deadlock" in my previous comment which might have caused confusion. I should have written "hang" (or "block" or similar). I corrected that in my original comment above.

Do you close Response objects?

In the minimal example, if a response would be received instead of a timeout exception, the response would be closed, cf. https://github.com/devurandom/clickhouse-timeout-mwe/blob/main/src/ch_timeout_mwe.clj#L31-L34.

The problem is that no response is ever received by the code of the minimal example, because there seems to be a problem within the ClickHouse client that prevents a request from ever being processed. The ClickHouse client throws a timeout exception before it ever returns a response.

Am I right that deadlock happens if task is submitted to the executor service when no idle workers?

I cannot answer this question, as I do not know enough about the inner workings of the ClickHouse client. I just stepped through the ClickHouse client code with a debugger and took notes.

Would you please describe your high-level logic? where do you lock?

In the minimal example, no locks are used. It first spawns a bunch of futures on Clojure's own thread pool (not shared with e.g. the Java 11 HTTP client or the ClickHouse client) and then waits for them to complete, cf. https://github.com/devurandom/clickhouse-timeout-mwe/blob/main/src/ch_timeout_mwe.clj#L55-L63.