ClickHouse / clickhouse-js

Official JS client for ClickHouse DB
https://clickhouse.com
Apache License 2.0

Keep-Alive degrades performance #196

Closed kiliangrashoff closed 5 months ago

kiliangrashoff commented 1 year ago

Describe the bug

Using keep-alive with clickhouse-js degrades performance of concurrent queries compared to disabling keep-alive.

Steps to reproduce

  1. Clone this minimal reproduction example
  2. Run it with the instructions in the readme. Output for me:
    Average response time with keep alive true: 13.6ms
    Average response time with keep alive false: 8.6ms

Numbers vary quite a bit, but keep alive false is consistently faster. Adding an await result.json() to the example improves the performance of keep alive true, but this is not what we observe with real workloads.

With a real workload, the average response time of our API degrades from 95ms to 120ms when keep-alive is enabled. We configured keep_alive with the following settings and a keep_alive_timeout of 60s on the server; this does not fix the issue.

keep_alive: {
  enabled: true,
  socket_ttl: 55000,
  retry_on_expired_socket: true,
}

Expected behaviour

Enabling keep-alive should result in better performance.

Executing a SELECT 1 query twice with curl to time the HTTP transaction over an open TCP connection results in a 0.6ms response time for the second request. I would expect the JS client to be able to query over an open connection with a similar response time:

Curl command:
curl --write-out 'lookup:        %{time_namelookup}\nconnect:       %{time_connect}\nappconnect:    %{time_appconnect}\npretransfer:   %{time_pretransfer}\nredirect:      %{time_redirect}\nstarttransfer: %{time_starttransfer}\ntotal:         %{time_total}\n'  localhost:8123/?query=SELECT%201 localhost:8123/?query=SELECT%201

Output:
1
lookup:        0.000031
connect:       0.000210
appconnect:    0.000000
pretransfer:   0.000278
redirect:      0.000000
starttransfer: 0.017353
total:         0.017902
1
lookup:        0.000009
connect:       0.000000
appconnect:    0.000000
pretransfer:   0.000023
redirect:      0.000000
starttransfer: 0.000445
total:         0.000553

Code example

import { ClickHouseClient, createClient } from '@clickhouse/client'
import pLimit from 'p-limit'

const host = 'http://localhost:8123'
const username = 'default'
const password = undefined

const requests = 1000
const workers = 20

async function main() {
  await runPerformanceTest(true)
  await runPerformanceTest(false)
}

async function runPerformanceTest(keep_alive: boolean) {
  const client = await createClickHouseClient(keep_alive)
  const averageResponseTime = await measureAverageResponseTime(client)
  console.log(`Average response time with keep alive ${keep_alive}: ${averageResponseTime}ms`)
  await client.close()
}

async function measureAverageResponseTime(client: ClickHouseClient) {
  const limit = pLimit(workers)
  const queries = Array.from({length: requests}, () => limit(async () => {
    const start = process.hrtime()
    await client.query({
      query: 'SELECT 1',
      format: 'JSONEachRow'
    })
    // hrtime returns [seconds, nanoseconds]; include both components so
    // responses longer than one second are not silently truncated
    const [s, ns] = process.hrtime(start)
    return s * 1000 + ns / 1000000
  }))
  const responseTimes = await Promise.all(queries)
  return responseTimes.reduce((a: number, b: number) => a + b, 0) / responseTimes.length
}

async function createClickHouseClient(keep_alive: boolean) {
  return createClient({
    host,
    username,
    password,
    keep_alive: {
      enabled: keep_alive
    },
    compression: {
      response: false,
      request: false
    }
  })
}

main().catch((err) => {
  console.error(err)
  process.exit(1)
})

Note: the example depends on p-limit version 3.1.0 for concurrency control; the easiest way to run it is to clone the repo I linked.

slvrtrn commented 1 year ago

Interesting, thanks for reporting.

Adding an await result.json() to the example improves the performance of keep alive true

I think that's because the underlying socket is only released and returned to the pool once the stream is consumed or destroyed.
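
To illustrate (a minimal sketch, assuming the ResultSet API with json() and stream() as in the current Node client): the socket goes back to the pool once the body is consumed or the stream is destroyed.

const rs = await client.query({ query: "SELECT 1", format: "JSONEachRow" });
await rs.json(); // consuming the body releases the socket back to the pool

// or, when the body is not needed at all:
const rs2 = await client.query({ query: "SELECT 1", format: "JSONEachRow" });
rs2.stream().destroy(); // discarding the stream should release the socket too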

You could also try reducing max_connections in config.xml, since by default it is set to 1024.

For example:

config.xml

  <max_connections>10</max_connections>

index.ts

import { ClickHouseClient, createClient } from "@clickhouse/client";
import pLimit from "p-limit";

const host = "http://localhost:8123";
const username = "default";
const password = undefined;

const requests = 1000;
const workers = 20;

async function main() {
  await runPerformanceTest(true);
  await runPerformanceTest(false);
}

async function runPerformanceTest(keep_alive: boolean) {
  const client = await createClickHouseClient(keep_alive);
  const averageResponseTime = await measureAverageResponseTime(client);
  console.log(
    `Average response time with keep alive ${keep_alive}: ${averageResponseTime}ms`
  );
  await client.close();
}

async function measureAverageResponseTime(client: ClickHouseClient) {
  const limit = pLimit(workers);
  const queries = Array.from({ length: requests }, () =>
    limit(async () => {
      const start = process.hrtime();
      const rs = await client.query({
        query: "SELECT * FROM system.tables", // around 30 tables here
        format: "JSONEachRow",
      });
      await rs.json();
      const [s, ns] = process.hrtime(start); // include seconds, not just ns
      return s * 1000 + ns / 1000000;
    })
  );
  const responseTimes = await Promise.all(queries);
  return (
    responseTimes.reduce((a: number, b: number) => a + b, 0) / responseTimes.length
  );
}

async function createClickHouseClient(keep_alive: boolean) {
  return createClient({
    host,
    username,
    password,
    keep_alive: {
      enabled: keep_alive,
    },
    compression: {
      response: false,
      request: false,
    },
  });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

I ran a few quick tests on my machine (Linux, Node v16.17.0)

With the default max_connections value (1024), KeepAlive is significantly faster, but overall it's pretty slow in both scenarios:

➜  clickhouse-js-performance git:(main) ✗ yarn start
[INFO] 18:14:59 ts-node-dev ver. 2.0.0 (using ts-node ver. 10.9.1, typescript ver. 5.1.6)
Average response time with keep alive true: 229.34405645900026ms
Average response time with keep alive false: 281.2647160240004ms

With max_connections set to 10:

[INFO] 18:55:42 ts-node-dev ver. 2.0.0 (using ts-node ver. 10.9.1, typescript ver. 5.1.6)
Average response time with keep alive true: 83.33827950300005ms
Average response time with keep alive false: 156.89417462500006ms

Here's a simplified version of what the client does under the hood, running similar code:

import Http from "http";
import pLimit from "p-limit";
import Stream from "stream";

const requests = 1000;
const workers = 20;

async function main() {
  await runPerformanceTest(true);
  await runPerformanceTest(false);
}

async function runPerformanceTest(keepAlive: boolean) {
  const agent = new Http.Agent({
    keepAlive,
  });
  const averageResponseTime = await measureAverageResponseTime(agent);
  console.log(
    `Average response time with keep alive ${keepAlive}: ${averageResponseTime}ms`
  );
  agent.destroy();
}

async function measureAverageResponseTime(agent: Http.Agent) {
  const limit = pLimit(workers);
  const headers = {
    Authorization: `Basic ${Buffer.from("default:").toString("base64")}`,
  };
  const queries: Promise<number>[] = Array.from({ length: requests }, (_, i) =>
    limit(async () => {
      const start = process.hrtime();
      return new Promise((resolve, reject) => {
        const bodyStream = Stream.Readable.from([
          "SELECT * FROM system.tables FORMAT JSONEachRow",
        ]);
        const request = Http.request("http://localhost:8123", {
          method: "POST",
          agent,
          headers,
        });
        request.on("response", async (response) => {
          if (response.statusCode !== 200) {
            console.error(`Rejected with ${response.statusCode}`);
            reject(new Error("request failed"));
            return; // don't fall through and resolve a failed request
          }
          await getAsText(response);
          // hrtime returns [seconds, nanoseconds]; include both components
          const [s, ns] = process.hrtime(start);
          resolve(s * 1000 + ns / 1000000);
        });
        Stream.pipeline(bodyStream, request, () => {});
      });
    })
  );
  const responseTimes = await Promise.all(queries);
  return (
    responseTimes.reduce((a: number, b: number) => a + b, 0) / responseTimes.length
  );
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

async function getAsText(stream: Stream.Readable): Promise<string> {
  let result = "";
  const textDecoder = new TextDecoder();

  for await (const chunk of stream) {
    result += textDecoder.decode(chunk, { stream: true });
  }

  // flush
  result += textDecoder.decode();
  return result;
}

The results are in line with what we see when using ClickHouseClient itself:

[INFO] 18:54:32 ts-node-dev ver. 2.0.0 (using ts-node ver. 10.9.1, typescript ver. 5.1.6)
Average response time with keep alive true: 81.46500756599997ms
Average response time with keep alive false: 154.66927978100017ms

kiliangrashoff commented 1 year ago

Thank you for the quick response @slvrtrn! :pray:

I think that's because the underlying socket is only released and returned to the pool once the stream is consumed or destroyed.

This is not the behavior I expected; I thought the socket would be released as soon as the complete response was received. I suppose await query() only awaits sending the request, and releasing the socket is blocked until a handler for the response is registered with await json(), await text(), or by consuming the stream. Is that correct? It would be nice if the socket were released immediately, but I guess the response data would need to be buffered somewhere, and I don't know whether that is efficiently possible. It would also mask the problem of the response handler not being created in time.

Your suspicion that the socket is not released seems to be confirmed if we convert the example (without calling json() or text()) to a sequential loop and set max_open_connections to n: only the first n requests complete, then the process blocks. I suspect that the performance regression in our production setup may be caused by not immediately awaiting json() after awaiting the query. I will test whether that is correct and get back to you when I know more; it may take some time before I get around to this.
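
Roughly, the sequential variant I mean (a sketch reusing the names from my example above; n stands for the configured max_open_connections):

// with keep-alive enabled and max_open_connections set to n, this stalls
// after n iterations: each un-consumed stream keeps its socket checked out
for (let i = 0; i < requests; i++) {
  await client.query({
    query: 'SELECT 1',
    format: 'JSONEachRow'
  })
  // deliberately no json()/text()/stream consumption here
  console.log(`completed request ${i + 1}`)
}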

If this is the intended behavior and keeping the socket occupied is unavoidable, it would be useful to document that users should immediately call json() or text(), or consume the stream, after calling await query(), to avoid blocking sockets in scenarios with a large number of requests.

You could also try reducing max_connections in config.xml, since by default it is set to 1024.

In production, we limited the number of open connections by setting max_open_connections in the client; I left it out of this example to keep it minimal. Is there an advantage to also limiting it on the server? I think setting this too low would cause errors if clients open too many connections, but I haven't tested it yet.

With the default max_connections value (1024), KeepAlive is significantly faster, but overall it's pretty slow in both scenarios:

I can confirm that if I add await response.json(), KeepAlive is slightly faster in the example:

Average response time with keep alive true: 8.3ms
Average response time with keep alive false: 9.0ms

However, based on the curl results (I added these later to my issue, see above), I would expect results close to 0.6ms. If we reduce the number of workers to 2, we get results much closer to curl's, so the slowdown seems to be a result of concurrent execution:

Average response time with keep alive true: 0.8ms
Average response time with keep alive false: 1.6ms

Do you know why this slowdown occurs, and if there is a way to improve it?

slvrtrn commented 1 year ago

By design, query and exec will hold the socket until the stream is either consumed or destroyed; otherwise, there is no way to stream the data. await query(...) essentially awaits only the response status, i.e. whether the request succeeded or not; the actual data arrives only when iterating over the stream or calling json or text, both of which consume the stream under the hood.

It is designed this way because we want to be able to stream an arbitrary amount of data without running into OOM issues. This is also why we don't do any buffering in advance.

command and insert will release their sockets immediately, as we do not expect anything useful in those responses.
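
For illustration, here is how a large result can be consumed with bounded memory (a sketch, assuming the Node client's stream() yields batches of rows, and a client instance as in the examples above):

const rs = await client.query({
  query: "SELECT number FROM system.numbers LIMIT 10000000",
  format: "JSONEachRow",
});
let count = 0;
// iterate the stream instead of buffering the entire result in memory;
// the socket is released once the stream has been fully consumed
for await (const rows of rs.stream()) {
  count += rows.length;
}
console.log(`streamed ${count} rows`);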

From what I have seen during my limited testing time, without restricting the maximum number of connections, the client seems to open as many connections as possible instead of reusing the existing sockets in the pool.

I think setting this too low would cause errors if clients open too many connections, but I haven't tested it yet.

I did not encounter any issues with 10 max connections and, for example, 20 workers, but YMMV.

den-crane commented 1 year ago

From what I have seen during my limited testing time, without restricting the maximum number of connections, the client seems to open as many connections as possible instead of reusing the existing sockets in the pool.

I am sorry for interfering (I have zero knowledge of JS).

But I think it can be limited on the client side (I just googled it): https://nodejs.org/api/http.html#new-agentoptions https://stackoverflow.com/questions/13649972/how-to-set-maximum-http-client-connections-in-node-js
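
From those links, something like this should cap it on the client side (a sketch based on the Node docs; untested, as I said):

import Http from "http";

// keep at most 10 sockets open per host and reuse them between requests
const agent = new Http.Agent({
  keepAlive: true,
  maxSockets: 10, // analogous to the client's max_open_connections
  maxFreeSockets: 10, // idle sockets retained in the pool
});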

slvrtrn commented 1 year ago

@den-crane, yes, we allow setting it: https://github.com/ClickHouse/clickhouse-js/blob/main/packages/client-node/src/connection/node_http_connection.ts#L16-L19
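
For reference, it is applied at client creation, e.g. (a sketch; max_open_connections is the setting the link above points to):

const client = createClient({
  host: "http://localhost:8123",
  max_open_connections: 10, // caps the sockets of the underlying HTTP agent
  keep_alive: { enabled: true },
});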

However, I am getting odd results with it unless the server configuration is modified as well.

These are the results with unlimited server connections and max_open_connections set to 10:

➜  clickhouse-js-performance git:(main) ✗ yarn start
[INFO] 17:31:47 ts-node-dev ver. 2.0.0 (using ts-node ver. 10.9.1, typescript ver. 5.1.6)
Average response time with keep alive false: 164.19364965099996ms
Average response time with keep alive true: 178.51174290500038ms

These are the results with server connections set to 10 in config.xml and without max_open_connections:

➜  clickhouse-js-performance git:(main) ✗ yarn start
[INFO] 17:33:26 ts-node-dev ver. 2.0.0 (using ts-node ver. 10.9.1, typescript ver. 5.1.6)
Average response time with keep alive false: 150.59355878000014ms
Average response time with keep alive true: 84.82651798200006ms

Both set to 10:

➜  clickhouse-js-performance git:(main) ✗ yarn start
[INFO] 17:32:55 ts-node-dev ver. 2.0.0 (using ts-node ver. 10.9.1, typescript ver. 5.1.6)
Average response time with keep alive false: 161.78359237400005ms
Average response time with keep alive true: 162.11750517099998ms

I have zero knowledge with JS

The more I work with it, the more I feel the same :)

kiliangrashoff commented 1 year ago

We did indeed set max_open_connections; otherwise, the client would open too many connections under high load, causing "too many simultaneous queries" errors and putting such a heavy load on the ClickHouse server that it became unresponsive to other queries. I'll test whether setting the limit only on the server side improves keep-alive performance for us.

slvrtrn commented 11 months ago

@kiliangrashoff, if you run inserts from your code as well, could you please check whether 0.2.3 improves insert performance with keep-alive in your scenarios?