killQuery needs luck with a TCP-balanced cluster

Hello,

Describe the bug

When killQuery runs, a TCP/k8s load balancer may direct the connection to a node that isn't running the to-be-killed query. The KILL succeeds anyway, but has no effect.

As a workaround, we will update our configuration to list the shards/nodes directly, at the cost of a more complex configuration.

To Reproduce

Declare a cluster with a load balancer address instead of a replica list
Run and cancel a long-running query
KILL runs on a different node than the initial query and is therefore ineffective

Expected behavior

Could you consider selecting the kill targets by initial_query_id instead? It would improve the chance of cutting resource consumption early.

OTOH, KILL QUERY ON CLUSTER {cluster} would require configuring/passing the "native" cluster name somewhere.
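To make the two options concrete, here is a minimal Go sketch of how the KILL statement could be built. It is not chproxy's actual code: `buildKillQuery`, `escapeLiteral`, and the cluster name `my_cluster` are hypothetical names used for illustration, and the conservative check on the cluster name is an assumption of this sketch. With an empty cluster name it only switches the filter to initial_query_id; with a cluster name it adds the ON CLUSTER clause.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// clusterNameRe is a deliberately conservative whitelist (an assumption of
// this sketch); real cluster names may need proper identifier quoting.
var clusterNameRe = regexp.MustCompile(`^[A-Za-z0-9_]+$`)

// escapeLiteral escapes backslashes and single quotes so the value can be
// embedded in a single-quoted ClickHouse string literal.
func escapeLiteral(s string) string {
	return strings.NewReplacer(`\`, `\\`, `'`, `\'`).Replace(s)
}

// buildKillQuery (hypothetical helper) builds a KILL QUERY statement that
// matches on initial_query_id. If cluster is non-empty, ON CLUSTER is added.
func buildKillQuery(queryID, cluster string) (string, error) {
	var b strings.Builder
	b.WriteString("KILL QUERY")
	if cluster != "" {
		if !clusterNameRe.MatchString(cluster) {
			return "", fmt.Errorf("unsupported cluster name %q", cluster)
		}
		b.WriteString(" ON CLUSTER " + cluster)
	}
	fmt.Fprintf(&b, " WHERE initial_query_id = '%s'", escapeLiteral(queryID))
	return b.String(), nil
}

func main() {
	// "my_cluster" is a placeholder, not a name taken from this report.
	for _, cluster := range []string{"", "my_cluster"} {
		q, err := buildKillQuery("17CF3EA10D4DDB62", cluster)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(q)
	}
}
```

Without ON CLUSTER the statement still only acts on whichever node the load balancer picks, so the first form only raises the odds of hitting a node that runs part of the distributed query; the second form needs the native cluster name to be configured somewhere.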

Environment information

For our production clusters, we supply applications with a keepalived-balanced endpoint. In chproxy config:

  scheme: https
  nodes:
  - lb-clickhouse.example:8443

Screenshots

DEBUG: 2024/05/16 18:14:56 proxy.go:84: [ Id: 17CF3EA10D4DDB62; User "u"(1) proxying as "p"(1) to "lb-clickhouse.example:8443"(1); RemoteAddr: "....
DEBUG: 2024/05/16 18:15:36 proxy.go:238: [ Id: 17CF3EA10D4DDB62; User "u"(1) proxying as "p"(1) to "lb-clickhouse.example:8443"(2); RemoteAddr: "..."; LocalAddr: "..."; Duration: 39435136 μs]: remote client closed the connection in 39.433581047s; query: "select ...
DEBUG: 2024/05/16 18:15:36 scope.go:256: killing the query with query_id=17CF3EA10D4DDB62
DEBUG: 2024/05/16 18:15:36 scope.go:296: killed the query with query_id=17CF3EA10D4DDB62; respBody: ""
DEBUG: 2024/05/16 18:15:36 proxy.go:156: [ Id: 17CF3EA10D4DDB62; User "u"(1) proxying as "p"(1) to "lb-clickhouse.example:8443"(2); RemoteAddr: "..."; LocalAddr: "..."; Duration: 39854873 μs]: request failure: non-200 status code 502; query: "select....FORMAT JSONCompact"; Method: POST; URL: "https://lb-clickhouse.example:8443/?max_execution_time=10800&max_memory_usage=42949672960&priority=4&query_id=17CF3EA10D4DDB62&result_overflow_mode=throw&session_timeout=60"

The KILL query ran on node ch3v, while the other nodes wasted time running the query to the end:

SELECT
    hostname,
    is_initial_query,
    type,
    event_time
FROM system.distributed_query_log
WHERE (event_date = '2024-05-16') AND (type > 1) AND (initial_query_id = '17CF3EA10D4DDB62')

┌─hostname─┬─is_initial_query─┬─type────────┬──────────event_time─┐
│ ch4v     │                0 │ QueryFinish │ 2024-05-16 19:43:50 │
│ ch5v     │                1 │ QueryFinish │ 2024-05-16 19:43:51 │
│ ch2v     │                0 │ QueryFinish │ 2024-05-16 19:43:50 │
└──────────┴──────────────────┴─────────────┴─────────────────────┘

Environment information

chproxy v1.19.0, clickhouse 22.8

thank you

(Sorry, I created the issue from a code line; I have since updated the description to follow the bug template.)

ContentSquare / chproxy

killQuery needs luck with a TCP-balanced cluster #434