Hello,

Describe the bug
When killQuery happens, a TCP/k8s load balancer may direct the connection to a node that isn't running the to-be-killed query. The KILL succeeds anyway, but it is not effective.
As a fix on our side, we will update our configuration to list the shards/nodes directly instead, trading some extra complexity there.
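For reference, a rough sketch of the node-list configuration we have in mind; the ch1v/ch2v hostnames are placeholders (only ch3v is a node name that actually appears below):

    scheme: https
    nodes:
      # placeholder hostnames - list every ClickHouse replica explicitly
      # instead of the load-balancer endpoint
      - ch1v.example:8443
      - ch2v.example:8443
      - ch3v.example:8443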
To Reproduce
1. Declare a cluster with the load balancer address instead of a replica list.
2. Run and then cancel a long-running query.
3. KILL runs on a different node than the initial query did, and is therefore ineffective.
Expected behavior
Could you consider selecting the kill targets by initial_query_id instead? It would improve the chances of cutting off resource consumption early. On the other hand, KILL QUERY ON CLUSTER {cluster} would require configuring/passing the "native" cluster name somewhere.
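A rough sketch of what such a statement could look like; the cluster name is a placeholder, and the query_id is the one from the logs below:

    -- 'default_cluster' is a placeholder; chproxy would need the real cluster name from somewhere
    KILL QUERY ON CLUSTER default_cluster
    WHERE initial_query_id = '17CF3EA10D4DDB62'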
Environment information
chproxy v1.19.0, ClickHouse 22.8.
For our production clusters, we supply applications with a keepalived-balanced endpoint. In the chproxy config:

    scheme: https
    nodes:
      - lb-clickhouse.example:8443
Screenshots
DEBUG: 2024/05/16 18:14:56 proxy.go:84: [ Id: 17CF3EA10D4DDB62; User "u"(1) proxying as "p"(1) to "lb-clickhouse.example:8443"(1); RemoteAddr: "....
DEBUG: 2024/05/16 18:15:36 proxy.go:238: [ Id: 17CF3EA10D4DDB62; User "u"(1) proxying as "p"(1) to "lb-clickhouse.example:8443"(2); RemoteAddr: "..."; LocalAddr: "..."; Duration: 39435136 μs]: remote client closed the connection in 39.433581047s; query: "select ...
DEBUG: 2024/05/16 18:15:36 scope.go:256: killing the query with query_id=17CF3EA10D4DDB62
DEBUG: 2024/05/16 18:15:36 scope.go:296: killed the query with query_id=17CF3EA10D4DDB62; respBody: ""
DEBUG: 2024/05/16 18:15:36 proxy.go:156: [ Id: 17CF3EA10D4DDB62; User "u"(1) proxying as "p"(1) to "lb-clickhouse.example:8443"(2); RemoteAddr: "..."; LocalAddr: "..."; Duration: 39854873 μs]: request failure: non-200 status code 502; query: "select....FORMAT JSONCompact"; Method: POST; URL: "https://lb-clickhouse.example:8443/?max_execution_time=10800&max_memory_usage=42949672960&priority=4&query_id=17CF3EA10D4DDB62&result_overflow_mode=throw&session_timeout=60"
The KILL query ran on node ch3v, while the other nodes wasted time running the query to completion.
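For what it's worth, a query of roughly this shape can be used to confirm which replica actually executed (or cancelled) a given query_id; the cluster name is again a placeholder:

    -- 'default_cluster' is a placeholder for the real cluster name
    SELECT hostName() AS host, type, query_duration_ms
    FROM clusterAllReplicas('default_cluster', system.query_log)
    WHERE query_id = '17CF3EA10D4DDB62'
      AND type != 'QueryStart'
    ORDER BY host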
Thank you!
(Sorry, I originally created the issue from a code line; I have updated the description according to the bug template.)