Open · levonet opened this issue 4 years ago
Hi @levonet! Could you please verify the un-even request distribution by looking into the metrics exported by the proxy?
> un-even request distribution

I don't know what that means. Can you specify which metric to look at?
Sorry, uneven distribution.
The metrics to check are the following: host_health, host_penalties_total, and request_success_total.
In my config:
```yaml
clusters:
  - name: wdwh
    nodes:
      - 10.0.0.20:8123 # removed after 23:50
      - 10.0.0.22:8123
      - 10.0.0.82:8123
      - 10.0.0.83:8123
    kill_query_user:
      name: "{{ clickhouse_user }}"
      password: "{{ clickhouse_pswd }}"
    users:
      - name: "{{ clickhouse_user }}"
        password: "{{ clickhouse_pswd }}"
        max_concurrent_queries: 45
        max_execution_time: 10s
        max_queue_size: 1000000
        max_queue_time: 60s
    heartbeat:
      request: "/?query=show%20tables%20from%20dwh%20like%20'access\\_log\\_api\\_buffer'"
      response: "access_log_api_buffer\n"
```
Unfortunately I can't say much from the graphs. The balancing algorithm is not exactly round-robin; it is least-loaded round-robin. So if your CH nodes have different performance, or some queries are heavier than others, it is expected to see an unbalanced number of requests. You can try to reproduce the bug, and I'll be thankful to accept a PR. The tests described here may help to build a scenario showing the uneven distribution.
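For illustration, here is a minimal Go sketch of what "least-loaded round-robin" can mean; the types and names (`host`, `pickLeastLoaded`, the penalty field) are hypothetical and this is not chproxy's actual implementation:

```go
package main

import "fmt"

// host is a hypothetical cluster node with an in-flight request counter
// and a penalty that grows when the node misbehaves.
type host struct {
	addr    string
	load    uint32 // in-flight requests
	penalty uint32 // extra weight added on failures
}

// pickLeastLoaded returns the host with the smallest effective load
// (load + penalty), scanning from a rotating start index. With equal
// loads it degenerates to plain round-robin; with uneven query cost or
// penalized hosts the per-node request counts legitimately diverge.
func pickLeastLoaded(hosts []*host, next int) *host {
	best := hosts[next%len(hosts)]
	for i := range hosts {
		h := hosts[(next+i)%len(hosts)]
		if h.load+h.penalty < best.load+best.penalty {
			best = h
		}
	}
	return best
}

func main() {
	hosts := []*host{
		{addr: "10.0.0.22:8123"},
		{addr: "10.0.0.82:8123", penalty: 5}, // recently penalized
		{addr: "10.0.0.83:8123"},
	}
	for i := 0; i < 6; i++ {
		h := pickLeastLoaded(hosts, i)
		h.load++ // pretend every query is still running
		fmt.Println(h.addr)
	}
}
```

Under this scheme a slow or penalized node naturally receives fewer requests, so uneven request counts alone do not prove a bug.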
Hi @levonet, it's been a while. Do you still need help on this issue?
Hi @Garnek20, my concern is that what is actually used looks like "next-node" round-robin rather than round-robin or least-loaded round-robin: the load falls onto the node right after the failed one. At least that is the picture I see on the graphs.
If you think everything is working satisfactorily on your side, I agree to close this issue, since this problem has a low priority for us.
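For contrast with the sketch above, a naive skip-to-next failover would produce exactly this picture: the dead node's entire share lands on its immediate neighbour. A hypothetical sketch (again, not chproxy's actual code):

```go
package main

import "fmt"

// pickNext is plain round-robin that, when the chosen host is down,
// simply advances to the next index.
func pickNext(hosts []string, alive map[string]bool, next int) string {
	for i := 0; i < len(hosts); i++ {
		h := hosts[(next+i)%len(hosts)]
		if alive[h] {
			return h
		}
	}
	return "" // no healthy host left
}

func main() {
	hosts := []string{"10.0.0.20:8123", "10.0.0.22:8123", "10.0.0.82:8123", "10.0.0.83:8123"}
	alive := map[string]bool{hosts[0]: false, hosts[1]: true, hosts[2]: true, hosts[3]: true}

	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[pickNext(hosts, alive, i)]++
	}
	// hosts[1] absorbs all of hosts[0]'s share: ~500 requests,
	// versus ~250 each on hosts[2] and hosts[3].
	fmt.Println(counts)
}
```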
Hello, I have some issues with query spreading. One of the 3 nodes receives far more requests than the others (roughly an order of magnitude more, judging by request_success_total below). The CH servers have identical configuration. What could be wrong?
```
# TYPE host_health gauge
host_health{cluster="default",cluster_node="{host1}:8123",replica="default"} 1
host_health{cluster="default",cluster_node="{host2}:8123",replica="default"} 1
host_health{cluster="default",cluster_node="{host3}:8123",replica="default"} 1
# TYPE host_penalties_total counter
host_penalties_total{cluster="default",cluster_node="{host1}:8123",replica="default"} 1
host_penalties_total{cluster="default",cluster_node="{host2}:8123",replica="default"} 14
host_penalties_total{cluster="default",cluster_node="{host3}:8123",replica="default"} 2
# TYPE request_success_total counter
request_success_total{cluster="default",cluster_node="{host1}:8123",cluster_user="user",replica="default",user="user"} 465131
request_success_total{cluster="default",cluster_node="{host2}:8123",cluster_user="user",replica="default",user="user"} 5.441952e+06
request_success_total{cluster="default",cluster_node="{host3}:8123",cluster_user="user",replica="default",user="user"} 467559
```
The config is quite simple:
```yaml
server:
  http:
    listen_addr: "127.0.0.1:9090"
    allowed_networks: ["127.0.0.1/32"]

users:
  - name: "user"
    to_cluster: "default"
    to_user: "user"

clusters:
  - name: "default"
    nodes: ["{host1}:8123", "{host2}:8123", "{host3}:8123"]
    users:
      - name: "user"
```
Before 23:50 there were 4 nodes in the cluster, one of which was broken. After 23:50 there are 3 working nodes in the cluster. The graphs show an uneven distribution of queries: all queries from the first node moved to the next node. This behavior can lead to a cascading overload of all nodes.
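To make the cascading-overload concern concrete, here is a hypothetical back-of-the-envelope simulation; the capacity and load figures are illustrative, not measured:

```go
package main

import "fmt"

func main() {
	const capacity = 100.0            // hypothetical per-node capacity, queries/s
	load := []float64{70, 70, 70, 70} // each node comfortably below capacity

	// Node 0 breaks for an unrelated reason. Under "next node" failover
	// its entire load shifts to node 1 instead of being spread evenly.
	load[1] += load[0]
	load[0] = 0

	// Each overloaded node fails in turn and dumps its load on the next.
	for i := 1; i < len(load); i++ {
		if load[i] > capacity {
			fmt.Printf("node %d overloaded (%.0f qps > %.0f qps), fails\n", i, load[i], capacity)
			load[(i+1)%len(load)] += load[i]
			load[i] = 0
		}
	}
	fmt.Println("remaining load per node:", load)
}
```

With even redistribution each survivor would carry 70 + 70/3 ≈ 93 qps and stay under capacity; with next-node failover every node in this sketch fails in turn.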