Open · levonet opened this issue 4 years ago
Hi @levonet! Could you please verify the un-even request distribution by looking into the metrics exported by the proxy?
> un-even request distribution

I don't know what that means. Can you specify which metric to look at?
Sorry, uneven distribution.
The metrics to check are the following: host_health, host_penalties_total, and request_success_total.
In my config:
```yaml
clusters:
  - name: wdwh
    nodes:
      - 10.0.0.20:8123 # removed after 23:50
      - 10.0.0.22:8123
      - 10.0.0.82:8123
      - 10.0.0.83:8123
    kill_query_user:
      name: "{{ clickhouse_user }}"
      password: "{{ clickhouse_pswd }}"
    users:
      - name: "{{ clickhouse_user }}"
        password: "{{ clickhouse_pswd }}"
        max_concurrent_queries: 45
        max_execution_time: 10s
        max_queue_size: 1000000
        max_queue_time: 60s
    heartbeat:
      request: "/?query=show%20tables%20from%20dwh%20like%20'access\\_log\\_api\\_buffer'"
      response: "access_log_api_buffer\n"
```
Unfortunately I can't say much from the graphs. The balancing algorithm is not exactly round-robin; it is least-loaded round-robin. So if your CH nodes have different performance, or some queries are heavier than others, it is expected to see an unbalanced number of requests. You can try to reproduce the bug, and I'll be thankful to accept a PR. The tests described here may help to build a scenario showing the uneven distribution.
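For illustration, here is a minimal Go sketch of what "least-loaded round-robin" can mean; the types and names (`host`, `pickLeastLoaded`, the penalty field) are hypothetical and this is not chproxy's actual implementation:

```go
package main

import "fmt"

// host is a hypothetical cluster node with an in-flight request counter
// and a penalty that grows when the node misbehaves.
type host struct {
	addr    string
	load    uint32 // in-flight requests
	penalty uint32 // extra weight added on failures
}

// pickLeastLoaded returns the host with the smallest effective load
// (load + penalty), scanning from a rotating start index. With equal
// loads it degenerates to plain round-robin; with uneven query cost or
// penalized hosts the per-node request counts legitimately diverge.
func pickLeastLoaded(hosts []*host, next int) *host {
	best := hosts[next%len(hosts)]
	for i := range hosts {
		h := hosts[(next+i)%len(hosts)]
		if h.load+h.penalty < best.load+best.penalty {
			best = h
		}
	}
	return best
}

func main() {
	hosts := []*host{
		{addr: "10.0.0.22:8123"},
		{addr: "10.0.0.82:8123", penalty: 5}, // recently penalized
		{addr: "10.0.0.83:8123"},
	}
	for i := 0; i < 6; i++ {
		h := pickLeastLoaded(hosts, i)
		h.load++ // pretend every query is still running
		fmt.Println(h.addr)
	}
}
```

Under this scheme a slow or penalized node naturally receives fewer requests, so uneven request counts alone do not prove a bug.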
Hi @levonet, it's been a while. Do you still need help on this issue?
Hi @Garnek20, my concern is that what is actually used looks like "next-node" round-robin rather than round-robin or least-loaded round-robin: the load falls onto the node right after the failed one. At least that is the picture I see on the graphs.
If you think everything is working satisfactorily on your side, I agree to close this issue, since this problem has a low priority for us.
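For contrast with the sketch above, a naive skip-to-next failover would produce exactly this picture: the dead node's entire share lands on its immediate neighbour. A hypothetical sketch (again, not chproxy's actual code):

```go
package main

import "fmt"

// pickNext is plain round-robin that, when the chosen host is down,
// simply advances to the next index.
func pickNext(hosts []string, alive map[string]bool, next int) string {
	for i := 0; i < len(hosts); i++ {
		h := hosts[(next+i)%len(hosts)]
		if alive[h] {
			return h
		}
	}
	return "" // no healthy host left
}

func main() {
	hosts := []string{"10.0.0.20:8123", "10.0.0.22:8123", "10.0.0.82:8123", "10.0.0.83:8123"}
	alive := map[string]bool{hosts[0]: false, hosts[1]: true, hosts[2]: true, hosts[3]: true}

	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[pickNext(hosts, alive, i)]++
	}
	// hosts[1] absorbs all of hosts[0]'s share: ~500 requests,
	// versus ~250 each on hosts[2] and hosts[3].
	fmt.Println(counts)
}
```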
Hello, I have some issues with query spreading. One of the 3 nodes receives far more requests than the others (roughly an order of magnitude more, judging by request_success_total below). The CH servers have identical configuration. What could be wrong?
```
# TYPE host_health gauge
host_health{cluster="default",cluster_node="{host1}:8123",replica="default"} 1
host_health{cluster="default",cluster_node="{host2}:8123",replica="default"} 1
host_health{cluster="default",cluster_node="{host3}:8123",replica="default"} 1
# TYPE host_penalties_total counter
host_penalties_total{cluster="default",cluster_node="{host1}:8123",replica="default"} 1
host_penalties_total{cluster="default",cluster_node="{host2}:8123",replica="default"} 14
host_penalties_total{cluster="default",cluster_node="{host3}:8123",replica="default"} 2
# TYPE request_success_total counter
request_success_total{cluster="default",cluster_node="{host1}:8123",cluster_user="user",replica="default",user="user"} 465131
request_success_total{cluster="default",cluster_node="{host2}:8123",cluster_user="user",replica="default",user="user"} 5.441952e+06
request_success_total{cluster="default",cluster_node="{host3}:8123",cluster_user="user",replica="default",user="user"} 467559
```
The config is quite simple:
```yaml
server:
  http:
    listen_addr: "127.0.0.1:9090"
    allowed_networks: ["127.0.0.1/32"]

users:
  - name: "user"
    to_cluster: "default"
    to_user: "user"

clusters:
  - name: "default"
    nodes: ["{host1}:8123", "{host2}:8123", "{host3}:8123"]
    users:
      - name: "user"
```
Before 23:50 there were 4 nodes in the cluster, one of which was broken. After 23:50 there are 3 working nodes in the cluster. The graphs show an uneven distribution of queries: all queries from the first node moved to the next node. This behavior can lead to a cascading overload of all nodes.
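To make the cascading-overload concern concrete, here is a hypothetical back-of-the-envelope simulation; the capacity and load figures are illustrative, not measured:

```go
package main

import "fmt"

func main() {
	const capacity = 100.0            // hypothetical per-node capacity, queries/s
	load := []float64{70, 70, 70, 70} // each node comfortably below capacity

	// Node 0 breaks for an unrelated reason. Under "next node" failover
	// its entire load shifts to node 1 instead of being spread evenly.
	load[1] += load[0]
	load[0] = 0

	// Each overloaded node fails in turn and dumps its load on the next.
	for i := 1; i < len(load); i++ {
		if load[i] > capacity {
			fmt.Printf("node %d overloaded (%.0f qps > %.0f qps), fails\n", i, load[i], capacity)
			load[(i+1)%len(load)] += load[i]
			load[i] = 0
		}
	}
	fmt.Println("remaining load per node:", load)
}
```

With even redistribution each survivor would carry 70 + 70/3 ≈ 93 qps and stay under capacity; with next-node failover every node in this sketch fails in turn.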