apache / incubator-pegasus

Apache Pegasus - A horizontally scalable, strongly consistent and high-performance key-value store
https://pegasus.apache.org/
Apache License 2.0

feat: Supports node black list for load balance #1985

Open lengyuexuexuan opened 2 months ago

lengyuexuexuan commented 2 months ago

What problem does this PR solve?

#1976

In addition, this PR fixes two issues in cluster_balance_policy.

  1. https://github.com/apache/incubator-pegasus/blob/cd1682d5e2e4668f4263073d5ddb04b8bd7574c4/src/meta/cluster_balance_policy.cpp#L201-L202 std::move(info) is called before the variable info is used, so the skew is computed from a moved-from object and is wrong.

  2. https://github.com/apache/incubator-pegasus/blob/cd1682d5e2e4668f4263073d5ddb04b8bd7574c4/src/meta/load_balance_policy.h#L113 https://github.com/apache/incubator-pegasus/blob/cd1682d5e2e4668f4263073d5ddb04b8bd7574c4/src/meta/greedy_load_balancer.h#L77-L78

The variable _balancer_ignored_apps is not static, so _app_balance_policy and _cluster_balance_policy each hold a separate _balancer_ignored_apps. As a result, setting _balancer_ignored_apps only takes effect on _app_balance_policy.
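The difference between a per-object member and a shared one can be sketched in a few lines. This is a minimal illustration, not the Pegasus code: the class names policy_per_instance and policy_shared and the use of int app ids are assumptions made for the example, and a static member is only one possible way to share the set.

```cpp
#include <set>

// Hypothetical policy class: each object owns its own ignored-app set.
class policy_per_instance {
public:
    void ignore_app(int app_id) { _ignored_apps.insert(app_id); }
    bool is_ignored(int app_id) const { return _ignored_apps.count(app_id) > 0; }

private:
    std::set<int> _ignored_apps; // non-static: one copy per object
};

// Hypothetical alternative: a static member gives one set for all objects.
class policy_shared {
public:
    static void ignore_app(int app_id) { _ignored_apps.insert(app_id); }
    static bool is_ignored(int app_id) { return _ignored_apps.count(app_id) > 0; }

private:
    static std::set<int> _ignored_apps; // static: one copy for the whole class
};

// Out-of-class definition required for a non-inline static data member.
std::set<int> policy_shared::_ignored_apps;
```

With two policy_per_instance objects, an app ignored through one is not ignored by the other, which mirrors the behavior described above. With policy_shared, every object sees the same set. Real code would likely also need locking around the shared set, which this sketch omits.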

Both issues are fixed in this PR.
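For the first issue, the ordering problem can be reproduced outside Pegasus. The sketch below is an assumption-laden illustration: app_info_sketch, skew_of, store_then_measure, and measure_then_store are hypothetical names, and skew is simplified to the difference between the largest and smallest replica count.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-app data the balancer collects.
struct app_info_sketch {
    int app_id = 0;
    std::vector<int> replica_counts; // replicas held by each node
};

// Simplified skew: max minus min replica count, 0 for empty input.
int skew_of(const app_info_sketch &info) {
    if (info.replica_counts.empty()) {
        return 0;
    }
    auto mm = std::minmax_element(info.replica_counts.begin(),
                                  info.replica_counts.end());
    return *mm.second - *mm.first;
}

// Problematic order: info is moved into the container, then read.
// The moved-from vector is still valid but no longer holds the original
// counts (it is typically empty), so the skew no longer reflects the data.
int store_then_measure(std::vector<app_info_sketch> &out, app_info_sketch info) {
    out.push_back(std::move(info));
    return skew_of(info); // use after move
}

// Corrected order: read everything needed first, move last.
int measure_then_store(std::vector<app_info_sketch> &out, app_info_sketch info) {
    int skew = skew_of(info);
    out.push_back(std::move(info));
    return skew;
}
```

The general rule the sketch shows is to compute every value derived from an object before handing that object to std::move.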

What is changed and how does it work?

Adds the command meta.lb.ignored_nodes_list <get|set|clear> [node_addr1,node_addr2,...], which supports the get, set, and clear subcommands. The number of blacklisted nodes must not exceed the number of alive_nodes minus 2; otherwise balancing is not possible.
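The size constraint on the blacklist can be expressed as a small check. This is a sketch under stated assumptions, not the validation code in the PR: split_node_list and blacklist_size_ok are hypothetical helper names, and addresses are treated as plain strings.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a comma-separated address list such as "host1:port,host2:port".
// Empty items (for example from a trailing comma) are skipped.
std::vector<std::string> split_node_list(const std::string &arg) {
    std::vector<std::string> nodes;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (!item.empty()) {
            nodes.push_back(item);
        }
    }
    return nodes;
}

// True when the blacklist leaves at least two alive nodes to balance across,
// i.e. blacklisted <= alive - 2. The alive >= 2 guard avoids unsigned underflow.
bool blacklist_size_ok(std::size_t blacklisted, std::size_t alive) {
    return alive >= 2 && blacklisted <= alive - 2;
}
```

For a six-node cluster such as the onebox setup used in the manual test, blacklisting two nodes passes the check, while blacklisting five does not.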

Checklist

Tests

- Unit test (cluster balance)
    export GTEST_FILTER=meta.cluster_balancer_nodes_blacklist_test
    ./run.sh test -m dsn.meta.test


- Manual test (add detailed scripts or steps below)
1. Build an unbalanced cluster with onebox.
    Restart nodes and run the command remote_command -t meta-server meta.lb.assign_secondary_black_list $address_list
2. Set the node blacklist.
3. Run the command set_meta_level lively.
  - app_balance_policy
  The initial state of the cluster is:
   ![image](https://github.com/apache/incubator-pegasus/assets/46274877/4d98354e-6549-4ff6-aa8c-b74127b8edd7)
  Blacklist 172.17.0.2:34801 and 172.17.0.2:34806, then run load balancing. The final state is:
   ![image (1)](https://github.com/apache/incubator-pegasus/assets/46274877/139712c5-9302-4903-a776-92594b1e4536)
  The partition counts of the two blacklisted nodes, 172.17.0.2:34801 and 172.17.0.2:34806, did not change, while the other four nodes reached a balanced state. After clearing ignored_nodes_list and balancing again, the result is:
  ![image](https://github.com/apache/incubator-pegasus/assets/46274877/e7fee2f7-a68d-441a-90e7-67d44081559e)
  - cluster_balance_policy
  The initial state of the cluster is:
  ![image (1)](https://github.com/apache/incubator-pegasus/assets/46274877/6587456e-19b5-4002-a165-db71e1cb8812)
  Blacklist 172.17.0.2:34801 and 172.17.0.2:34806, then run load balancing. The final state is:
  ![image](https://github.com/apache/incubator-pegasus/assets/46274877/46382308-fd90-4f44-838d-de21b02e29bf)
  The partition counts of the two blacklisted nodes, 172.17.0.2:34801 and 172.17.0.2:34806, did not change, while the other four nodes reached a balanced state. After clearing ignored_nodes_list and balancing again, the result is:
  ![image (1)](https://github.com/apache/incubator-pegasus/assets/46274877/a905426e-a87f-416c-bd93-4d72f81daccb)