basho / riak_core

Distributed systems infrastructure used by Riak.

Leave and attempt_simple_transfer #970

Closed martinsumner closed 1 year ago

martinsumner commented 2 years ago

In the recent refactoring of cluster changes - https://github.com/basho/riak_core/pull/913 and https://github.com/basho/riak_core/pull/967 - the focus has been on the behaviour of join operations, not leave.

Recently a customer achieved a balanced ring through a join (as expected with the new cluster claim algorithm), but leave plans then kept creating unbalanced rings, i.e. rings in which partitions were unevenly distributed, leading to the potential for "one slow node" problems.

...

[EDIT] There were various incorrect statements made initially here about how leave works. See the later comments for a more accurate representation of the problem ...

There are perhaps some simple things that can be done:

  1. Perhaps a configuration option to force rebalance on leave. This could then be enabled for users of location. It could perhaps also be enabled if repeated re-planning is not leading to balanced outcomes.
  2. attempt_simple_transfer/4 could check how many partitions are already owned by a candidate node, and prefer candidate nodes with lower levels of existing ownership. This should be more likely to give balanced results (although it will do nothing in terms of location awareness) - see the sketch after this list.
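
As a rough illustration of (2), a minimal sketch of how candidate preference might work - this is not the actual attempt_simple_transfer/4 code, and the function and variable names here are hypothetical; it only assumes the [{Index, Node}] ownership list that riak_core_ring:all_owners/1 returns:

%% Sketch: given the candidate nodes that are safe targets for an index
%% (with respect to target_n_val), prefer those that currently own the
%% fewest partitions rather than choosing purely at random.
-spec preferred_candidates([node()], [{non_neg_integer(), node()}]) -> [node()].
preferred_candidates(Candidates, Owners) ->
    Counts =
        lists:foldl(
            fun({_Idx, N}, Acc) ->
                maps:update_with(N, fun(C) -> C + 1 end, 1, Acc)
            end,
            maps:new(),
            Owners),
    Scored = [{maps:get(N, Counts, 0), N} || N <- Candidates],
    {MinCount, _} = lists:min(Scored),
    [N || {C, N} <- Scored, C =:= MinCount].

A random choice among the nodes returned would then keep some of the existing randomness while biasing towards balance.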

There be demons here. I don't think there's a lot of existing test coverage of leave scenarios (riak_test/src/rt.erl has a staged_join/1 function but no staged_leave/1). There is also the potential for confusing situations if the configuration setting in (1) changes between nodes, or between staging and committing changes.
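
If a staged_leave/1 helper were added, it might look something like the following sketch (untested; it assumes riak_core_claimant:leave_member/1 is the right remote entry point for staging a leave, mirroring how staged_join/1 drives a staged join over rpc):

%% Sketch of a staged_leave/1 for riak_test/src/rt.erl: stage a leave for
%% Node without committing, so leave scenarios can be planned and
%% committed from tests in the same way as staged joins.
staged_leave(Node) ->
    ok = rpc:call(Node, riak_core_claimant, leave_member, [Node]),
    ok.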

martinsumner commented 2 years ago

@systream

martinsumner commented 2 years ago

The above isn't based on a proper understanding of how leave works.

Take an 8-node cluster, and attempt a leave:

=============================== Staged Changes ================================
Action         Details(s)
-------------------------------------------------------------------------------
leave          'dev7@127.0.0.1'
-------------------------------------------------------------------------------

NOTE: Applying these changes will result in 2 cluster transitions

###############################################################################
                         After cluster transition 1/2
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
leaving    12.5%      0.0%    dev7@127.0.0.1
valid      12.5%     12.5%    dev1@127.0.0.1
valid      12.5%     12.5%    dev2@127.0.0.1
valid      12.5%     25.0%    dev3@127.0.0.1
valid      12.5%     12.5%    dev4@127.0.0.1
valid      12.5%     12.5%    dev5@127.0.0.1
valid      12.5%     12.5%    dev6@127.0.0.1
valid      12.5%     12.5%    dev8@127.0.0.1
-------------------------------------------------------------------------------
Valid:7 / Leaving:1 / Exiting:0 / Joining:0 / Down:0

Transfers resulting from cluster changes: 8
  8 transfers from 'dev7@127.0.0.1' to 'dev3@127.0.0.1'

###############################################################################
                         After cluster transition 2/2
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      12.5%     15.6%    dev1@127.0.0.1
valid      12.5%     14.1%    dev2@127.0.0.1
valid      25.0%     14.1%    dev3@127.0.0.1
valid      12.5%     14.1%    dev4@127.0.0.1
valid      12.5%     14.1%    dev5@127.0.0.1
valid      12.5%     14.1%    dev6@127.0.0.1
valid      12.5%     14.1%    dev8@127.0.0.1
-------------------------------------------------------------------------------
Valid:7 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

Transfers resulting from cluster changes: 53
  1 transfers from 'dev6@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev5@127.0.0.1'
  2 transfers from 'dev8@127.0.0.1' to 'dev4@127.0.0.1'
  2 transfers from 'dev6@127.0.0.1' to 'dev2@127.0.0.1'
  2 transfers from 'dev5@127.0.0.1' to 'dev1@127.0.0.1'
  2 transfers from 'dev4@127.0.0.1' to 'dev8@127.0.0.1'
  3 transfers from 'dev3@127.0.0.1' to 'dev6@127.0.0.1'
  2 transfers from 'dev2@127.0.0.1' to 'dev5@127.0.0.1'
  2 transfers from 'dev1@127.0.0.1' to 'dev4@127.0.0.1'
  2 transfers from 'dev8@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev2@127.0.0.1'
  3 transfers from 'dev3@127.0.0.1' to 'dev1@127.0.0.1'
  2 transfers from 'dev6@127.0.0.1' to 'dev8@127.0.0.1'
  2 transfers from 'dev5@127.0.0.1' to 'dev6@127.0.0.1'
  2 transfers from 'dev4@127.0.0.1' to 'dev5@127.0.0.1'
  3 transfers from 'dev3@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev1@127.0.0.1'
  2 transfers from 'dev3@127.0.0.1' to 'dev8@127.0.0.1'

Two things occur here. Firstly, it looks like we do a simple transfer, and only node 3 is valid for taking vnodes from node 7 (with respect to target_n_val).

This is really bad, as this requires double the capacity on node 3 to complete the operation.

However, clearly a second phase to rebalance does occur.

If we put in code to prevent simple transfer from being attempted, we get:

=============================== Staged Changes ================================
Action         Details(s)
-------------------------------------------------------------------------------
leave          'dev7@127.0.0.1'
-------------------------------------------------------------------------------

NOTE: Applying these changes will result in 2 cluster transitions

###############################################################################
                         After cluster transition 1/2
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
leaving    12.5%      0.0%    dev7@127.0.0.1
valid      12.5%     15.6%    dev1@127.0.0.1
valid      12.5%     14.1%    dev2@127.0.0.1
valid      12.5%     14.1%    dev3@127.0.0.1
valid      12.5%     14.1%    dev4@127.0.0.1
valid      12.5%     14.1%    dev5@127.0.0.1
valid      12.5%     14.1%    dev6@127.0.0.1
valid      12.5%     14.1%    dev8@127.0.0.1
-------------------------------------------------------------------------------
Valid:7 / Leaving:1 / Exiting:0 / Joining:0 / Down:0

WARNING: Not all replicas will be on distinct nodes

Transfers resulting from cluster changes: 51
  1 transfers from 'dev7@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev2@127.0.0.1'
  2 transfers from 'dev8@127.0.0.1' to 'dev1@127.0.0.1'
  2 transfers from 'dev7@127.0.0.1' to 'dev8@127.0.0.1'

###############################################################################
                         After cluster transition 2/2
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      15.6%      --      dev1@127.0.0.1
valid      14.1%      --      dev2@127.0.0.1
valid      14.1%      --      dev3@127.0.0.1
valid      14.1%      --      dev4@127.0.0.1
valid      14.1%      --      dev5@127.0.0.1
valid      14.1%      --      dev6@127.0.0.1
valid      14.1%      --      dev8@127.0.0.1
-------------------------------------------------------------------------------
Valid:7 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

WARNING: Not all replicas will be on distinct nodes

However, note the warning: the outcome is balanced, but target_n_val is now not met.

martinsumner commented 2 years ago

In the case above, this seems to be a failure for a preflist at n_val 3 as well:

 riak_core_ring_util:check_ring(FR2).    
[[{1415829711164312202009819681693899175291684651008,
   'dev8@127.0.0.1'},
  {1438665674247607560106752257205091097473808596992,
   'dev2@127.0.0.1'},
  {0,'dev2@127.0.0.1'}],
 [{1438665674247607560106752257205091097473808596992, 
   'dev2@127.0.0.1'},
  {0,'dev2@127.0.0.1'},
  {22835963083295358096932575511191922182123945984,
   'dev3@127.0.0.1'}]]

systream commented 2 years ago

I haven't dug into the code myself yet, but if I understand correctly it would be nice to claim a new ring when a node leaves (at least when location is set), and maybe the claim algorithm should be optimised for transfers.

martinsumner commented 2 years ago

The initial change I made, https://github.com/basho/riak_core/commit/db9c3f017f612527ee85cfd9a8673ca3b2b799e9, isn't effective, as the fallback position on leave is to use riak_core_claim:claim_rebalance_n/2 - the old diagonalisation algorithm, which doesn't resolve tail violations.

Experimenting with calling riak_core_claim:sequential_claim/2 instead.

Also, after the call to riak_core_gossip:attempt_simple_transfer, there is still a call to riak_core_claim:claim/1 in https://github.com/basho/riak_core/blob/develop-3.0/src/riak_core_claimant.erl#L1454-L1468. This is what is currently triggering the second phase of the plan. However, it does not appear to rebalance as expected.

The location concerns may be false because of this.

martinsumner commented 2 years ago

After this commit:

=============================== Staged Changes ================================
Action         Details(s)
-------------------------------------------------------------------------------
leave          'dev7@127.0.0.1'
-------------------------------------------------------------------------------

NOTE: Applying these changes will result in 2 cluster transitions

###############################################################################
                         After cluster transition 1/2
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
leaving    12.5%      0.0%    dev7@127.0.0.1
valid      12.5%     15.6%    dev1@127.0.0.1
valid      12.5%     14.1%    dev2@127.0.0.1
valid      12.5%     14.1%    dev3@127.0.0.1
valid      12.5%     14.1%    dev4@127.0.0.1
valid      12.5%     14.1%    dev5@127.0.0.1
valid      12.5%     14.1%    dev6@127.0.0.1
valid      12.5%     14.1%    dev8@127.0.0.1
-------------------------------------------------------------------------------
Valid:7 / Leaving:1 / Exiting:0 / Joining:0 / Down:0

Transfers resulting from cluster changes: 55
  1 transfers from 'dev7@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev5@127.0.0.1'
  2 transfers from 'dev8@127.0.0.1' to 'dev4@127.0.0.1'
  2 transfers from 'dev7@127.0.0.1' to 'dev3@127.0.0.1'
  2 transfers from 'dev6@127.0.0.1' to 'dev2@127.0.0.1'
  2 transfers from 'dev5@127.0.0.1' to 'dev1@127.0.0.1'
  2 transfers from 'dev4@127.0.0.1' to 'dev8@127.0.0.1'
  2 transfers from 'dev3@127.0.0.1' to 'dev6@127.0.0.1'
  2 transfers from 'dev2@127.0.0.1' to 'dev5@127.0.0.1'
  2 transfers from 'dev1@127.0.0.1' to 'dev4@127.0.0.1'
  2 transfers from 'dev8@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev2@127.0.0.1'
  2 transfers from 'dev7@127.0.0.1' to 'dev1@127.0.0.1'
  2 transfers from 'dev6@127.0.0.1' to 'dev8@127.0.0.1'
  2 transfers from 'dev5@127.0.0.1' to 'dev6@127.0.0.1'
  2 transfers from 'dev4@127.0.0.1' to 'dev5@127.0.0.1'
  2 transfers from 'dev3@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev8@127.0.0.1'

###############################################################################
                         After cluster transition 2/2
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      15.6%      --      dev1@127.0.0.1
valid      14.1%      --      dev2@127.0.0.1
valid      14.1%      --      dev3@127.0.0.1
valid      14.1%      --      dev4@127.0.0.1
valid      14.1%      --      dev5@127.0.0.1
valid      14.1%      --      dev6@127.0.0.1
valid      14.1%      --      dev8@127.0.0.1
-------------------------------------------------------------------------------
Valid:7 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

This is with always_rebalance_onleave set to on. It is an improvement in that, if simple_transfer is going to lead to a significant temporary imbalance, this can be avoided by directly rebalancing from scratch, using the new algorithm, which will handle tail violations.

martinsumner commented 2 years ago

I think I now have a handle on the original problem that prompted this. The problem was a cluster plan like this:

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
leaving     4.7%      0.0%    'riak@<x>.<y>.<z>.30'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.11'
valid       5.1%      5.1%    'riak@<x>.<y>.<z>.12'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.129'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.13'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.14'
valid       4.7%      5.5%    'riak@<x>.<y>.<z>.15'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.16'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.17'
valid       4.7%      5.1%    'riak@<x>.<y>.<z>.18'
valid       4.7%      5.5%    'riak@<x>.<y>.<z>.19'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.20'
valid       5.1%      5.1%    'riak@<x>.<y>.<z>.21'
valid       5.1%      5.1%    'riak@<x>.<y>.<z>.22'
valid       5.1%      6.3%    'riak@<x>.<y>.<z>.23'
valid       4.7%      5.5%    'riak@<x>.<y>.<z>.24'
valid       4.7%      5.1%    'riak@<x>.<y>.<z>.25'
valid       4.7%      5.1%    'riak@<x>.<y>.<z>.26'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.27'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.28'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.29'
-------------------------------------------------------------------------------
Valid:20 / Leaving:1 / Exiting:0 / Joining:0 / Down:0

The issue with this plan is that one node takes two extra vnodes (valid 5.1% 6.3% 'riak@<x>.<y>.<z>.23'). This is after a simple_transfer - so why did the call to riak_core_claim:claim/1 do nothing, and fail to rebalance this?

The problem is that claim_until_balanced looks for nodes with Wants:

https://github.com/basho/riak_core/blob/riak_kv-3.0.5/src/riak_core_claim.erl#L101-L114

The issue here is that, after the simple_transfer, every node already has at least RingSize div NodeCount vnodes. A node has an excess of Claims, but no node has a deficit of Wants - so the ring is considered balanced prematurely.

This isn't going to be triggered on small clusters. It is only certain to be a problem when:

RingSize div NodeCount == RingSize div (NodeCount - LeaveCount)

e.g. when leaving just one node from a large cluster (in this case going from 21 nodes to 20 with a ring size of 256). However, there could be other cases when rebalancing will commence, but end prematurely (whilst nodes still have an excess of claims).
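
For the numbers above (ring size 256, going from 21 nodes to 20), the condition clearly holds:

%% RingSize div NodeCount before and after the leave
256 div 21.
12
256 div 20.
12
(256 div 21) =:= (256 div 20).
true

So after the simple_transfer every remaining node already owns at least 12 vnodes, no node has a deficit of Wants, and claim_until_balanced stops even though one node is still carrying an excess.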

martinsumner commented 2 years ago

@systream - apologies, false alarm, I don't think there is an issue with location awareness here. This is just an issue with balancing the cluster correctly.

martinsumner commented 2 years ago

Clarification as to the issue:

On leave, the following process is used:

1a. Attempt a simple_transfer: try to transfer the vnodes from the leaving node to safe places, and if there are multiple safe places, choose at random.
1b. If the simple_transfer is not possible, i.e. a state is reached where a vnode has no safe place (compatible with target_n_val), then a full re-diagonalisation is called (an entirely fresh plan is created, with a set of transfers needed to get there).
2a. After Step 1, claim is re-run so that claim_until_balanced will rebalance the ring (if, in the case of 1a, the simple_transfer has created an unbalanced ring).
2b. The re-run of claim may also hit an unsafe outcome, and then trigger a full re-diagonalisation (as in 1b).
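
Expressed as pseudocode in Erlang form (a sketch of the flow only - these function names are placeholders, not the actual claimant code):

%% Sketch of the leave-planning flow described above.
plan_leave(Ring, LeavingNode) ->
    Ring1 =
        case attempt_simple_transfer(Ring, LeavingNode) of
            {ok, SafeRing} ->
                %% 1a: every vnode found a safe target; chosen at random
                %% when several targets were safe
                SafeRing;
            target_n_fail ->
                %% 1b: some vnode had no safe target - fall back to a full
                %% re-diagonalisation and build a fresh plan
                rediagonalise(Ring, LeavingNode)
        end,
    %% 2a/2b: claim is then re-run; claim_until_balanced may rebalance the
    %% ring, or itself fall back to re-diagonalisation if the outcome is unsafe
    rerun_claim(Ring1).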

The issues we have are:

A: In Step 1b the deprecated re-diagonalisation claim_rebalance_n is used, not sequential_claim. The deprecated rebalance function does not avoid tail violations - and so may unnecessarily return an unsafe cluster plan (with the warning that "Not all replicas will be on distinct nodes").

B: In Step 1a an extremely unbalanced cluster may be created (i.e. one node may take all the transferred vnodes from the leaving node). This may be unsupportable from a capacity perspective. Commonly in this case, Step 2b will be invoked, and so re-diagonalisation will occur anyway. In this case it would be preferable to skip 1a, and force the use of 1b.

C: Sometimes Step 2b will not rebalance correctly, especially when leaving small numbers of nodes from large clusters - where RS div NodeCount == RS div (NodeCount - LeaveCount). Fixing this would be a significant change to the claim function. It could be mitigated by a simpler change to the simple_transfer function to bias the original distribution, making imbalances less likely in Step 1a.

The branch https://github.com/basho/riak_core/tree/mas-i970-simpleleave currently provides:

A fix for (A), and a workaround for (B) - force Step 1b via configuration.
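
For anyone wanting to try the workaround, the override would presumably sit in the riak_core application environment (advanced.config style) under the name used in the earlier comment - a sketch only; check the branch for the actual setting name:

%% advanced.config sketch - the setting name/location is determined by
%% the mas-i970-simpleleave branch and may differ
[
 {riak_core, [
   {always_rebalance_onleave, true}
 ]}
].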

systream commented 2 years ago

Are there any use cases where a customer prefers a probably faster leave over a potentially imbalanced/improper ring? Maybe a ring rebalance should always be forced on node leave.

martinsumner commented 2 years ago

The PR https://github.com/basho/riak_core/pull/971 provides an improvement to B. This also partially mitigates the issues with C, as the input to C is less likely to be imbalanced.

martinsumner commented 2 years ago

@systream with very big databases a full re-diagonalisation to rebalance can take a huge amount of time (potentially many days) - so I think there will always be cases where a simple_transfer followed by a rebalance via claim is the preferred way.

I think the fixes we now have for this issue provide sufficient mitigation and improvement. Just need to figure out the best way to test this.

martinsumner commented 2 years ago

There appears to be an additional bug, as attempt_simple_transfer does not check for tail violations - that is to say, when it reaches the tail of the ring, it does not check forward, wrapping back around to the front of the ring (i.e. with a ring size of 64, Idx 63 should not overlap with Idx 0, 1, 2).
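
For illustration, a wrap-around check of this kind might look like the sketch below (not the code in the PR; it takes the list of owner nodes in ring order and the target_n_val):

%% Sketch: a tail violation exists if any window of TargetN consecutive
%% positions that spans the end of the ring and wraps to the front
%% contains a repeated owner. With a ring size of 64 and TargetN of 4,
%% this checks (61,62,63,0), (62,63,0,1) and (63,0,1,2).
-spec tail_violation([node()], pos_integer()) -> boolean().
tail_violation(OwnerNodes, TargetN) ->
    RingSize = length(OwnerNodes),
    Wrapped = OwnerNodes ++ lists:sublist(OwnerNodes, TargetN - 1),
    TailWindows =
        [lists:sublist(Wrapped, Pos + 1, TargetN)
            || Pos <- lists:seq(RingSize - TargetN + 1, RingSize - 1)],
    lists:any(
        fun(Window) -> length(lists:usort(Window)) < length(Window) end,
        TailWindows).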

martinsumner commented 2 years ago

The attempt_simple_transfer logic is now refactored in https://github.com/basho/riak_core/pull/971. This now uses the same method for looking forwards and backwards in the ring, with that method supporting checks for tail violations, as it loops around the ring when required.