basho / riak_core

Distributed systems infrastructure used by Riak.
Apache License 2.0

Ordering of transfers and the impact of `forced_ownership_handoff` threshold #1013

Open martinsumner opened 10 months ago

martinsumner commented 10 months ago

In an ownership change (i.e. join, leave, replace or location change), once a transfer commences, and assuming vnodes are busy enough to prevent the `vnode_inactivity_timeout` from triggering, all transfers will be prompted by the vnode manager's scheduled tick (default every 10s).

If there are pending transfers, the tick takes the list of transfers from `riak_core_ring:pending_changes/1` and applies the `forced_ownership_handoff` threshold (default 8), so that only the first `forced_ownership_handoff` items in the list are examined.

The vnode manager then looks for itself (as a sender) in that sublist. If there is a valid transfer candidate, the transfer is attempted; it may or may not succeed, depending on the per-node `handoff_concurrency` transfer limit.
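The selection described above can be modelled with a short sketch. This is an illustrative Python model, not the riak_core code: the `PendingChange` type, the function names and the node names are all hypothetical, and the real logic lives in Erlang in the vnode manager. It shows how a node whose outgoing transfers sit late in the list may find fewer candidates than its transfer limit would allow.

```python
from typing import List, NamedTuple


class PendingChange(NamedTuple):
    """Hypothetical stand-in for one entry of the pending changes list."""
    index: int        # position of the partition in ring order
    owner: str        # current owner, i.e. the sender of the transfer
    next_owner: str   # node that will receive the partition


def candidates_on_tick(pending: List[PendingChange],
                       this_node: str,
                       threshold: int) -> List[PendingChange]:
    """Return the transfers this node may attempt on one tick.

    Only the first `threshold` entries are examined, and only those
    where this node is the sender are kept.
    """
    window = pending[:threshold]
    return [c for c in window if c.owner == this_node]


def to_start(candidates: List[PendingChange],
             running: int,
             transfer_limit: int) -> List[PendingChange]:
    """Cap the candidates by the free slots under the per-node limit."""
    free_slots = max(0, transfer_limit - running)
    return candidates[:free_slots]


# Node "a" sends from positions 0, 1, 9, 10 and 11; "b" sends the rest.
senders = ["a", "a"] + ["b"] * 7 + ["a", "a", "a"] + ["b"] * 4
pending = [PendingChange(i, s, "new") for i, s in enumerate(senders)]

found = candidates_on_tick(pending, "a", threshold=8)
started = to_start(found, running=0, transfer_limit=4)
```

In this example node `a` has five pending outgoing transfers and a limit of four, but only two fall inside the first eight entries, so at most two can start until earlier changes complete and drop out of the list.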

There are aspects of this which are potentially confusing:

At a minimum, the riak_core_schema needs to be updated to allow `forced_ownership_handoff` to be configured, and the transfer limit configuration stanza should clarify that the limit may not be reached unless `forced_ownership_handoff` is also re-configured.

It may also be worth considering whether using partition order on `riak_core_ring:pending_changes/1` is correct. Should the order be deterministically set so that nodes receiving transfers are evenly distributed through the list, so that they don't over-grow on an interim basis by first receiving and then offloading? However, there may be some data safety in the existing ring order: keeping changes within the same preflist close together in the order reduces any window in which that preflist may be temporarily in an unsafe state. For example, if the partition in position M is moving to node X, and partition M + 1 is moving from X, and these two moves happened at the beginning and end of the list of changes, the window of unsafety when target_n_val is not met would be unnecessarily extended.
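The interim over-growth concern can be illustrated with a toy calculation. The Python sketch below is an assumption-laden model rather than anything taken from riak_core: it treats each change as a `(sender, receiver)` pair and assumes transfers complete strictly in list order, which real handoffs do not guarantee. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Change = Tuple[str, str]  # (sender, receiver), a hypothetical simplification


def peak_interim_growth(pending: List[Change]) -> Dict[str, int]:
    """Return, per receiving node, the largest net number of extra
    partitions it holds at any point while the list is worked through
    in order (assumes one transfer completes at a time, in list order).
    """
    net: Dict[str, int] = defaultdict(int)
    peak: Dict[str, int] = defaultdict(int)
    for sender, receiver in pending:
        net[sender] -= 1
        net[receiver] += 1
        peak[receiver] = max(peak[receiver], net[receiver])
    return dict(peak)


# Node "x" receives two partitions and offloads two.
receive_first = [("a", "x"), ("b", "x"), ("x", "c"), ("x", "d")]
interleaved = [("a", "x"), ("x", "c"), ("b", "x"), ("x", "d")]

peak_receive_first = peak_interim_growth(receive_first)
peak_interleaved = peak_interim_growth(interleaved)
```

Under these assumptions node `x` peaks at two extra partitions when it receives first, and at one when receipts and offloads alternate. The model says nothing about preflist safety, which is the competing concern raised above.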

A possible re-ordering of the output of `riak_core_ring:pending_changes/1` from within `riak_core_vnode_manager:trigger_ownership_handoff/4` might be to place all pending changes for joining nodes at the top of the list, and leave the remainder ordered by ring position. By first maximising use of fresh capacity, this would minimise the risk of interim states during joins overloading disk space on existing nodes.
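A minimal sketch of that re-ordering, written in Python for illustration only. The `PendingChange` type and `joins_first` function are hypothetical, and the sketch assumes the set of joining nodes is already known to the caller; how that set would be derived from the ring is not covered here.

```python
from typing import List, NamedTuple, Set


class PendingChange(NamedTuple):
    """Hypothetical stand-in for one entry of the pending changes list."""
    index: int        # position of the partition in ring order
    owner: str        # current owner, i.e. the sender
    next_owner: str   # node that will receive the partition


def joins_first(pending: List[PendingChange],
                joining: Set[str]) -> List[PendingChange]:
    """Move changes whose receiver is a joining node to the front.

    The input is assumed to be in ring order already; both groups keep
    that relative order, so the remainder stays ordered by ring position.
    """
    to_joining = [c for c in pending if c.next_owner in joining]
    remainder = [c for c in pending if c.next_owner not in joining]
    return to_joining + remainder


pending = [
    PendingChange(0, "a", "b"),
    PendingChange(1, "b", "new1"),
    PendingChange(2, "c", "a"),
    PendingChange(3, "a", "new1"),
]

reordered = joins_first(pending, joining={"new1"})
```

Here the changes at positions 1 and 3, which both target the joining node, move ahead of positions 0 and 2. Because the threshold is applied to the front of the list, transfers onto fresh capacity would then be examined first.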