OpenRiak / riak_kv


Repeated cost of `aae_fetchclocks_repair` #25

Open martinsumner opened 7 months ago


The riak_kv/aae_fetchclocks_repair environment variable is used to handle the issue of a corrupted AAE segment hash. The root cause of this issue is unknown, but there may be a circumstance whereby a segment in the AAE tree does not correctly represent the accumulation of the Keys/Clocks hashed into that segment.
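
For clarity, here is a minimal sketch of how that flag can be read, assuming it is held in the riak_kv application environment and defaults to false. `repair_enabled/0` is a hypothetical helper used in the sketches below, not a function in riak_kv.

```erlang
%% Hypothetical helper (not part of riak_kv): read the repair flag from the
%% riak_kv application environment, defaulting to false when it is unset.
repair_enabled() ->
    application:get_env(riak_kv, aae_fetchclocks_repair, false).
```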

In that circumstance - in some very rare cases - all 3 vnodes in a cluster may have the same misrepresentation, whilst another cluster (which is otherwise in sync) does not. In this case full-sync will show a single mismatched segment, but the subsequent clock comparisons will prompt no repairs.

This situation is resolved by riak_kv_ttaaefs_manager detecting it after running an all_check and then calling riak_kv_ttaaefs_manager:trigger_tree_repairs/0. This switches the environment variable riak_kv/aae_fetchclocks_repair to true, which prompts the next AAE compare-clocks query to use aae_controller:aae_fetchclocks/5 to return the keys and clocks (a query which will also rebuild the AAE tree cache segment for this preflist/vnode).
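
As a hedged illustration of that branch, the sketch below shows how a compare-clocks query could dispatch on the flag. Apart from aae_controller:aae_fetchclocks/5 and the flag itself, the function and argument names are placeholders rather than the real riak_kv code path.

```erlang
%% Illustrative dispatch only: compare_clocks/4, its arguments and
%% standard_compare_clocks/4 are placeholders; repair_enabled/0 is the
%% hypothetical helper sketched above.
compare_clocks(Cntrl, IndexNs, SegmentIDs, ReturnFun) ->
    case repair_enabled() of
        true ->
            %% Repairing variant: returns the keys and clocks, and rebuilds
            %% the cached tree segment for this preflist/vnode as a
            %% side-effect (fifth argument is a placeholder).
            aae_controller:aae_fetchclocks(Cntrl, IndexNs, SegmentIDs, ReturnFun, null);
        false ->
            %% Standard, cheaper compare-clocks query.
            standard_compare_clocks(Cntrl, IndexNs, SegmentIDs, ReturnFun)
    end.
```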

Once riak_kv_ttaaefs_manager detects that the clusters are in sync, riak_kv/aae_fetchclocks_repair is reset to false by calling riak_kv_ttaaefs_manager:disable_tree_repairs/0.
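
For reference, a minimal sketch of what this enable/disable pair could amount to, assuming the flag is simply toggled in the local riak_kv application environment; the actual riak_kv_ttaaefs_manager implementation may differ.

```erlang
%% Hedged sketch, not the real riak_kv_ttaaefs_manager code.
trigger_tree_repairs() ->
    %% Enable the repairing fetch_clocks variant for subsequent queries
    %% served by vnodes on this node only.
    application:set_env(riak_kv, aae_fetchclocks_repair, true).

disable_tree_repairs() ->
    %% Revert subsequent queries to the standard compare-clocks behaviour.
    application:set_env(riak_kv, aae_fetchclocks_repair, false).
```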

Setting this environment variable affects the local node only, and the faulty cached trees may not be on that node. As each node discovers the fault it will enable the repair mode in turn, and eventually the faulty cache should be detected and repaired, after which subsequent queries should revert to the standard type.

However, during this process an individual vnode may needlessly repair an unbroken segment multiple times. Running a fetch_clocks query has an elevated cost, and it also lacks concurrency controls (as it uses the per-vnode aae_controller's aae_runner queue rather than the cross-node af3_queue). In a 12-node cluster, the first node to trigger repair will repair each vnode on average 4 times before the repair mode is disabled (assuming N=3).