basho / riak_kv

Riak Key/Value Store
Apache License 2.0

Riak_kv K8s memory consumption #1889

Open azowarg opened 3 months ago

azowarg commented 3 months ago

Hello, I would like to ask for some advice. We have a 5-node Riak KV ring installed in a Kubernetes cluster. About 500 GB of data is stored on each node, but there is also a bunch of corrupted keys. We use "leveldb" as the backend and have enabled tictacaae.

storage_backend = leveldb
anti_entropy = passive
tictacaae_active = active

1) The first problem is memory consumption in the K8s environment. In our case with corrupted keys, our impression is that Riak is trying to restore them, but it uses all the requested memory and goes beyond the pod memory limits, at which point the pod is OOM-killed. Is it possible to set memory limits for the Riak application? We tried several workarounds with the leveldb.maximum_memory.percentage and leveldb.maximum_memory settings (sketched just below), but without any result.

2) Is there a scenario for some kind of "forced" key recovery? How do I track the progress of the "recovery"? It looks like Riak is doing something, but we have no control over it and cannot influence it.
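For reference, the workaround settings we tried look roughly like this in riak.conf (key names as we understand them from the eleveldb schema; the values are only examples):

leveldb.maximum_memory.percent = 35
## or, alternatively, an absolute cap
leveldb.maximum_memory = 4GB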

Thanks in advance, I am ready to provide any additional information.

martinsumner commented 3 months ago

1.

What version of Riak? How many keys overall? What is the average size of the keys (including bucket and type names)? What is the per-pod memory limit?

Enabling the leveldb limits doesn't alter the leveled backend demand for memory, and enabling tictacaae will enable the leveled backend. There has been a lot of work on memory efficiency and AAE efficiency recently, but much of that has yet to make it through to the basho repos.

There are some specific operations that can be memory intensive (like AAE store rebuilds). Is there a pattern of the memory suddenly spiking?

If memory is very tight, you probably have more control over memory if you use the legacy anti_entropy rather than tictacaae. Tictacaae is designed to work in native mode (when the storage_backend is leveled); running it in parallel mode (i.e. with another backend), as here, has had far less non-functional testing.
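For reference, native mode is just a backend choice in riak.conf; a minimal sketch (illustrative only - moving an existing leveldb node to leveled is a data migration exercise in its own right):

storage_backend = leveled
tictacaae_active = active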

2.

In terms of recovery, AAE is used to discover differences, and then those differences are resolved by AAE prompting read repair. Read repair will read the values for a suspect key from all N vnodes that hold them, and over-write any out-of-date values (using the vector clocks to determine what is out of date).

As for the process of AAE discovering the differences:

There are some logs in the riak_kv_vnode module that summarise what happens in each cycle, as well as some stats in riak stats. If you're getting memory spikes you might want to look for logs about rebuilds starting/completing and see if they correlate.
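As a rough way to look for that correlation (the log path assumes the default platform_log_dir; exact stat names vary by version):

grep -i rebuild /var/log/riak/console.log
riak admin status | grep -i aae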

azowarg commented 3 months ago

Thank you for the quick response! The version of Riak is 3.0.12. There are more than 100 million keys in total, and some buckets hold more than 10 million keys each. At first boot, no more than 5-6 GB of memory is consumed by one pod. As soon as requests to the node begin, memory consumption starts to increase gradually (we assume the corrupted keys are being accessed and the internal read_repair procedure starts). We also launched a separate copy of our cluster from a data backup, with no load or requests from our services; similar cluster behavior starts there if, for example, you manually run riak_client:aae_fold({repair_key_range, <<"any_bucket">>, all, all, all}). For a while we used 20 GB limits on the pods and waited for the internal read_repair processes to finish. Over the past couple of weeks we have moved the pods to 30 GB limits, the idea being to allocate more and observe the cluster's behavior. But over time memory consumption reaches the pod limits and does not decrease. Here is riak admin top output from a node with 30 GB memory consumption.

========================================================================================
 'riak@riak-4.riak-headless'                                               11:53:25
 Load:  cpu         0               Memory:  total     7972862    binary    3530027
        procs   17207                        processes 4339132    code        18238
        runq        0                        atom         5685    ets         38894

Pid            Name or Initial Func    Time    Reds  Memory    MsgQ Current Function
----------------------------------------------------------------------------------------
<8769.17478.0> aae_runner:init/1        '-'******** 1574136       0 gen:do_call/4
<8769.16084.0> aae_runner:init/1        '-'********  973164       0 gen_server:loop/7
<8769.16962.0> aae_runner:init/1        '-'********  601780       0 gen_server:loop/7
<8769.17360.0> aae_runner:init/1        '-'********  372252       0 gen_server:loop/7
<8769.770.0>   riak_core_vnode_work     '-'******** 2546416       0 gen_server:loop/7
<8769.16319.0> riak_core_vnode_work     '-'********   13800       0 gen_server:loop/7
<8769.17365.0> riak_core_vnode_work     '-'********    8864       0 gen_server:loop/7
<8769.16980.0> riak_core_vnode_work     '-'********    8864       0 gen_server:loop/7
<8769.16307.0> riak_core_vnode_work     '-'********    8864       0 gen_server:loop/7
<8769.17332.0> riak_core_vnode_work     '-'********   13800       0 gen_server:loop/7
========================================================================================
========================================================================================
 'riak@riak-4.riak-headless'                                               11:53:31
 Load:  cpu         1               Memory:  total     7965241    binary    3517755
        procs   17206                        processes 4344504    code        18238
        runq        0                        atom         5685    ets         38937

Pid            Name or Initial Func    Time    Reds  Memory    MsgQ Current Function
----------------------------------------------------------------------------------------
<8769.17478.0> aae_runner:init/1        '-'19648973  973224       0 gen:do_call/4
<8769.15481.0> leveled_sst:init/1       '-'  432374  372404       0 gen_fsm:loop/8
<8769.15280.0> leveled_sst:init/1       '-'  427503  372404       0 gen_fsm:loop/8
<8769.3200.0>  leveled_sst:init/1       '-'  424963  372404       0 gen_fsm:loop/8
<8769.3224.0>  leveled_sst:init/1       '-'  424282  372404       0 gen_fsm:loop/8
<8769.3189.0>  leveled_sst:init/1       '-'  423038  372404       0 gen_fsm:loop/8
<8769.3267.0>  leveled_sst:init/1       '-'  422548  372404       0 gen_fsm:loop/8
<8769.15246.0> leveled_sst:init/1       '-'  422333  372404       0 gen_fsm:loop/8
<8769.15491.0> leveled_sst:init/1       '-'  422023  372404       0 gen_fsm:loop/8
<8769.15740.0> leveled_sst:init/1       '-'  421316  230548       0 gen_fsm:loop/8
========================================================================================

All the context about this behavior and the search for corrupted keys can be found in our previous issue from my colleague. So we tried to "force" the recovery of these keys manually, by making requests for them or by using aae_fold queries like riak_client:aae_fold({object_stats, <<"any_bucket">>, all, {date, {{2023, 8, 5}, {12, 0, 0}}, {{2023, 8, 5}, {13, 0, 0}}}}). But it is impossible to predict the recovery. For example, riak may return a given key to a curl request within a couple of days, or maybe not. Additional log entries also appear sometimes.

Inbound clock entry for <<212,115,185,63,164,208,25,139>> in <<"example_bucket_1">>/<<"example_key_1">> greater than local.Epochs: {In:3543632 Local:0}. Counters: {In:41 Local:0}.
Inbound clock entry for <<212,115,185,63,164,208,25,139>> in <<"example_bucket_1">>/<<"example_key_1">> greater than local.Epochs: {In:0 Local:0}. Counters: {In:11 Local:0}.
Inbound clock entry for <<212,115,185,63,164,208,25,139>> in <<"example_bucket_1">>/<<"example_key_1">> greater than local.Epochs: {In:0 Local:0}. Counters: {In:1 Local:0}.
Inbound clock entry for <<124,245,8,174,118,96,56,16>> in <<"example_bucket_2">>/<<"example_key_2">> greater than local.Epochs: {In:7965961 Local:0}. Counters: {In:1 Local:0}.
Inbound clock entry for <<124,245,8,174,118,96,56,16>> in <<"example_bucket_2">>/<<"example_key_2">> greater than local.Epochs: {In:7895995 Local:0}. Counters: {In:1 Local:0}.
Inbound clock entry for <<124,245,8,174,118,96,56,16>> in <<"example_bucket_2">>/<<"example_key_2">> greater than local.Epochs: {In:7917020 Local:0}. Counters: {In:1 Local:0}.
martinsumner commented 3 months ago

The underlying issue of corrupting objects is fixed in this PR https://github.com/basho/riak_kv/pull/1882. This is in the nhs-riak 3.2.2 release, but hasn't been accepted into the basho repo yet.

The key amnesia logs I assume are from 3 different vnodes. The Inbound counter on one of them is showing nearly 8m updates for that vnode alone. Is that expected through normal activity? I assume not, and in fact these keys are stuck in some continuous AAE exchange/repair key churn. Do you think this is the case? What is the output of this log - https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_vnode.erl#L1215-L1226

If you are in some continuous exchange churn you might be able to improve your symptoms by increasing your exchange tick to reduce the frequency of activity - https://github.com/basho/riak_kv/blob/develop-3.0/priv/riak_kv.schema#L150-L162.

There are a bunch of improvements to exchange efficiency that are in the latest nhs-riak 3.2.2 release that would help too. However, I have no way of getting these upstream into the basho repo at present. You may wish to consider the relative risks of building your own release from nhs-riak if you need these changes.

azowarg commented 3 months ago

Here is the output for this log. exchanges_expected and exchanges_completed are always 6, across different partitions and in the output from different nodes in general.

Tictac AAE loop completed for partition=1438665674247607560106752257205091097473808596992 with exchanges expected=6 exchanges completed=6 total deltas=790 total exchange_time=886 seconds loop duration=3360 seconds (elapsed)
Tictac AAE loop completed for partition=0 with exchanges expected=6 exchanges completed=6 total deltas=501 total exchange_time=341 seconds loop duration=3360 seconds (elapsed)
Tictac AAE loop completed for partition=68507889249886074290797726533575766546371837952 with exchanges expected=6 exchanges completed=6 total deltas=0 total exchange_time=5 seconds loop duration=3360 seconds (elapsed)
Tictac AAE loop completed for partition=137015778499772148581595453067151533092743675904 with exchanges expected=6 exchanges completed=7 total deltas=2934 total exchange_time=1949 seconds loop duration=3360 seconds (elapsed)
Tictac AAE loop completed for partition=159851741583067506678528028578343455274867621888 with exchanges expected=6 exchanges completed=6 total deltas=627 total exchange_time=779 seconds loop duration=3360 seconds (elapsed)
Tictac AAE loop completed for partition=1118962191081472546749696200048404186924073353216 with exchanges expected=6 exchanges completed=6 total deltas=0 total exchange_time=5 seconds loop duration=3360 seconds (elapsed)
Tictac AAE loop completed for partition=456719261665907161938651510223838443642478919680 with exchanges expected=6 exchanges completed=6 total deltas=707 total exchange_time=209 seconds loop duration=3360 seconds (elapsed)
Tictac AAE loop completed for partition=1187470080331358621040493926581979953470445191168 with exchanges expected=6 exchanges completed=6 total deltas=2133 total exchange_time=1344 seconds loop duration=3360 seconds (elapsed)
martinsumner commented 3 months ago

AAE is working very hard. You may be able to ease your symptoms by increasing the exchangetick.

i.e. from riak remote_console (on each node):

application:set_env(riak_kv, tictacaae_exchangetick, 1440000).

The change can be made permanently by changing the setting in riak.conf - https://github.com/nhs-riak/riak_kv/blob/nhse-develop/priv/riak_kv.schema#L150-L162.
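i.e. a riak.conf sketch of the same change (the value is in milliseconds, matching the set_env example above):

tictacaae_exchangetick = 1440000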

The long-term answer is to upgrade, to a version with the necessary fixes and performance improvements (i.e. Riak 3.2.2 with OTP22), but that would require you to build your own packages from the nhs-riak repo, given the current delays bringing the basho repo up-to-date.

azowarg commented 3 months ago

Thanks for the advice, we are going to try to build a new image with the new version.

Just for clarification: I found at least 8 vnodes in the cluster that output "Inbound clock entry for" entries. Some log more often, some less. Are there any tools that could help with our case specifically on these vnodes?

And could you please explain in more detail what exactly the parameters in this log output mean: Inbound clock entry for << node_id >> in <<"example_bucket_1">>/<<"example_key_1">> greater than local.Epochs: {In:3543632 Local:0}. Counters: {In:41 Local:0}. In particular, how the "In" and "Local" values are calculated and what they mean.

martinsumner commented 3 months ago

There's an explanation of incrementing key epochs here. The linked issue (KV679) and related PRs contain more information.

"Local" is referring to the vector clock of the object being replaced in the vnode, "In" is the new PUT arriving at the vnode.

If you look at the log_key_amnesia function in riak_kv_vnode, you can see it is called in two cases. If the object is not_found, then the Local values are assumed to be 0/0. So in this case a PUT arrived whose vector clock says this vnode has coordinated 41 changes to this object ... but the vnode found the object wasn't present.

The KV679 issue that this is there to solve is complicated. It covers various corner cases, which include data being deleted due to backend corruption/failure, but also the effect of deleting keys (when delete_mode =/= keep) and then writing new values to the object, only for "older" tombstones to be resurrected (known as the doombstone scenario).
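(For reference, delete_mode is a riak_kv application variable rather than a riak.conf setting; a sketch of how it is typically set in advanced.config, if you ever wanted tombstones kept:)

[
 {riak_kv, [
   {delete_mode, keep}
 ]}
].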

If you've highlighted 8 vnodes, it might be worth looking at their relative location in the ring i.e. which preflists they share. If they don't share preflists, you might have potential solutions involving partition repair. If they share preflists, then there might be some further clues as to what is happening.
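For example, from riak remote_console you can map vnode indexes to owners and check the preflist for one of the affected keys (a sketch; the bucket/key are placeholders and n_val=3 is assumed):

{ok, Ring} = riak_core_ring_manager:get_my_ring().
riak_core_ring:all_owners(Ring).   %% [{VnodeIndex, OwnerNode}, ...]
DocIdx = riak_core_util:chash_key({<<"example_bucket_1">>, <<"example_key_1">>}).
riak_core_apl:get_primary_apl(DocIdx, 3, riak_kv).   %% primary preflist for that key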

azowarg commented 3 months ago

After a couple of weeks of testing, we tried several scenarios. First, we decided to trigger data migration/transfer by adding a node to our test cluster. After it became a cluster member, we started to exclude the remaining nodes from the cluster one by one: we had a node leave the cluster and, after the ring rebuilt successfully, joined it to the cluster again. The first node finished the leave/join process in 4 days, and the remaining ones took approximately 24-36 hours each. As a result, we found that the log_key_amnesia outputs disappeared and no longer appear. Unfortunately, this made no difference to our "keys" case or to resource consumption. For example, the aae_fold erase_keys count returns a different number of keys each time (this cluster is not connected to our services and receives no inbound writes):

(riak@riak-1.riak-headless)2> riak_client:aae_fold({erase_keys,<<"bucket">>,all,all,all,count}).
{ok,122292978}
(riak@riak-1.riak-headless)3> riak_client:aae_fold({erase_keys,<<"bucket">>,all,all,all,count}).
{ok,118788850}
(riak@riak-1.riak-headless)4> riak_client:aae_fold({erase_keys,<<"bucket">>,all,all,all,count}).
{ok,118154097}
(riak@riak-1.riak-headless)5> riak_client:aae_fold({erase_keys,<<"bucket">>,all,all,all,count}).
{ok,118154221}
(riak@riak-1.riak-headless)6> riak_client:aae_fold({erase_keys,<<"bucket">>,all,all,all,count}).
{ok,117811339}
(riak@riak-1.riak-headless)7> riak_client:aae_fold({erase_keys,<<"bucket">>,all,all,all,count}).
{ok,122261776}

We changed the version of Riak, as you suggested, to 3.2.2 from your NHS repo. It works, but we don't see any visible effect on our data corruption case; we did notice that CPU consumption is lower. Then we tried to rebuild the tictac_aae trees: we stopped tictac on every node, restarted, deleted the tictac_aae directory, and enabled tictac again. At the start, the replica cluster had this directory space allocation for each of the 5 nodes:

0
220G    ./tictac_aae
256G    ./leveldb
1
272G    ./leveldb
249G    ./tictac_aae
2
264G    ./leveldb
174G    ./tictac_aae
3
199G    ./tictac_aae
273G    ./leveldb
4
170G    ./tictac_aae
251G    ./leveldb

The test replica cluster after the node addition and the leave/join procedure:

0
132G    ./tictac_aae
202G    ./leveldb
1
202G    ./leveldb
142G    ./tictac_aae
2
219G    ./leveldb
158G    ./tictac_aae
3
136G    ./tictac_aae
100G    ./leveldb
4
155G    ./tictac_aae
112G    ./leveldb
5
223G    ./leveldb
58G         ./tictac_aae

After deleting the directory and rebuilding:

0
141G    ./tictac_aae
202G    ./leveldb
1
202G    ./leveldb
142G    ./tictac_aae
2
219G    ./leveldb
150G    ./tictac_aae
3
136G    ./tictac_aae
201G    ./leveldb
4
155G    ./tictac_aae
222G    ./leveldb
5
223G    ./leveldb
155G    ./tictac_aae

Is it normal that our tictac_aae directories are, in my opinion, so big compared to the leveldb data directories? We don't enable storeheads.

      {tictacaae_primaryonly,true},
      {tictacaae_rangeboost,2},
      {tictacaae_repairloops,4},
      {tictacaae_maxresults,64},
      {tictacaae_rebuild_blocktime,1000},
      {tictacaae_rebuildtick,3600000},
      {tictacaae_exchangetick,480000}, <-- # tried to increase as you mentioned, seems it has little effect on memory consumption.
      {tictacaae_storeheads,false},
      {tictacaae_rebuilddelay,345600},
      {tictacaae_rebuildwait,336},
      {tictacaae_parallelstore,leveled_ko},

This still does not affect the keys, but memory consumption seems to have decreased. The "Tictac AAE loop completed" logs look the same as before:

Jul 4, 2024 @ 14:11:37.079 | Tictac AAE loop completed for partition=342539446249430371453988632667878832731859189760 with exchanges expected=6 exchanges completed=6 total deltas=0 total exchange_time=6 seconds loop duration=3360 seconds (elapsed)
Jul 4, 2024 @ 14:11:26.912 | Tictac AAE loop completed for partition=570899077082383952423314387779798054553098649600 with exchanges expected=6 exchanges completed=6 total deltas=0 total exchange_time=6 seconds loop duration=3360 seconds (elapsed)
Jul 4, 2024 @ 14:10:56.878 | Tictac AAE loop completed for partition=844930634081928249586505293914101120738586001408 with exchanges expected=6 exchanges completed=6 total deltas=0 total exchange_time=5 seconds loop duration=3360 seconds (elapsed)
Jul 4, 2024 @ 14:10:50.598 | Tictac AAE loop completed for partition=1347321821914426127719021955160323408745312813056 with exchanges expected=6 exchanges completed=6 total deltas=0 total exchange_time=5 seconds loop duration=3360 seconds (elapsed)
Jul 4, 2024 @ 14:10:27.824 | Tictac AAE loop completed for partition=1187470080331358621040493926581979953470445191168 with exchanges expected=6 exchanges completed=6 total deltas=4013 total exchange_time=472 seconds loop duration=3360 seconds (elapsed)
Jul 4, 2024 @ 14:02:26.658 | Tictac AAE loop completed for partition=137015778499772148581595453067151533092743675904 with exchanges expected=6 exchanges completed=6 total deltas=2478 total exchange_time=615 seconds loop duration=3360 seconds (elapsed)
martinsumner commented 3 months ago

I would expect tictac_aae folders to be much smaller than leveldb folders, unless your object sizes are very small (i.e. much less than 1KB). If you're not using store_heads - it is just keys and clocks I think (and a very small amount of additional metadata).

martinsumner commented 3 months ago

I don't understand what is happening. There is an environment variable in riak_kv of log_readrepair - you can set that to true, and it will log out the actual keys that it is repairing.

You may then be able to get some clues - are they all from one bucket, are they from a specific set of keys that have different activity patterns etc. It also logs the details of the vector clocks that differ, which may offer some further clues.
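e.g. from riak remote_console (on each node), in the same way as the exchange tick earlier in this thread - a sketch, assuming the flag is read at run time:

application:set_env(riak_kv, log_readrepair, true).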

azowarg commented 3 months ago

Does every log appearance for a unique key mean that this key is corrupted and will be passed to the recovery process? I received around 15k such lines within about 15 minutes, for different buckets. Is the "none" value in the "Clocks" field normal behavior?

  | Jul 4, 2024 @ 17:41:14.342 | Prompted read repair Bucket=<<"bucket_1">> Key=<<"key_1">> Clocks none [{<<124,245,8,174,26,199,201,241,0,160,152,128>>,{1,63868219612}}]
  | Jul 4, 2024 @ 17:41:14.342 | Prompted read repair Bucket=<<"bucket_1">> Key=<<"key_2">> Clocks none [{<<124,245,8,174,26,199,201,241,0,160,223,7>>,{1,63868261878}}]
  | Jul 4, 2024 @ 17:41:14.342 | Prompted read repair Bucket=<<"bucket_1">> Key=<<"key_3">> Clocks none [{<<109,136,98,215,109,134,234,195,0,122,222,116>>,{1,63868281693}}]
  | Jul 4, 2024 @ 17:41:14.342 | Prompted read repair Bucket=<<"bucket_1">> Key=<<"key_4">> Clocks none [{<<109,136,98,215,109,134,234,195,0,124,133,202>>,{1,63868471749}}]
  | Jul 4, 2024 @ 17:41:14.342 | Prompted read repair Bucket=<<"bucket_1">> Key=<<"key_5">> Clocks none [{<<109,136,98,215,109,134,234,195,0,121,220,87>>,{1,63868114394}}]
  | Jul 4, 2024 @ 17:41:14.342 | Prompted read repair Bucket=<<"bucket_1">> Key=<<"key_6">> Clocks none [{<<197,14,211,70,72,91,101,25,0,131,235,82>>,{1,63868230684}}]
  | Jul 4, 2024 @ 17:41:14.342 | Prompted read repair Bucket=<<"bucket_1">> Key=<<"key_7">> Clocks none [{<<109,136,98,215,109,134,234,195,0,123,243,186>>,{1,63868400663}}]
  | Jul 4, 2024 @ 17:41:14.341 | Prompted read repair Bucket=<<"bucket_1">> Key=<<"key_8">> Clocks none [{<<124,245,8,174,26,199,201,241,0,154,156,182>>,{1,63867293695}}]
  | Jul 4, 2024 @ 17:41:14.341 | Prompted read repair Bucket=<<"bucket_1">> Key=<<"key_9">> Clocks none [{<<109,136,98,215,109,134,234,195,0,119,187,209>>,{1,63867708430}}]
martinsumner commented 3 months ago

A clock of none means the key doesn't exist in one vnode. There's only one vnode in the other clock, and this has a counter of 1 - so this is a fresh write.

However, the timestamp in the clock is not as expected:

calendar:gregorian_seconds_to_datetime(63868219612).
{{2023,11,26},{12,6,52}}

So this is a key that has a single update (the initial write), made on 26th November 2023 at 12:06:52 - that somehow isn't in the other vnode.

Might the key have been deleted recently? Do you expect the key to exist? Did something odd happen late November (most of these timestamps are from similar date/times).

martinsumner commented 3 months ago

Sorry, I ignored your question about the recovery process.

The recovery process is read repair, and this is logged at the start of that recovery process - so the AAE has discovered this difference between the clocks, and it will now prompt a GET of an object (internally, but using the same GET_FSM as an external fetch).

It is expected that the GET_FSM will discover the same difference, and at the end of that process it will prompt any vnode that doesn't have the object (i.e. the one with a clock of none) to take the retrieved object.

There are some stats in riak admin status - read_repairs/skipped_read_repairs - and these should start incrementing as the repair process resolves the differences.

Obviously if something is broken with the repair process related to the state of these objects, then if you wait you will see a repeating pattern of the same keys being passed to repair.

azowarg commented 3 months ago

Might the key have been deleted recently? Do you expect the key to exist? Did something odd happen late November (most of these timestamps are from similar date/times).

We never delete keys, so the key should be present in the store. I don't know whether the specific dates are a clue, since I found a lot of different timestamps. Just for example:

(riak@riak-0.riak-headless)5> calendar:gregorian_seconds_to_datetime(63853864562).
{{2023,6,13},{8,36,2}}
(riak@riak-0.riak-headless)6> calendar:gregorian_seconds_to_datetime(63865469412).
{{2023,10,25},{16,10,12}}
(riak@riak-0.riak-headless)7> calendar:gregorian_seconds_to_datetime(63868333718).
{{2023,11,27},{19,48,38}}
(riak@riak-0.riak-headless)8> calendar:gregorian_seconds_to_datetime(63864702431).
{{2023,10,16},{19,7,11}}
(riak@riak-0.riak-headless)9> calendar:gregorian_seconds_to_datetime(63858361067).
{{2023,8,4},{9,37,47}}
(riak@riak-0.riak-headless)10> calendar:gregorian_seconds_to_datetime(63852677904).
{{2023,5,30},{14,58,24}}
(riak@riak-0.riak-headless)11> calendar:gregorian_seconds_to_datetime(63846658343).
{{2023,3,21},{22,52,23}}
(riak@riak-0.riak-headless)12> calendar:gregorian_seconds_to_datetime(63848078575).
{{2023,4,7},{9,22,55}}

Stats from riak admin status | grep read_repair seem to look normal. The counters increase slowly over time, but for some reason the *_outofdate_* counters have "undefined" values. Is that OK?

read_repairs : 0
read_repairs_counter : 0
read_repairs_counter_total : 0
read_repairs_fallback_notfound_count : 123629
read_repairs_fallback_notfound_one : 0
read_repairs_fallback_outofdate_count : undefined
read_repairs_fallback_outofdate_one : undefined
read_repairs_hll : 0
read_repairs_hll_total : 0
read_repairs_map : 0
read_repairs_map_total : 0
read_repairs_primary_notfound_count : 213861
read_repairs_primary_notfound_one : 0
read_repairs_primary_outofdate_count : undefined
read_repairs_primary_outofdate_one : undefined
read_repairs_set : 0
read_repairs_set_total : 0
read_repairs_total : 336530
skipped_read_repairs : 0
skipped_read_repairs_total : 0

As for something odd and abnormal: I can remember a situation where one of the nodes could not start successfully after a rolling restart. This was not the only such occurrence in the lifetime of the cluster. The node that could not start the riak process produced log output like this:

Failed to start riak_kv_eleveldb_backend backend for index 43388329858261180384171893471264652146035497

As the solution, we decided to remove the broken index from the data directory and shift the responsibility for index recovery to the anti-entropy processes. By "remove" I mean something like the command find ./ -name 43388329858261180384171893471264652146035497 | xargs -n1 rm -r for the example above, so the index was removed from both the tictac_aae and leveldb directories. After removing the corrupted index, the node became available and continued working.

martinsumner commented 3 months ago

The really odd thing is that the fallback count > 0. This implies that when you're doing the read repair, there are two primary nodes responding and one fallback. In a healthy cluster we would expect only primaries to respond.

What does riak admin cluster status say in your cluster? (It may be worth validating that all nodes say the same thing.)

azowarg commented 3 months ago

Installed versions:
* 3.2.2-nhse    permanent
Ring ready: true

+-------------------------------+------+-------+-----+-------+
|             node              |status| avail |ring |pending|
+-------------------------------+------+-------+-----+-------+
|     riak@riak-0.riak-headless |valid |  up   | 15.6|  --   |
| (C) riak@riak-1.riak-headless |valid |  up   | 15.6|  --   |
|     riak@riak-2.riak-headless |valid |  up   | 17.2|  --   |
|     riak@riak-3.riak-headless |valid |  up   | 17.2|  --   |
|     riak@riak-4.riak-headless |valid |  up   | 17.2|  --   |
|     riak@riak-5.riak-headless |valid |  up   | 17.2|  --   |
+-------------------------------+------+-------+-----+-------+

Key: (C) = Claimant; availability marked with '!' is unexpected
ok

The output is the same for all 6 nodes. Perhaps the fallback counters are from the period of the join/leave node testing. Right now the counters are not increasing.

After it became a cluster member, we started to exclude the remaining nodes from the cluster one by one: we had a node leave the cluster and, after the ring rebuilt successfully, joined it to the cluster again. The first node finished the leave/join process in 4 days, and the remaining ones took approximately 24-36 hours each.

martinsumner commented 3 months ago

So none of the read repair counters are increasing? i.e. AAE is prompting a repair, but then a repair doesn't happen?

There is a second environment variable - riak_kv read_repair_log. If you set this to true, this will provide an additional log in the read repair process - https://github.com/nhs-riak/riak_kv/blob/nhse-develop-3.4/src/riak_kv_get_fsm.erl#L713-L717. If you turn that on, and then don't see any such logs, this will confirm that the prompt of read repair doesn't then trigger read repair.
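e.g. from riak remote_console (on each node), as with log_readrepair above - again assuming the flag is read at run time:

application:set_env(riak_kv, read_repair_log, true).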

azowarg commented 3 months ago

I'm sorry for the confusion, I mean only primary counters are increasing right now. Fallbacks are still the same.

azowarg commented 3 months ago

The new log seems normal. Many keys across different vnodes.

Read repair of {<<"bucket">>,<<"key_1">>} on 1415829711164312202009819681693899175291684651008 'riak@riak-3.riak-headless' for reason notfound
Read repair of {<<"bucket">>,<<"key_2">>} on 1415829711164312202009819681693899175291684651008 'riak@riak-3.riak-headless' for reason notfound
Read repair of {<<"bucket">>,<<"key_3">>} on 1415829711164312202009819681693899175291684651008 'riak@riak-3.riak-headless' for reason notfound
Read repair of {<<"bucket">>,<<"key_4">>} on 1415829711164312202009819681693899175291684651008 'riak@riak-3.riak-headless' for reason notfound
Read repair of {<<"bucket">>,<<"key_5">>} on 1415829711164312202009819681693899175291684651008 'riak@riak-3.riak-headless' for reason notfound
Read repair of {<<"bucket">>,<<"key_6">>} on 1415829711164312202009819681693899175291684651008 'riak@riak-3.riak-headless' for reason notfound
martinsumner commented 3 months ago

So one possibility is that repair is working, and it is just taking time to work through all the differences - as opposed to just churning through the same keys over and over again.

So do you see the same keys repeatedly appearing in the logs, or is each key being returned for repair once and only once?

There is a log EX003 which will be in the erlang.log file (we're interested in the case where in_sync is false). For each exchange it records the mismatched_segments as well as the keys repaired. This mismatched_segments is an estimate of how much repair work is still required to be done. So we want this value to be trending downwards ... but it is tricky to track, as different partitions will be in different states of disrepair, and it doesn't log which partition it is doing at which time.

If we can establish that this is working, but just taking time, the time can be reduced by a couple of ways (most notably by triggering range_repair). However, if it isn't working (it is just the same keys recycling), then this is a different scenario to troubleshoot.
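For reference, a range repair can be prompted from riak remote_console with an aae_fold of the form used earlier in this thread (the bucket name is a placeholder):

riak_client:aae_fold({repair_key_range, <<"bucket">>, all, all, all}).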

martinsumner commented 3 months ago

If you want to correlate the mismatched_segments to a given vnode exchange using the exchange ID present in the EX003 log, there is a debug log which can be enabled - https://github.com/nhs-riak/riak_kv/blob/nhse-develop-3.4/src/riak_kv_tictacaae_repairs.erl#L108-L110.

azowarg commented 2 months ago

I enabled the log you mentioned above. However, I am not sure that I interpreted the information correctly. I filtered the exchange IDs for one direction of one pair of vnodes. For example:

Jul 9, 2024 @ 14:09:27.846 | Exchange prompted with exchange_id=9e93291e-7f0f-4147-8b4f-724b5387de44 between {182687704666362864775460604089535377456991567872,'riak@riak-0.riak-headless'} and {137015778499772148581595453067151533092743675904,'riak@riak-2.riak-headless'}
Jul 9, 2024 @ 14:08:53.028 | Exchange prompted with exchange_id=6e950f51-c3e7-4aa9-acae-bf742f8923a2 between {182687704666362864775460604089535377456991567872,'riak@riak-0.riak-headless'} and {137015778499772148581595453067151533092743675904,'riak@riak-2.riak-headless'}
Jul 9, 2024 @ 14:08:14.045 | Exchange prompted with exchange_id=2031ed35-9222-44fc-b922-2413f80651ae between {182687704666362864775460604089535377456991567872,'riak@riak-0.riak-headless'} and {137015778499772148581595453067151533092743675904,'riak@riak-2.riak-headless'}
Jul 9, 2024 @ 14:06:43.972 | Exchange prompted with exchange_id=6e741e4d-c905-4912-8566-edb35a929ac9 between {182687704666362864775460604089535377456991567872,'riak@riak-0.riak-headless'} and {137015778499772148581595453067151533092743675904,'riak@riak-2.riak-headless'}
Jul 9, 2024 @ 14:05:10.814 | Exchange prompted with exchange_id=670facbf-66c4-4536-838e-0dcc5ebb676c between {182687704666362864775460604089535377456991567872,'riak@riak-0.riak-headless'} and {137015778499772148581595453067151533092743675904,'riak@riak-2.riak-headless'}

Then I filtered for them in erlang.log:

root@riak-0:/var/lib/riak# grep -F -f 123 ../../log/riak/erlang.log.1 | grep in_sync=false  | awk '{print $2, $13, $16, $19, $22}'
2024-07-09T08:06:39.029 pending_state=clock_compare id=a2662ac3-5c2a-4d7a-9ed6-ba55696a64ed mismatched_segments=985364 keys_passed_for_repair=411
2024-07-09T08:07:59.188 pending_state=clock_compare id=462e0d2f-fe04-4ad5-b7db-843bd84e38cd mismatched_segments=987172 keys_passed_for_repair=0
2024-07-09T08:54:37.097 pending_state=clock_compare id=e6fedabf-52b8-42a1-a0e9-4c3d18623d38 mismatched_segments=963155 keys_passed_for_repair=304
2024-07-09T08:54:53.323 pending_state=clock_compare id=feb7e4ad-bd47-46c4-a3d3-b55e62729d87 mismatched_segments=1039457 keys_passed_for_repair=104
2024-07-09T08:55:08.056 pending_state=clock_compare id=2d1f2ffe-958f-4b5a-9d1b-3a230ef57818 mismatched_segments=990598 keys_passed_for_repair=97
2024-07-09T08:55:22.424 pending_state=clock_compare id=31a01a5d-b1fd-4488-a6fd-276d35920977 mismatched_segments=1039457 keys_passed_for_repair=118
2024-07-09T08:55:34.365 pending_state=clock_compare id=db90f922-2f2d-4449-9261-11ab1e3fda5d mismatched_segments=1039457 keys_passed_for_repair=94
2024-07-09T09:58:37.629 pending_state=clock_compare id=365f3e05-4482-4368-b200-228e9c8de43c mismatched_segments=854269 keys_passed_for_repair=553
2024-07-09T10:00:07.307 pending_state=clock_compare id=b7a27c6d-b3be-4ab7-a9d7-6aeefc7d64d8 mismatched_segments=761469 keys_passed_for_repair=352
2024-07-09T10:01:01.638 pending_state=clock_compare id=b268bfed-1796-4094-a736-0d9b41cdea0f mismatched_segments=917088 keys_passed_for_repair=256
2024-07-09T10:01:40.534 pending_state=clock_compare id=e5fae04f-dde5-455a-9948-5f388c633528 mismatched_segments=900336 keys_passed_for_repair=359
2024-07-09T10:02:38.947 pending_state=clock_compare id=bb9be016-13ec-45f2-a308-b94ea4b96836 mismatched_segments=1039457 keys_passed_for_repair=346
2024-07-09T10:30:35.603 pending_state=clock_compare id=ddafb2dc-8a67-4133-86f2-e1679be1cd16 mismatched_segments=906936 keys_passed_for_repair=508
2024-07-09T10:32:21.332 pending_state=clock_compare id=83b99b7a-bd64-4977-b63c-e04681f63555 mismatched_segments=828127 keys_passed_for_repair=230
2024-07-09T10:33:14.240 pending_state=clock_compare id=039793dc-1b6b-4a16-8a62-d79820349ad1 mismatched_segments=1039457 keys_passed_for_repair=360
2024-07-09T10:33:48.976 pending_state=clock_compare id=76c8200c-b30f-47ff-9cc0-68d448f83885 mismatched_segments=1039299 keys_passed_for_repair=385
2024-07-09T10:34:18.032 pending_state=clock_compare id=1eaed470-7731-49cf-9aa6-6b430a75218c mismatched_segments=690370 keys_passed_for_repair=66
2024-07-09T11:18:36.318 pending_state=clock_compare id=1756c0e4-e07b-4887-9b7e-e4c552e0d531 mismatched_segments=837962 keys_passed_for_repair=551
2024-07-09T11:19:13.093 pending_state=clock_compare id=409c02cc-cccd-4bc0-a12b-2c480b0e126f mismatched_segments=877715 keys_passed_for_repair=131
2024-07-09T11:19:47.503 pending_state=clock_compare id=d6bccf7a-698a-4cf4-9aea-7b99c05d6307 mismatched_segments=1013600 keys_passed_for_repair=301
2024-07-09T11:20:06.434 pending_state=clock_compare id=4cc2103b-ec04-44f2-93d0-51f52f1a2e2f mismatched_segments=919943 keys_passed_for_repair=38
2024-07-09T12:22:41.647 pending_state=clock_compare id=c336dbc9-bb64-4bdc-b0c0-052186888870 mismatched_segments=944373 keys_passed_for_repair=277
2024-07-09T12:23:58.234 pending_state=clock_compare id=6dff7121-e094-499d-85b0-6a7ec4bb104a mismatched_segments=1039330 keys_passed_for_repair=117
2024-07-09T12:25:10.992 pending_state=clock_compare id=fc4b9070-1f7b-4b33-87b0-a3731b1be797 mismatched_segments=1039426 keys_passed_for_repair=89
2024-07-09T12:25:24.430 pending_state=clock_compare id=2894df2b-b899-47d5-a0bf-69413874edad mismatched_segments=784947 keys_passed_for_repair=95
2024-07-09T12:25:54.977 pending_state=clock_compare id=34de1a02-125e-488e-bbe5-da6d2387159a mismatched_segments=1039426 keys_passed_for_repair=93
2024-07-09T13:34:33.017 pending_state=clock_compare id=2110f771-2dbb-4347-9b51-3652af1bab33 mismatched_segments=591954 keys_passed_for_repair=285
2024-07-09T13:34:48.154 pending_state=clock_compare id=b31e5d4e-d7fc-48f9-88a1-548ab2181501 mismatched_segments=718162 keys_passed_for_repair=118
2024-07-09T13:35:02.519 pending_state=clock_compare id=7ae99cc5-7eb2-42dd-97e1-7f567e15adc8 mismatched_segments=1039204 keys_passed_for_repair=86
2024-07-09T13:35:12.635 pending_state=clock_compare id=2595be80-9f01-4c35-aeee-aa3c2f6bbc10 mismatched_segments=798716 keys_passed_for_repair=53
2024-07-09T13:37:18.388 pending_state=clock_compare id=36e0f139-96d4-42b7-a4bb-c56b78cfcecc mismatched_segments=968263 keys_passed_for_repair=1035
2024-07-09T14:06:43.972 pending_state=clock_compare id=670facbf-66c4-4536-838e-0dcc5ebb676c mismatched_segments=751539 keys_passed_for_repair=620
2024-07-09T14:08:14.046 pending_state=clock_compare id=6e741e4d-c905-4912-8566-edb35a929ac9 mismatched_segments=775683 keys_passed_for_repair=370
2024-07-09T14:08:53.028 pending_state=clock_compare id=2031ed35-9222-44fc-b922-2413f80651ae mismatched_segments=1037363 keys_passed_for_repair=384
2024-07-09T14:09:27.846 pending_state=clock_compare id=6e950f51-c3e7-4aa9-acae-bf742f8923a2 mismatched_segments=1037395 keys_passed_for_repair=375
2024-07-09T14:09:56.957 pending_state=clock_compare id=9e93291e-7f0f-4147-8b4f-724b5387de44 mismatched_segments=769211 keys_passed_for_repair=85

Do I understand correctly that the mismatched_segments value should change more linearly and not have these "bounces"? Or is this OK?

martinsumner commented 2 months ago

I would expect it to be more linear than this, but mismatched_segments is an estimate, so it can't be guaranteed. However, this doesn't look particularly healthy.

Do you have multiple pairs of vnodes with the same behaviour on exchanges?

martinsumner commented 2 months ago

For the read repairs prompted by this exchange (i.e. vnode pair) is the not_found always the same vnode of the two? Is that a vnode you expect to be being repaired by AAE?

martinsumner commented 2 months ago

If we assume that this is just the AAE repair taking a long time, you can accelerate repair for a vnode that you deleted using partition repair - https://docs.riak.com/riak/kv/latest/using/repair-recovery/repairs/index.html#repairing-a-single-partition (use riak remote_console, not riak attach). This is a lot quicker than AAE.
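For reference, a sketch of the documented single-partition repair, run from riak remote_console (the partition id below is just one of the indexes from your earlier read repair logs):

riak_kv_vnode:repair(1415829711164312202009819681693899175291684651008).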

azowarg commented 2 months ago

@martinsumner Thanks for your advice. We finally fixed the data corruption in our cluster. As you suggested above, we manually ran the partition repair function on several of the partitions. After that, we deleted the old tictacaae directories to force a rebuild of the AAE trees and to free some disk space. After the rebuild completed successfully, the remaining delta was eliminated within 3 days. All exchanges are now in the state in_sync=true pending_state=root_compare, memory consumption has stopped growing, and riak has slowly started to free used memory. Now the tictacaae directory is on average about half the size of the leveldb directory.