basho / riak_kv

Riak Key/Value Store
Apache License 2.0

Tictac AAE object_stats shows various statistics over a period of time. #1874

Open TeadRIM opened 1 year ago

TeadRIM commented 1 year ago

Hi. Some context: we use Riak version 3.0.12, the leveldb backend, ring_size = 64, n_val = 3, and 5 riak nodes. We tried switching to Tictac AAE in order to use NextGen replication. The following was added to the config:

       anti_entropy = passive
       tictacaae_active = active
       tictacaae_storeheads = enabled

After the Tictac AAE trees were built, we observed varying object statistics via riak_client:aae_fold({object_stats, <<"domainRecord">>, all, all}). on immutable buckets (no external traffic enters Riak). There are more than 10 million objects in the bucket, and object_stats shows a slightly different value each time; erase_keys also gives differing statistics. For example (the request was repeated at 5-minute intervals):

riak_client:aae_fold({object_stats, <<"any_bucket">>, all, {date, {{2023, 8, 5}, {12, 0, 0}}, {{2023, 8, 5}, {13, 0, 0}}}}).
{ok,[{total_count,3652},
      {total_size,19184471},
      {sizes,[{2,105},{3,3156},{4,391}]},
      {siblings,[{1,3652}]}]}
riak_client:aae_fold({object_stats, <<"any_bucket">>, all, {date, {{2023, 8, 5}, {12, 0, 0}}, {{2023, 8, 5}, {13, 0, 0}}}}).
{ok,[{total_count,3645},
      {total_size,19145341},
      {sizes,[{2,105},{3,3149},{4,391}]},
      {siblings,[{1,3645}]}]}
riak_client:aae_fold({object_stats, <<"any_bucket">>, all, {date, {{2023, 8, 5}, {12, 0, 0}}, {{2023, 8, 5}, {13, 0, 0}}}}).
{ok,[{total_count,3643},
      {total_size,19132325},
      {sizes,[{2,105},{3,3147},{4,391}]},
      {siblings,[{1,3643}]}]}
riak_client:aae_fold({object_stats, <<"any_bucket">>, all, {date, {{2023, 8, 5}, {12, 0, 0}}, {{2023, 8, 5}, {13, 0, 0}}}}).
{ok,[{total_count,3650},
      {total_size,19171870},
      {sizes,[{2,105},{3,3154},{4,391}]},
      {siblings,[{1,3650}]}]}

Over a longer period of time the difference is greater. Also, when real-time replication is enabled, object_stats on the sink cluster and the source cluster differ. There are no errors in the logs. Can you give any comments? Is this normal behaviour? Without tools for checking data integrity, we cannot be sure that consistency between the two clusters will be maintained, so we cannot rely on replication.

martinsumner commented 1 year ago

This isn't normal or expected behaviour.

A couple of things might clarify what is occurring:

TeadRIM commented 1 year ago

Thanks for the quick response

  1. The log looks like
     Tictac AAE loop completed for partition=570899077082383952423314387779798054553098649600 with exchanges expected=6 exchanges completed=6 total deltas=0 total exchange_time=5 seconds loop duration=3360 seconds (elapsed)
     Tictac AAE loop completed for partition=844930634081928249586505293914101120738586001408 with exchanges expected=6 exchanges completed=6 total deltas=0 total exchange_time=6 seconds loop duration=3360 seconds (elapsed)
     Tictac AAE loop completed for partition=936274486415109681974235595958868809467081785344 with exchanges expected=6 exchanges completed=6 total deltas=0 total exchange_time=6 seconds loop duration=3360 seconds (elapsed)
  2. There are no key amnesia messages in the logs.
  3. We did not redefine delete_mode. We use secondary indices and we do not use ttl objects and do not delete data. The situation with objects also occurs with a cluster to which third-party services are not connected.
  4. fetch_clocks_range gives different results: some objects remain, some disappear, and new ones appear. A key may disappear and then reappear after some time.
martinsumner commented 1 year ago

On (4), is there a pattern to which keys disappear/reappear? Is it the same subset of keys that vary in terms of their presence in responses, or does it seem that any key can disappear/reappear at random?

TeadRIM commented 1 year ago

I made 5 attempts. Some keys are always present; the rest continue their strange behaviour.

martinsumner commented 1 year ago

What proportion are always present, and what proportion exhibit the strange behaviour?

For a key that has the strange behaviour, when it is returned is the vector clock always the same? Can you share an example of the clock returned (by fetch_clocks_range) for an always present key, and a varying key?

TeadRIM commented 1 year ago

Here is an example of a key request that is always present.

      {<<"bucket">>,<<"alwayspresentkey1">>,
       [{<<124,245,8,174,100,178,194,135>>,{11,63850161605}},
        {<<197,14,211,70,140,232,113,37>>,{3,63850161605}},
        {<<212,115,185,63,208,83,228,107,0,...>>,{16,63850161605}}]},
---
      {<<"bucket">>,<<"alwayspresentkey1">>,
       [{<<124,245,8,174,100,178,194,135>>,{11,63850161605}},
        {<<197,14,211,70,140,232,113,37>>,{3,63850161605}},
        {<<212,115,185,63,208,83,228,107,0,46,49,...>>,
         {16.63850161605}}]},
---
      {<<"bucket">>,<<"alwayspresentkey1">>,
       [{<<124,245,8,174,100,178,194,135>>,{11,63850161605}},
        {<<197,14,211,70,140,232,113,37>>,{3,63850161605}},
        {<<212,115,185,63,208,83,228,107,0,46,49,72>>,
         {16.63850161605}}]},
---
      {<<"bucket">>,<<"alwayspresentkey1">>,
       [{<<124,245,8,174,100,178,194,135>>,{11,63850161605}},
        {<<197,14,211,70,140,232,113,37>>,{3,63850161605}},
        {<<"\324s\271?\320S\344k"...>>,{16.63850161605}}]},
---
      {<<"bucket">>,<<"alwayspresentkey1">>,
       [{<<124,245,8,174,100,178,194,135>>,{11,63850161605}},
        {<<197,14,211,70,140,232,113,37>>,{3,63850161605}},
        {<<212,115,185,63,208,83,228,107,0,46,49,...>>,
         {16.63850161605}}]},
---
      {<<"bucket">>,<<"alwayspresentkey2">>,
       [{<<124,245,8,174,100,178,194,135,0,41,232,223>>,
         {1.63850161580}},
        {<<109,136,98,215,138,200,26,221>>,{24,63850161605}},
        {<<212,115,185,63,164,208,25,139>>,{5,63850161605}}]},
---
      {<<"bucket">>,<<"alwayspresentkey2">>,
       [{<<124,245,8,174,100,178,194,135,0,41,232,223>>,
         {1.63850161580}},
        {<<109,136,98,215,138,200,26,221>>,{24,63850161605}},
        {<<212,115,185,63,164,208,25,139>>,{5,63850161605}}]},

Here are the keys that behave strangely

      {<<"bucket">>,<<"disappearingkey1">>,
       [{<<212,115,185,63,126,252,212,117,0,12,118,8>>,
         {24.63831325343}},
        {<<0.151,113.31>>,{1.63831415122}},
        {<<2,226,39,102>>,{1,63831415122}}]},
---
      {<<"bucket">>,<<"disappearingkey1">>,
       [{<<212,115,185,63,126,252,212,117,0,12,118,8>>,
         {24.63831325343}},
        {<<0.151,113.31>>,{1.63831415122}},
        {<<2,226,39,102>>,{1,63831415122}}]},
---
      {<<"bucket">>,<<"disappearingkey2">>,
       [{<<231,244,7,180,224,103,27,26>>,{1,63831325336}},
        {<<212,115,185,63,167,202,161,81,0,12,194,40>>,
         {21.63831325361}},
        {<<197,14,211,70,159,252,140,165>>,{4,63831325561}},
        {<<2,226,39,102>>,{2,63831415122}}]},
---
      {<<"bucket">>,<<"disappearingkey2">>,
       [{<<231,244,7,180,224,103,27,26>>,{1,63831325336}},
        {<<212,115,185,63,167,202,161,81,0,12,194,40>>,
         {21.63831325361}},
        {<<197,14,211,70,159,252,140,165>>,{4,63831325561}},
        {<<2,226,39,102>>,{2,63831415122}}]},

I didn't notice that the strange keys were any different

martinsumner commented 1 year ago

Normally I expect to see a VCLOCK entry as something like:

{<<212,115,185,63,167,202,161,81,0,12,194,40>>,{21,63831325361}}

which is {vnode_id, {Counter, Timestamp}}

but you have some like:

{<<212,115,185,63,167,202,161,81,0,12,194,40>>,{21.63831325361}}

so a "," has become a "." between the counter and the timestamp - is this just a transcribing error?

Assuming that to be true, there are still some strange things about the clocks on the disappearing keys:

All of the above may be red herrings. Non-standard PUT paths may generate odd vnode ids (the write-once path does, though not, I believe, of this form).

Getting a few more disappearingkey clocks may help me understand whether this weirdness is relevant to the actual problem.

P.S. The timestamp of the coordinated change in the vector clock (e.g. 63831325561) can be converted back to a date using the calendar:gregorian_seconds_to_datetime/1 function in Erlang.
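For instance, a quick check from an Erlang shell, using 63831325561 from one of the clocks above:

```erlang
%% Convert a vector clock timestamp (gregorian seconds) back to a
%% calendar date and time.
1> calendar:gregorian_seconds_to_datetime(63831325561).
{{2022,9,25},{11,46,1}}
```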

martinsumner commented 1 year ago

Tomorrow, if I get a chance, I'm going to try to extend one of the riak_test tests to recreate this instability.

I have seen this feature tested in prod environments with O(1bn) keys, producing reliable and predictable results, so I don't think the feature is fundamentally flawed - but there may be a sequence of events, combined with particular feature combinations, that is exposing a bug.

Anything you can suggest that might make this test more realistic to your environment would be helpful: e.g. what PUT options you use, what bucket properties you have, any notable events that may have occurred in the past (e.g. particular failures), and any uncommon features you use.

TeadRIM commented 1 year ago

I noticed that the missing keys all have approximately the same time, 26 Sep 2022, regardless of what period of time I request. The first time I requested keys for May:
riak_client:aae_fold({fetch_clocks_range, <<"bucket">>, all, all, {date, {{2023, 5, 1}, {12, 0, 5}}, {{2023, 5, 1}, {12, 0, 6}}}}).
The second time I requested April:
riak_client:aae_fold({fetch_clocks_range, <<"bucket">>, all, all, {date, {{2023, 4, 1}, {12, 0, 5}}, {{2023, 4, 1}, {12, 0, 6}}}}).
And I saw the same keys as in May

---
      {<<"bucket">>,<<"disappearingkey1">>,
       [{<<212,115,185,63,126,252,212,117,0,12,118,8>>,
         {24.63831325343}},
        {<<0.151,113.31>>,{1.63831415122}},
        {<<2,226,39,102>>,{1,63831415122}}]},
--- And keys dated September
      {<<"bucket">>,<<"disappearingkey3">>,
       [{<<109,136,98,215,138,200,26,221>>,{2,63831325679}},
        {<<212,115,185,63,164,208,25,139,0,12,35,33>>,
         {22.63831325683}},
        {<<0.151,113.31>>,{1.63831415123}},
        {<<2,226,39,102>>,{1,63831415123}}]},
--- And keys for the requested period of time
      {<<"bucket">>,<<"alwayspresentkey3">>,
       [{<<197,14,211,70,159,252,140,165>>,{4,63850161585}},
        {<<231,244,7,180,224,103,27,26>>,{2,63850161605}},
        {<<109,136,98,215,138,198,59,...>>,{24,63850161606}}]},

Mostly the disappearing keys belong to <<2,226,39,102>>, but there are others:

      {<<"bucket">>,<<"disappearingkey4">>,
       [{<<"\324s\271?y\276e\253">>,{1.63831325470}},
        {<<109,136,98,215,109,134,234,195>>,{1,63831412730}},
        {<<124,245,8,174,26,199,201,241,0,10,124,88>>,
         {24.63831412730}}]},
---
      {<<"bucket">>,<<"disappearingkey4">>,
       [{<<197,14,211,70,73,139,121,125,0,9,251,104>>,
         {24.63831325449}},
        {<<0.151,113.31>>,{1.63831415123}},
        {<<"\324s\271?~\374\324u">>,{1.63831415123}}]},

Not sure if this is related, but there is a LOCK file whose timestamp coincides:

  -rw-rw-r--. 1 riak riak 0 Sep 26 2022 leveldb/228359630832953580969325755111919221821239459840/LOCK

In addition to this, there are also several for July:

-rw-rw-r--. 1 riak riak 0 Jul 6 2022 leveldb/1050454301831586472458898473514828420377701515264/LOCK
-rw-rw-r--. 1 riak riak 0 Jul 6 2022 leveldb/1141798154164767904846628775559596109106197299200/LOCK
-rw-rw-r--. 1 riak riak 0 Jul 6 2022 leveldb/1233142006497949337234359077604363797834693083136/LOCK
-rw-rw-r--. 1 riak riak 0 Jul 6 2022 leveldb/1324485858831130769622089379649131486563188867072/LOCK
-rw-rw-r--. 1 riak riak 0 Jul 6 2022 leveldb/137015778499772148581595453067151533092743675904/LOCK
-rw-rw-r--. 1 riak riak 0 Jul 6 2022 leveldb/1415829711164312202009819681693899175291684651008/LOCK

I am not sure what happened in September. But in July 2022 we moved from gp2 to gp3, re-creating the riak nodes. In May 2023 the riak version was updated from 2.9.0 to 3.0.12. We have had cases where a riak node failed with the log:

{"@severity":"error","@timestamp":"2022-02-10T13:19:58.327336Z","message":"Failed to start riak_kv_eleveldb_backend backend for index 43388329858261180384171893471264652146035497 "}

Typically we simply remove the failing index. Maybe this is the wrong approach.

martinsumner commented 1 year ago

If you fetch the disappearing keys via an HTTP request, what is the Last-Modified date on the object?

TeadRIM commented 1 year ago

< Last-Modified: Mon, 26 Sep 2022 11:58:50 GMT

martinsumner commented 1 year ago

I looked at the underlying code of the leveled store, as to how it qualifies something as being within the last modified range:

https://github.com/martinsumner/leveled/blob/develop-3.0/src/leveled_penciller.erl#L1712-L1715

If the object in the parallel AAE store has an LMD of undefined, it will be returned by any query regardless of the last-modified-date range used. However, there is also an "accelerator" for the query, which skips over any file in the tree if the highest last-modified date in the file is before the low end of the query's range. Whether this skip occurs can vary, so if an object reference has somehow ended up in a leveled backend with an LMD of undefined, it could in theory appear intermittently in query responses.
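As a minimal sketch of the qualification rule just described (illustrative only, not the actual leveled_penciller code):

```erlang
%% Sketch: an undefined last-modified-date qualifies for ANY range, so
%% such a key is eligible for every date-bounded fold; whether it then
%% actually appears depends on the file-level "accelerator" skip.
-module(lmd_sketch).
-export([in_range/3]).

in_range(undefined, _LowDate, _HighDate) -> true;
in_range(LMD, LowDate, HighDate) ->
    LMD >= LowDate andalso LMD =< HighDate.
```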

What I don't know is how it might come to be undefined in the first place. I think the undefined state was added for backwards compatibility with an older object format - but since you only recently switched on Tictac AAE, I don't see why that would be an issue.

Also, the AAE stores are periodically rebuilt based on the tictacaae_rebuildwait config setting (which is in hours).

I will have a think about what else to try. It may be worth touching a disappearing object (i.e. GET it, then PUT it back as-is). This should reset the LMD, and it would be interesting to see if that stops it appearing in the wrong range.
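Such a touch might look like this from riak remote_console (a sketch using the same riak_client calls as elsewhere in this thread; the bucket and key are placeholders, and riak_client:put/2 taking the object then the client is an assumption to verify against your build):

```erlang
%% Hypothetical "touch": GET the object and PUT it back unchanged,
%% which should stamp a fresh last-modified-date on the stored copy.
{ok, C} = riak:local_client().
{ok, Obj} = riak_client:get(<<"bucket">>, <<"disappearingkey1">>, C).
ok = riak_client:put(Obj, C).
```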

The other thing would be to test the object, to see if the function that extracts the LMD works correctly on it. You could do this via riak remote_console (obviously using your Bucket and a disappearing Key):

K1 = <<"K000001">>.
B = <<"test_bucket">>.
{ok, C} = riak:local_client().
AAEFun = riak_object:aae_from_object_binary(true).
element(4, AAEFun(riak_object:to_binary(v1, element(2, riak_client:get(B, K1, C))))).
TeadRIM commented 1 year ago

I tried it on several disappearing keys and got

(riak@riak-0.riak-headless)11> element(4, AAEFun(riak_object:to_binary(v1, element(2, riak_client:get(B, K1, C))))).
[{1664,195922,783402}]

(riak@riak-0.riak-headless)13> element(4, AAEFun(riak_object:to_binary(v1, element(2, riak_client:get(B, K2, C))))).
[{1664,195922,923965}]
martinsumner commented 1 year ago

Which is the correct LMD:

calendar:now_to_datetime({1664,195922,783402}).               
{{2022,9,26},{12,38,42}}

So this doesn't explain why we could get an undefined LMD.

Can you confirm you have the default value for tictacaae_rebuildwait in riak.conf (which should be 336 hours)? Also, could you try a GET/PUT of one of the disappearing keys and see if this resolves the behaviour for that key?

martinsumner commented 1 year ago

This might be related. There are two binary formats for Riak objects - v0 and v1. Before storing an object, Riak checks that the cluster supports v1:

https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_vnode.erl#L4094-L4101

This caused problems in leveled, which only supports v1 of the binary format. The way capabilities are negotiated could lead to v0 sometimes being chosen during certain failure scenarios (hence, with the leveled backend, the format is now forced to always be v1).

The eleveldb backend doesn't have that protection. So possibly, during an incident, an object could be stored in the v0 binary format. If that happens, the binary format doesn't include the last-modified-date, so when the update is put into the AAE backend this (and other) metadata might end up missing:

element(4, AAEFun(riak_object:to_binary(v0, element(2, riak_client:get(B, K1, C))))).
undefined

So I suspect this is what happened here: you have some objects stored during an incident on 26th September 2022, and they now exist in the eleveldb backend in v0 format. So when the Tictac AAE stores are built, those objects are added to the key store with an LMD of undefined - hence the unpredictable behaviour when running aae_folds with modified-date ranges.

The periodic rebuilds of the AAE store won't help, as they will keep rebuilding based on decoding the v0 format. The only thing that will help is an inert update of the object, outside of the failure scenario, so that it is replaced in the backend with a v1 version.

martinsumner commented 1 year ago

I think there is an underlying issue with negotiating things like the object version via riak_core capabilities, especially as v0 support is now so legacy that there is no possibility of being in a mixed cluster with something requiring v0.

The riak.conf file sets the preferred object format:

https://github.com/basho/riak_kv/blob/riak_kv-3.0.16/priv/riak_kv.schema#L529-L548

However, the default format used when not all cluster members can confirm the v1 capability is v0:

https://github.com/basho/riak_kv/blob/riak_kv-3.0.16/src/riak_kv_app.erl#L193-L195

Really this should default to the configured setting, given that cluster support should be guaranteed now.

However, riak_object:aae_from_object_binary/1 needs to be updated to handle v0 objects better. If that were done, the problem with these objects would be resolved at the next Tictac AAE rebuild.

So really, in this scenario where we discover a v0 binary:

https://github.com/basho/riak_kv/blob/riak_kv-3.0.16/src/riak_object.erl#L1275-L1277

Rather than returning an incomplete set of metadata, including an undefined last-modified-date, we should instead convert to the v1 binary before decoding, i.e.:

summary_from_binary(to_binary(v1, T))
TeadRIM commented 1 year ago

Thank you for your help. Replying to your comment: tictacaae_rebuildwait is set to 336. After a GET/PUT, the key stopped behaving strangely.