TeadRIM opened this issue 1 year ago
This isn't normal or expected behaviour.
A couple of things which might clarify what is occurring:
delete_mode
fetch_clocks_range
fold - is there a difference between the folds, and if so is there any pattern to that difference?
Thanks for the quick response
Tictac AAE loop completed for partition=570899077082383952423314387779798054553098649600 with exchanges expected=6 exchanges completed=6 total deltas=0 total exchange_time=5 seconds loop duration=3360 seconds (elapsed)
Tictac AAE loop completed for partition=844930634081928249586505293914101120738586001408 with exchanges expected=6 exchanges completed=6 total deltas=0 total exchange_time=6 seconds loop duration=3360 seconds (elapsed)
Tictac AAE loop completed for partition=936274486415109681974235595958868809467081785344 with exchanges expected=6 exchanges completed=6 total deltas=0 total exchange_time=6 seconds loop duration=3360 seconds (elapsed)
on (4), is there a pattern to which keys disappear/reappear? Is it the same subset of keys that vary in terms of their presence in responses, or does it seem to be that any key can disappear/reappear at random?
I made 5 attempts. There are keys that are always present; the rest continue their strange behaviour.
What proportion are always present, and what proportion exhibit the strange behaviour?
For a key that has the strange behaviour, when it is returned is the vector clock always the same? Can you share an example of the clock returned (by fetch_clocks_range) for an always present key, and a varying key?
Here is an example of a key that is always present:
{<<"bucket">>,<<"alwayspresentkey1">>,
[{<<124,245,8,174,100,178,194,135>>,{11,63850161605}},
{<<197,14,211,70,140,232,113,37>>,{3,63850161605}},
{<<212,115,185,63,208,83,228,107,0,...>>,{16,63850161605}}]},
---
{<<"bucket">>,<<"alwayspresentkey1">>,
[{<<124,245,8,174,100,178,194,135>>,{11,63850161605}},
{<<197,14,211,70,140,232,113,37>>,{3,63850161605}},
{<<212,115,185,63,208,83,228,107,0,46,49,...>>,
{16.63850161605}}]},
---
{<<"bucket">>,<<"alwayspresentkey1">>,
[{<<124,245,8,174,100,178,194,135>>,{11,63850161605}},
{<<197,14,211,70,140,232,113,37>>,{3,63850161605}},
{<<212,115,185,63,208,83,228,107,0,46,49,72>>,
{16.63850161605}}]},
---
{<<"bucket">>,<<"alwayspresentkey1">>,
[{<<124,245,8,174,100,178,194,135>>,{11,63850161605}},
{<<197,14,211,70,140,232,113,37>>,{3,63850161605}},
{<<"\324s\271?\320S\344k"...>>,{16.63850161605}}]},
---
{<<"bucket">>,<<"alwayspresentkey1">>,
[{<<124,245,8,174,100,178,194,135>>,{11,63850161605}},
{<<197,14,211,70,140,232,113,37>>,{3,63850161605}},
{<<212,115,185,63,208,83,228,107,0,46,49,...>>,
{16.63850161605}}]},
---
{<<"bucket">>,<<"alwayspresentkey2">>,
[{<<124,245,8,174,100,178,194,135,0,41,232,223>>,
{1.63850161580}},
{<<109,136,98,215,138,200,26,221>>,{24,63850161605}},
{<<212,115,185,63,164,208,25,139>>,{5,63850161605}}]},
---
{<<"bucket">>,<<"alwayspresentkey2">>,
[{<<124,245,8,174,100,178,194,135,0,41,232,223>>,
{1.63850161580}},
{<<109,136,98,215,138,200,26,221>>,{24,63850161605}},
{<<212,115,185,63,164,208,25,139>>,{5,63850161605}}]},
Here are the keys that behave strangely:
{<<"bucket">>,<<"disappearingkey1">>,
[{<<212,115,185,63,126,252,212,117,0,12,118,8>>,
{24.63831325343}},
{<<0.151,113.31>>,{1.63831415122}},
{<<2,226,39,102>>,{1,63831415122}}]},
---
{<<"bucket">>,<<"disappearingkey1">>,
[{<<212,115,185,63,126,252,212,117,0,12,118,8>>,
{24.63831325343}},
{<<0.151,113.31>>,{1.63831415122}},
{<<2,226,39,102>>,{1,63831415122}}]},
---
{<<"bucket">>,<<"disappearingkey2">>,
[{<<231,244,7,180,224,103,27,26>>,{1,63831325336}},
{<<212,115,185,63,167,202,161,81,0,12,194,40>>,
{21.63831325361}},
{<<197,14,211,70,159,252,140,165>>,{4,63831325561}},
{<<2,226,39,102>>,{2,63831415122}}]},
---
{<<"bucket">>,<<"disappearingkey2">>,
[{<<231,244,7,180,224,103,27,26>>,{1,63831325336}},
{<<212,115,185,63,167,202,161,81,0,12,194,40>>,
{21.63831325361}},
{<<197,14,211,70,159,252,140,165>>,{4,63831325561}},
{<<2,226,39,102>>,{2,63831415122}}]},
I didn't notice that the strange keys were any different.
Normally I expect to see a VCLOCK entry as something like:
{<<212,115,185,63,167,202,161,81,0,12,194,40>>,{21,63831325361}}
which is {vnode_id, {Counter, Timestamp}}
but you have some like:
{<<212,115,185,63,167,202,161,81,0,12,194,40>>,{21.63831325361}}
so a "," has become a "." between the counter and the timestamp - is this just a transcribing error?
Assuming that to be true, there are still some strange things about the clocks on the disappearing keys:
<<212,115 ...
) are possible when there has been an issue with key amnesia.
All of the above may be red herrings. Non-standard PUT paths may generate funny vnode ids (the write-once path does, but not, I believe, of this form).
Getting a few more disappearingkey clocks may help me understand whether this weirdness is relevant to the actual problem.
P.S. The timestamp for the coordination of change in the vector clock (e.g. 63831325561) can be reversed using the calendar:gregorian_seconds_to_datetime/1 function in Erlang.
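For anyone following along without an Erlang shell handy, the same conversion can be done in Python. This is a minimal sketch (the helper name is mine; the constant 62167219200 is the number of Gregorian-calendar seconds at the Unix epoch, i.e. what calendar:datetime_to_gregorian_seconds({{1970,1,1},{0,0,0}}) returns):

```python
from datetime import datetime, timedelta, timezone

# Erlang's calendar module counts seconds from year 0 of the proleptic
# Gregorian calendar; the Unix epoch (1970-01-01T00:00:00Z) falls at
# 62167219200 Gregorian seconds.
EPOCH_GREGORIAN_SECONDS = 62167219200

def gregorian_seconds_to_datetime(gsec):
    """Mirror of Erlang's calendar:gregorian_seconds_to_datetime/1 (UTC)."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(seconds=gsec - EPOCH_GREGORIAN_SECONDS)

# A coordination timestamp from one of the disappearingkey clocks:
print(gregorian_seconds_to_datetime(63831325561))  # September 2022
# ...versus one from an alwayspresentkey clock, inside the queried range:
print(gregorian_seconds_to_datetime(63850161605))  # May 2023
```

Decoding the two families of timestamps this way shows the disappearing keys' clocks dating from around September 2022, while the always-present keys' clocks fall in the queried May 2023 window.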
Tomorrow, if I get a chance, I'm going to try and extend one of the riak_test tests to try to recreate this instability.
I have this feature tested in prod environments with O(1bn) keys producing reliable and predictable results, so I don't think the feature is fundamentally flawed - but there may be a sequence of events, combined with particular feature combinations, that is exposing a bug.
Anything you can suggest that might make this test more realistic to your environment would be helpful: e.g. what PUT options you use, what bucket properties you have, any noticeable events that may have occurred in the past (e.g. particular failures), and any uncommon features that you use.
I noticed that the missing keys all have approximately the same time, 26 Sep 2022, regardless of what period of time I request. The first time, I requested keys for May:
riak_client:aae_fold({fetch_clocks_range, <<"bucket">>, all, all, {date, {{2023, 5, 1}, {12, 0, 5}}, {{2023, 5, 1}, {12, 0, 6}}}}).
The second time I requested for March
riak_client:aae_fold({fetch_clocks_range, <<"bucket">>, all, all, {date, {{2023, 4, 1}, {12, 0, 5}}, {{2023, 4, 1}, {12, 0, 6}}}}).
And I saw the same keys as in May
---
{<<"bucket">>,<<"disappearingkey1">>,
[{<<212,115,185,63,126,252,212,117,0,12,118,8>>,
{24.63831325343}},
{<<0.151,113.31>>,{1.63831415122}},
{<<2,226,39,102>>,{1,63831415122}}]},
--- And keys dated September
{<<"bucket">>,<<"disappearingkey3">>,
[{<<109,136,98,215,138,200,26,221>>,{2,63831325679}},
{<<212,115,185,63,164,208,25,139,0,12,35,33>>,
{22.63831325683}},
{<<0.151,113.31>>,{1.63831415123}},
{<<2,226,39,102>>,{1,63831415123}}]},
--- And keys for the requested period of time
{<<"bucket">>,<<"alwayspresentkey3">>,
[{<<197,14,211,70,159,252,140,165>>,{4,63850161585}},
{<<231,244,7,180,224,103,27,26>>,{2,63850161605}},
{<<109,136,98,215,138,198,59,...>>,{24,63850161606}}]},
Mostly the disappearing keys belong to <<2,226,39,102>>, but there are others:
{<<"bucket">>,<<"disappearingkey4">>,
[{<<"\324s\271?y\276e\253">>,{1.63831325470}},
{<<109,136,98,215,109,134,234,195>>,{1,63831412730}},
{<<124,245,8,174,26,199,201,241,0,10,124,88>>,
{24.63831412730}}]},
---
{<<"bucket">>,<<"disappearingkey4">>,
[{<<197,14,211,70,73,139,121,125,0,9,251,104>>,
{24.63831325449}},
{<<0.151,113.31>>,{1.63831415123}},
{<<"\324s\271?~\374\324u">>,{1.63831415123}}]},
Not sure if this is related, but there is a LOCK file that coincides in time:
-rw-rw-r--. 1 riak riak 0 Sep 26 2022 leveldb/228359630832953580969325755111919221821239459840/LOCK
In addition to this, there are also some from July:
-rw-rw-r--. 1 riak riak 0 Jul 6 2022 leveldb/1050454301831586472458898473514828420377701515264/LOCK
-rw-rw-r--. 1 riak riak 0 Jul 6 2022 leveldb/1141798154164767904846628775559596109106197299200/LOCK
-rw-rw-r--. 1 riak riak 0 Jul 6 2022 leveldb/1233142006497949337234359077604363797834693083136/LOCK
-rw-rw-r--. 1 riak riak 0 Jul 6 2022 leveldb/1324485858831130769622089379649131486563188867072/LOCK
-rw-rw-r--. 1 riak riak 0 Jul 6 2022 leveldb/137015778499772148581595453067151533092743675904/LOCK
-rw-rw-r--. 1 riak riak 0 Jul 6 2022 leveldb/1415829711164312202009819681693899175291684651008/LOCK
Not sure what happened in September, but in July 2022 we moved from gp2 to gp3, re-creating the riak nodes. In May 2023 the riak version was updated from 2.9.0 to 3.0.12. We had cases when a riak node failed with the following log:
{"@severity":"error","@timestamp":"2022-02-10T13:19:58.327336Z","message":"Failed to start riak_kv_eleveldb_backend backend for index 43388329858261180384171893471264652146035497 "}
Typically we simply remove the failing index. Maybe this is the wrong approach.
If you fetch the disappearing keys via an HTTP request, what is the Last-Modified date on the object?
< Last-Modified: Mon, 26 Sep 2022 11:58:50 GMT
I looked at the underlying code of the leveled store, as to how it qualifies something as being within the last modified range:
https://github.com/martinsumner/leveled/blob/develop-3.0/src/leveled_penciller.erl#L1712-L1715
If the object in the parallel AAE store has an LMD of undefined, it will be returned in any query regardless of the last-modified-date range used. However, there is also an "accelerator" for the query, which skips over any file in the tree if the highest last-modified date in the file is before the low last-modified date in the query. Whether this skip occurs can be variable, so if somehow an object reference has ended up in a leveled backend with an LMD of undefined, it could in theory appear intermittently in query responses.
What I don't know is how it might get to be undefined in the first place. I think the undefined state was just added for backwards compatibility with an older object format - but if you recently switched on TictacAAE, I don't see why that would be an issue.
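That interaction can be sketched in miniature. This is an illustrative Python model, not the leveled_penciller code - all names are invented - but it shows the shape of the two checks: the per-object range test passes an undefined LMD unconditionally, while the per-file "accelerator" can skip the whole file, so the same key can come and go between runs.

```python
# Hypothetical model of leveled's last-modified-date filtering.

def object_in_range(lmd, low, high):
    # An undefined (None) LMD passes any modified-date range check.
    if lmd is None:
        return True
    return low <= lmd <= high

def file_skipped(file_max_lmd, low):
    # The query "accelerator": skip a whole file when its highest LMD
    # is below the low end of the queried range.
    return file_max_lmd is not None and file_max_lmd < low

# An object with LMD None matches any range when its file is scanned...
assert object_in_range(None, low=1000, high=2000)
# ...but whether its file is scanned at all depends on the max LMD
# recorded for that file, which is where the variability comes in:
assert file_skipped(file_max_lmd=500, low=1000)       # file skipped entirely
assert not file_skipped(file_max_lmd=None, low=1000)  # file scanned
```

Under this model, an undefined-LMD object sitting in a file whose recorded max LMD is below the query's low bound is invisible, while the same object in a freshly compacted or rebuilt file may be returned - matching the intermittent appearances seen above.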
Also, the AAE stores are periodically rebuilt based on the tictacaae_rebuildwait config setting (which is in hours).
I will have a think about what else to try. It may be worth touching a disappearing object (i.e. GET it then PUT it back as-is). This should reset the LMD, and it would be interesting to see if that stops it from appearing in the wrong range.
The other thing would be to test the object, to see if the function that extracts the LMD works correctly on it. If you could do this via riak remote_console (obviously, using your Bucket and a disappearing Key):
K1 = <<"K000001">>.  % substitute a disappearing Key
B = <<"test_bucket">>.  % substitute your Bucket
{ok, C} = riak:local_client().
AAEFun = riak_object:aae_from_object_binary(true).
element(4, AAEFun(riak_object:to_binary(v1, element(2, riak_client:get(B, K1, C))))).  % element 4 of the AAE summary is the LMD
I tried it on several disappearing keys and got
(riak@riak-0.riak-headless)11> element(4, AAEFun(riak_object:to_binary(v1, element(2, riak_client:get(B, K1, C))))).
[{1664,195922,783402}]
(riak@riak-0.riak-headless)13> element(4, AAEFun(riak_object:to_binary(v1, element(2, riak_client:get(B, K2, C))))).
[{1664,195922,923965}]
Which is the correct LMD:
calendar:now_to_datetime({1664,195922,783402}).
{{2022,9,26},{12,38,42}}
So this doesn't explain why we could get an undefined LMD.
Can you confirm you have the default value for tictacaae_rebuildwait in riak.conf (which should be 336 hours)? Also, could you try a GET/PUT of one of the disappearing keys and see if this resolves the behaviour for that key?
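For reference, the corresponding riak.conf entry would look like this (the setting name and the 336-hour default are as stated above; the comment line is my paraphrase of its purpose):

```
## Period (in hours) after which the parallel AAE keystore is rebuilt
tictacaae_rebuildwait = 336
```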
This might be related. There are two binary formats for Riak objects - v0 and v1. Before storing an object, Riak checks that the cluster supports v1:
https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_vnode.erl#L4094-L4101
This caused problems in leveled, which only supports v1 of the binary format. The way capabilities are negotiated could lead to v0 sometimes being chosen during certain failure scenarios (hence with the leveled backend it is now forced to always be v1).
The eleveldb backend doesn't have that protection. So possibly, during an incident, an object could be stored in the v0 binary format. If that happens, though, the binary format doesn't include the last-modified-date, so when the update is put into the AAE backend this (and other) metadata might end up missing.
element(4, AAEFun(riak_object:to_binary(v0, element(2, riak_client:get(B, K1, C))))).
undefined
So I suspect this is what happened here. You have some objects stored during an incident on 26th September 2022, and they now exist in the eleveldb backend in v0 format. So now when the tictac aae stores are built, those objects are added to the key store with an LMD of undefined - and hence they have unpredictable behaviour when running aae_folds with modified date ranges.
The periodic rebuilds of the AAE store won't help, as they will keep rebuilding based on the decoding of the v0 format. The only thing that will help will be to do an inert update on the object, outside of the failure, so that it is replaced in the backend with a v1 version.
I think there is an underlying issue with negotiating things like object version via riak_core capabilities, especially as support for v0 is now so legacy that there is no possibility of being in a mixed cluster with something requiring v0.
The riak.conf file sets the preferred object format:
https://github.com/basho/riak_kv/blob/riak_kv-3.0.16/priv/riak_kv.schema#L529-L548
However, the default format used when not all cluster members can confirm the v1 capability is v0:
https://github.com/basho/riak_kv/blob/riak_kv-3.0.16/src/riak_kv_app.erl#L193-L195
Really this should default to the configured setting, given that cluster support should be guaranteed now.
However, riak_object:aae_from_object_binary/1 needs to be updated to better handle v0 objects. If this were to occur, the problem with these objects would be resolved at the next tictacaae rebuild.
So really, in this scenario where we discover a v0 binary:
https://github.com/basho/riak_kv/blob/riak_kv-3.0.16/src/riak_object.erl#L1275-L1277
Rather than returning an incomplete set of metadata including an undefined last-modified-date, we should instead convert to v1 binary before decoding, i.e.:
summary_from_binary(to_binary(v1, T))
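To make the shape of that proposed fix concrete, here is a toy Python model (decode_v1, convert_v0_to_v1 and the dict-based "binaries" are all stand-ins for illustration, not riak_object's real API): when a v0 binary is discovered, it is upgraded to v1 before summarising, so the LMD is never left undefined.

```python
# Illustrative model of the proposed summary_from_binary change.

def decode_v1(obj):
    # v1 binaries carry the last-modified-date in their metadata.
    return {"lmd": obj["lmd"], "vclock": obj["vclock"]}

def convert_v0_to_v1(obj):
    # Re-encoding as v1 recovers the LMD from the full object, rather
    # than leaving it undefined in the AAE summary.
    return {"fmt": "v1", "lmd": obj["full_lmd"], "vclock": obj["vclock"]}

def summary_from_binary(obj):
    if obj["fmt"] == "v0":
        # Proposed behaviour: upgrade before decoding, instead of
        # returning a summary with lmd = None (undefined).
        obj = convert_v0_to_v1(obj)
    return decode_v1(obj)

v0_obj = {"fmt": "v0", "full_lmd": 63831325561, "vclock": [], "lmd": None}
assert summary_from_binary(v0_obj)["lmd"] == 63831325561
```

With this shape, the next tictacaae rebuild would repopulate the key store with real LMDs for the v0-era objects, and modified-range folds would become deterministic again.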
Hi, some context: we use riak version 3.12, leveldb, ring_size = 64, n_val = 3, and 5 riak nodes. We tried switching to Tictac AAE in order to use NextGen replication. The following was added to the config:
After tictac aae has been built, we observe different statistics on objects through
riak_client:aae_fold({object_stats, <<"domainRecord">>, all, all}).
on immutable buckets (no external traffic enters riak). There are more than 10kk objects in the bucket, and object_stats shows a different value each time with a slight difference. erase_keys also gives different statistics. For example (the request was repeated with a difference of 5 minutes):Over a longer period of time the difference will be greater. Also, when real-time replication is enabled, object_stats on the sink cluster and the source cluster will be different. There are no errors in the logs. Can you give any comments? Is this normal behavior? Due to the lack of tools for checking data integrity, we cannot be sure that consistency between the two clusters will be maintained and we cannot reliably use replication.