Closed keynslug closed 4 years ago
Many thanks for the high quality investigation, and excellent write-up.
I have another bug I'm focused on at the moment, but will try and look at this as soon as possible. @martincox would you, or someone in your team, be able to look too?
My initial reaction if your hunch is correct, is that we should just fail the PUT if the PUT is forwarded to a node that doesn't then find itself in the preflist. But this requires some thought.
Thanks @keynslug. Yep, I'll take a look at this tomorrow, @martinsumner.
In this case the problem occurs straight after a node leave, which makes me suspicious about problems we've seen in the shutdown process for leaving nodes.
We discovered an issue in develop-3.0 testing of riak_ensemble: when a node leaves, before it shuts down it starts a fresh ring (so that it believes it is in a cluster of 1, disconnected, should it restart). This fresh ring is created immediately before shutdown is triggered, but shutdown occurs in an orderly fashion, so services may still be up on the node (with this fresh ring active) whilst the node waits for other services to shut down in order.
What was found with riak_ensemble is that it wasn't guaranteed that the node was disconnected at this point, and some services were still to shut down. Those services might see this fresh ring and react to it: riak_ensemble was taking ownership of stuff on the node that was shutting down, and ensembles would then go unavailable when the riak_ensemble service itself shut down on that node.
The riak_core service is the last to shut down, so it will happily provide the incorrect ring to any service whose shutdown has not yet been triggered.
I don't know if it is possible for a PUT to be forwarded to a node (or a GET to be received by a node) in this state; because such a node sees all vnodes as local due to the fresh ring, it would think that PR or PW can be satisfied locally.
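To make that concern concrete, here is a toy Python sketch (an illustration only, not Riak's actual code) of why a fresh single-node ring lets a PW check appear locally satisfiable: every primary in the preflist is the local node, so the "reachable primaries" count trivially meets any PW.

```python
# Toy model (not riak_core/riak_kv code): a node that has just created a
# fresh single-node ring computes a preflist in which every partition is
# owned by itself, so any PW check against primaries passes locally.

def preflist(ring_owners, n_val):
    """First n_val owners in this toy ring are the primaries for a key."""
    return ring_owners[:n_val]

def pw_satisfied(preflist_nodes, local_node, pw):
    # The shutting-down node believes it can coordinate alone: count how
    # many primaries are reachable (here, only itself is reachable).
    reachable = [n for n in preflist_nodes if n == local_node]
    return len(reachable) >= pw

# Healthy cluster view: 3 distinct primaries, so PW=2 needs a remote ack.
healthy_ring = ["node_a", "node_b", "node_c"]
print(pw_satisfied(preflist(healthy_ring, 3), "node_a", 2))  # False

# Fresh "cluster of 1" ring on the leaving node: all primaries local.
fresh_ring = ["node_a", "node_a", "node_a"]
print(pw_satisfied(preflist(fresh_ring, 3), "node_a", 2))  # True
```

The point of the sketch is only that a PW/PR check computed against a fresh ring is vacuous, since the ring itself claims everything is local.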
To get round this, I introduced (on 3.0 only) the concept of a last-gasp-ring (that is the fresh ring created for the node as it shuts down after leaving):
https://github.com/basho/riak_core/commit/8f6cb273b9001e3afaeb41816dfe3fe22502759e
Then some changes to stop services reacting to a last gasp ring:
e.g. https://github.com/basho/riak_core/commit/450b55869743b1867f5a30f78b300e807d266f0a
One way of checking this would be to use node_confirms as well as PW/PR, as this would stop a disconnected node from still processing work. node_confirms is implemented on writes in 2.9.1 (it is a bucket property like PW, and can be set to 2), and will be added to reads in 2.9.2.
If there is a last-gasp-ring issue, then it might be worth back-porting the last-gasp-ring changes, and then getting the PUT_FSM and GET_FSM to check that any ring they're making decisions on isn't a last-gasp one.
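The node_confirms idea can be sketched as a toy check (assumed semantics based on the description above, not the actual riak_kv implementation): even if enough vnode acks arrive, the write only succeeds when the acks span enough distinct physical nodes.

```python
# Toy sketch of the node_confirms idea: require that PUT acks come from
# at least `node_confirms` *distinct* physical nodes, so a disconnected
# node whose fresh ring makes all vnodes local cannot satisfy the write
# entirely on its own.

def write_ok(acks, pw, node_confirms):
    """acks: list of (vnode_index, node) pairs that acknowledged the PUT."""
    distinct_nodes = {node for _, node in acks}
    return len(acks) >= pw and len(distinct_nodes) >= node_confirms

# All three acks from one node (fresh single-node ring): PW=2 would
# pass, but node_confirms=2 correctly fails the write.
local_only = [(0, "node_a"), (1, "node_a"), (2, "node_a")]
print(write_ok(local_only, pw=2, node_confirms=2))  # False

# Acks spread over two nodes: both conditions hold.
spread = [(0, "node_a"), (1, "node_b"), (2, "node_a")]
print(write_ok(spread, pw=2, node_confirms=2))  # True
```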
Thank you for looking into this @martinsumner.
I've given some thought to your hypothesis and skimmed through the linked commits and related work. However, dare I say, I believe there are a few contradictions to it. Let me explain.
In this case the problem occurs straight after a node leave, which makes me suspicious about problems we've seen in the shutdown process for leaving nodes.
As for this specific Jepsen test report I've provided, it seems that the anomalous read was observed right after those two extra nodes had just started leaving the cluster, long before they actually shut down. You can see it in jepsen.log, for example. Here I cherry-picked a few lines out of it and annotated them a bit for clarity:
// === jepsen decided it's time extra nodes to leave the cluster
2020-03-25 19:21:43,774{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop nil
...
// === last of those nodes acked cluster changes
2020-03-25 19:21:52,428{GMT} INFO [clojure-agent-send-off-pool-15] jepsenized.riak.db: <<< Cluster changes committed
...
// === client read 20
2020-03-25 19:21:55,665{GMT} INFO [jepsen worker 25] jepsen.util: 25 :ok :read ["jepsen638479067-6" 20]
...
// === client read 31, anomalous read
2020-03-25 19:21:55,816{GMT} INFO [jepsen worker 25] jepsen.util: 25 :ok :read ["jepsen638479067-6" 31]
...
// === nodes left cleanly and shut down (after more than 2 minutes)
2020-03-25 19:24:10,609{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop ["extra1" "extra2"]
Moreover, in our production setup we started to experience occasional lost writes, similar to those described in the report, after we joined a fresh node to a cluster of 6 healthy nodes, with no leaving nodes at all. Sure, there may be more moving parts I did not consider, but I still think it's worth mentioning.
I don't know if it is possible for a PUT to be forwarded to a node (or a GET to be received by a node) in this state, so that because it sees all vnodes as local due to the fresh ring, it will think that PR or PW can be satisfied locally.
Yeah, I certainly can imagine that, but I still fail to see how a node with a fresh ring could respond to a read op with a stale value rather than a not-found. Or do you envision a situation where both the write of value 31 and the subsequent read of 31 (which are, by the way, almost exactly three seconds apart) ended up on the same coordinating node?
Anyway, I will certainly try to find some time to make a few runs with the scenario modified to exclude node leaves altogether, to see if I can still reproduce the issue.
On the other hand I might as well come up with a minimal PR fixing just what I still regard as my primary suspect, to better illustrate my point, since there's a body of evidence which is hard to explain otherwise:
Do you accept and review external PRs?
As always I may be wrong in my assumptions, please feel free to point it out.
Ah, I had missed the fact that Jepsen had logged the leave completing some time after; I had assumed that, given the small size of the data set, the leave would have been much quicker.
I was thinking along the lines of the two theories combining. When riak_kv stops it brings the API down first; perhaps the fact that we forward PUTs allows the node to be hit in this interim state, when without forwarding it could not be hit by a normal request as the local API is down.
But I can't satisfactorily explain how the reads get the old value without the node being resurrected in some way. Nor does it explain why there should be problems on joins, not leaves. So treat my theory as a diversion for now.
Couple of questions though ...
The errors in the logs you included in the report were for jepsen638479067-8, not jepsen638479067-6 - were there also similar errors for jepsen638479067-6?
Is there a consistent time gap between the PUT and its anomalous read?
In terms of a PR, at some stage we need to run through the full suite of riak_test tests. If you could help with confirming those have passed before submitting the PR, that would be useful.
Ideally changes should have their own riak_test test - but in this case I wouldn't expect it to be reproducible in riak_test, so that step can be skipped.
Everything gets done on a best-endeavours basis in terms of getting the PR into a release; I can't promise much more than that we will do our best.
The errors in the logs you included in the report were for jepsen638479067-8, not jepsen638479067-6 - were there also similar errors for jepsen638479067-6?
Gosh, sorry for that confusion. Here are all of them, out of n1/error.log:
2020-03-25T19:19:00.386369Z [error] Unrecognized message {15919439,{ok,{r_object,<<"registers">>,<<"jepsen638479067-8">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"X-Riak-VTag">>,50,113,86,90,100,101,73,71,104,90,49,76,106,117,113,77,121,106,86,52,98,70]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1585,163940,383362}]],[],[]}}},<<"37">>}],[{<<"...">>,{13,63752383125}},{<<188,169,128,137,39,255,43,155>>,{5,63752383126}},{<<188,169,128,137,39,255,44,215,0,0,0,1>>,{28,63752383133}},{<<91,17,173,142,39,242,0,107>>,{1,63752383139}},{<<187,196,68,144,40,0,3,163>>,{8,63752383140}},{<<188,169,128,137,39,255,44,166>>,{7,63752383140}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined}}}
2020-03-25T19:19:02.354180Z [error] Unrecognized message {53908694,{ok,{r_object,<<"registers">>,<<"jepsen638479067-8">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"X-Riak-VTag">>,49,85,73,73,113,78,109,51,122,50,116,66,102,119,90,103,55,81,100,57,76,49]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1585,163942,349490}]],[],[]}}},<<"17">>}],[{<<"...">>,{13,63752383125}},{<<188,169,128,137,39,255,43,155>>,{5,63752383126}},{<<188,169,128,137,39,255,44,215,0,0,0,1>>,{28,63752383133}},{<<91,17,173,142,39,242,0,107>>,{1,63752383139}},{<<188,169,128,137,39,255,44,166>>,{7,63752383140}},{<<187,196,68,144,40,0,3,163>>,{11,63752383142}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined}}}
2020-03-25T19:19:03.369876Z [error] Unrecognized message {16597173,{ok,{r_object,<<"registers">>,<<"jepsen638479067-8">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"X-Riak-VTag">>,54,98,51,82,75,78,108,76,89,53,81,75,117,86,86,99,56,121,98,112,107,116]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1585,163943,368193}]],[],[]}}},<<"36">>}],[{<<"...">>,{13,63752383125}},{<<188,169,128,137,39,255,43,155>>,{5,63752383126}},{<<188,169,128,137,39,255,44,215,0,0,0,1>>,{28,63752383133}},{<<188,169,128,137,39,255,44,166>>,{7,63752383140}},{<<91,17,173,142,39,242,0,107>>,{2,63752383143}},{<<187,196,68,144,40,0,3,163>>,{12,63752383143}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined}}}
2020-03-25T19:20:57.343565Z [error] Unrecognized message {57936010,{ok,{r_object,<<"registers">>,<<"jepsen638479067-2">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"X-Riak-VTag">>,53,121,102,114,88,102,119,83,115,81,88,102,118,103,117,48,115,57,101,121,65,50]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1585,164057,339923}]],[],[]}}},<<"37">>}],[{<<"...">>,{3,63752383124}},{<<37,160,209,51,39,247,98,136,0,0,0,1>>,{21,63752383146}},{<<188,169,128,137,40,2,115,135>>,{57,63752383147}},{<<91,17,173,142,39,244,115,226>>,{79,63752383238}},{<<203,174,176,31,44,186,107,84>>,{18,63752383255}},{<<188,169,128,137,40,2,109,198>>,{68,63752383256}},{<<187,196,68,144,40,3,42,144>>,{53,63752383257}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined}}}
2020-03-25T19:20:58.611196Z [error] Unrecognized message {91293840,{ok,{r_object,<<"registers">>,<<"jepsen638479067-2">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"X-Riak-VTag">>,52,104,122,48,109,108,100,56,111,69,85,118,72,70,68,102,81,67,53,68,82,49]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1585,164058,603732}]],[],[]}}},<<"4">>}],[{<<"...">>,{3,63752383124}},{<<37,160,209,51,39,247,98,136,0,0,0,1>>,{21,63752383146}},{<<188,169,128,137,40,2,115,135>>,{57,63752383147}},{<<91,17,173,142,39,244,115,226>>,{79,63752383238}},{<<188,169,128,137,40,2,109,198>>,{68,63752383256}},{<<187,196,68,144,40,3,42,144>>,{56,63752383258}},{<<203,174,176,31,44,186,107,84>>,{19,63752383258}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined}}}
2020-03-25T19:21:55.782260Z [error] Unrecognized message {34444012,{ok,{r_object,<<"registers">>,<<"jepsen638479067-6">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"X-Riak-VTag">>,50,83,86,110,69,50,118,118,102,57,72,69,109,107,76,77,114,110,107,109,78,84]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1585,164115,780297}]],[],[]}}},<<"31">>}],[{<<"...">>,{14,63752383122}},{<<188,169,128,137,39,255,44,99,0,0,0,1>>,{15,63752383125}},{<<"...">>,{41,63752383198}},{<<187,196,68,144,40,0,5,86>>,{65,63752383225}},{<<204,195,116,6,39,255,166,181>>,{66,63752383226}},{<<203,174,176,31,44,182,162,128>>,{39,63752383310}},{<<44,22,157,24,48,210,138,85>>,{58,63752383315}},{<<91,17,173,142,53,198,69,42>>,{4,63752383315}},{<<188,169,128,137,40,5,178,33>>,{85,63752383315}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined}}}
Notably, there's only one occurrence of jepsen638479067-6 here, but it happens to have exactly the value 31, the same as observed in the anomalous read.
Is there a consistent time gap between the PUT and its anomalous read?
IIRC, in all of my successfully reproduced cases (including this report) the anomalous read followed its write by roughly three seconds.
This is how I actually started my investigation; there are not many other places where a 3-second interval is used as a default:
5 results - 3 files
src/riak_kv_eraser.erl:
88: DeleteMode = app_helper:get_env(riak_kv, delete_mode, 3000),
96: DeleteMode = app_helper:get_env(riak_kv, delete_mode, 3000),
src/riak_kv_put_fsm.erl:
179: set_put_coordinator_failure_timeout(3000).
182: app_helper:get_env(riak_kv, put_coordinator_failure_timeout, 3000).
src/riak_kv_vnode.erl:
724: DeleteMode = app_helper:get_env(riak_kv, delete_mode, 3000),
In terms of a PR, at some stage we need to run through the full suite of riak_test tests. If you could help with confirming those have passed before submitting the PR that would be useful. Ideally changes should have their own riak_test test - but in this case I wouldn't expect it to be reproducible in riak_test so that step can be skipped.
Everything gets done on a best endeavours basis in terms of getting the PR into a release, I can't promise much more than that we will do our best.
Quite fair I'd say. I'll take a shot then if I'm able to secure some time for it.
Sorry, I might not be understanding the Jepsen log entry, could you help me please:
{:type :invoke, :f :write, :value ["jepsen638479067-6" 31], :process 27, :time 205105199455, :index 16232}
{:type :ok, :f :write, :value ["jepsen638479067-6" 31], :process 27, :time 205110025753, :vclock ["a85hYGDgzGDKBVIcqgsuGqt//nQggymRL4+VYZLWh+t8ECmePSsbOtX/6yQD2YxAaX6g9FSENETnz0NBQClNoNQ9JKndR1wmaDCwhgGlHIFSP5GkzhwuYVP/v2wrUMoJKPULSUpHbK6EwaWuUKCUBVDKVxshdXrdBnmdbYsagFLqQCk/JKlowbV9psdctYBSzECpACQpkAc0WDcpAqWCYVJZAA==" "a85hYGDgzGDKBVIcqgsuGqt//nQggymRL4+VYZLWh+t8ECmePSsbOtX/6yQD2YxAaX6g9FSENETnz0NBQClNoNQ9JKndR1wmaDCwhgGlHIFSP5GkzhwuYVP/v2wrUMoJKPULSUpHbK6EwaWuUKCUBVDKVxshdXrdBnmdbYsagFLqQCk/JKlowbV9psdctYBSzECpACQpkAc0WDcpAqVCgFKBIKksAA=="], :index 16234}
Does :type :ok mean that the client has at that time received an OK response for the PUT? What are the two vector clocks that it reports at index 16234?
Does :type :ok mean that the client has at that time received an OK response for the PUT?
That's correct.
What are the two vector clocks that it reports at index 16234?
The first one is the pre-request vclock which the client used to make the request; the second one is the post-request vclock received along with the response.
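As a toy illustration of the relationship between these two clocks (using dict-based version vectors rather than Riak's base64-encoded binary vclock format), the post-request vclock of a successful PUT should descend from the pre-request vclock the client supplied:

```python
# Toy version vectors (not Riak's binary vclock encoding): after a
# successful PUT, the post-request vclock should dominate (descend
# from) the pre-request vclock the client sent.

def descends(post, pre):
    """True if `post` has seen at least every event recorded in `pre`."""
    return all(post.get(actor, 0) >= counter for actor, counter in pre.items())

pre_vclock  = {"actor1": 3, "actor2": 7}
post_vclock = {"actor1": 4, "actor2": 7, "actor3": 1}  # one more event
print(descends(post_vclock, pre_vclock))  # True

# If a coordinator acted on a stale view of the object, the returned
# clock might NOT descend from what the client sent:
stale = {"actor1": 3, "actor2": 5}
print(descends(stale, pre_vclock))  # False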
Sorry, I got distracted with other events.
So is it possible that:
... Other PUTs occur lifting value to y then z ...
The loss of the ack from the coordinator (on B) back to the original PUT_FSM (on A) would therefore have the potential for data loss with allow_mult=false.
This doesn't explain why this would occur whilst a node leave or add is occurring - why this event should cause such a message loss. But regardless of node leave/join issues - it might suggest that retry_put_coordinator is not safe with allow_mult=false.
Yes, I too believe that this is more or less what is happening.
The loss of the ack from the coordinator (on B) back to the original PUT_FSM (on A) would therefore have the potential for data loss with allow_mult=false.
This doesn't explain why this would occur whilst a node leave or add is occurring - why this event should cause such a message loss. But regardless of node leave/join issues - it might suggest that retry_put_coordinator is not safe with allow_mult=false.
My best bet is this. AFAIK, ring state cannot be strongly consistent across a cluster, only eventually consistent, given that its changes are gossiped. This means that in the presence of constant ownership transfers, some two nodes may disagree on the ring state and consequently have different sets of primaries for some key.
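That disagreement can be sketched with a toy ring (hypothetical partition layout and node names, not Riak's actual ring structure): two nodes holding different gossip-delayed views of an in-flight ownership transfer compute different primaries for the same key.

```python
# Toy illustration of gossiped ring state: two nodes with different
# (eventually-consistent) ring views compute different primaries for
# the same key while an ownership transfer is in flight.

def primaries(h, ring, n_val=3):
    """ring: list of (partition_end, owner) sorted by partition_end;
    h: the key's hash position on the toy 0..63 ring."""
    start = next(i for i, (end, _) in enumerate(ring) if h <= end)
    return [ring[(start + i) % len(ring)][1] for i in range(n_val)]

# Node A's view: the first partition is still owned by "old".
ring_a = [(15, "old"), (31, "n2"), (47, "n3"), (63, "n4")]
# Node B's view: ownership has already transferred to "new".
ring_b = [(15, "new"), (31, "n2"), (47, "n3"), (63, "n4")]

# A key hashing into the transferred partition (h = 10):
print(primaries(10, ring_a))  # ['old', 'n2', 'n3']
print(primaries(10, ring_b))  # ['new', 'n2', 'n3']
```

The two preflists differ in exactly the transferred partition, so a PUT forwarded according to one view can land on a node that, by its own view, does not consider itself primary.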
- ack of forwarding is not returned to Node A (e.g. message lost), but Node B still processes and completes the PUT (to value x)
...This is why I believe there was no message loss (since we would observe it somewhere in the operation history, or at least in the logs). Presumably, the ack was not returned to Node A because Node B had forwarded once again and did not care to ack Node A afterwards.
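The data-loss potential with allow_mult=false discussed above can be sketched as follows (a toy last-write-wins model under assumed semantics, not riak_kv's actual sibling resolution code): a delayed retry of an old value arrives with a newer last-modified timestamp and silently clobbers the writes that happened in between.

```python
import itertools

# Toy sketch of why a late coordinator retry is dangerous with
# allow_mult=false: siblings are collapsed by keeping the value with
# the newest last-modified timestamp, so a delayed retry of an old
# value can clobber intervening writes.

clock = itertools.count(1)

def put(store, key, value):
    store[key] = (value, next(clock))  # (value, last-modified "timestamp")

def merge_allow_mult_false(a, b):
    # allow_mult=false: keep whichever sibling was written last.
    return a if a[1] >= b[1] else b

store = {}
put(store, "k", "x")   # the original PUT writes value x
put(store, "k", "y")   # later PUTs lift the value to y...
put(store, "k", "z")   # ...and then to z

# A retried coordination of the original PUT finally lands: it writes
# value x again, but with a *newer* timestamp than z.
late_retry = ("x", next(clock))
store["k"] = merge_allow_mult_false(store["k"], late_retry)
print(store["k"][0])  # 'x' -- the writes y and z are silently lost
```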
I made an attempt to fix it with #1755. I must confess it's not ready for review yet, but I think it at least may serve as an illustration of sorts.
Fix now merged and released in both 2.9 and 3.0.
Riak KV may lose recent writes in the presence of heavy handoff activity.
I've made a test with the help of the Jepsen framework specifically to investigate this kind of buggy behaviour, because it has hit us several times already in our production environment. To be honest, this test observes an occurrence of this bug on almost every run. An ordinary test run follows roughly these steps:
I have attached a Jepsen report archive of one such test run. The most notable observations are in results.edn; here's an excerpt, redacted for brevity:
Basically this all means that client 25, while operating on key jepsen638479067-6, read value 20 and subsequently read value 31 quite unexpectedly, all in the absence of concurrent writes under this key.
Skimming through history.edn, here's the relevant history fragment, again redacted a bit for brevity:
It can be seen clearly that the value 31 written at index 16232 surfaced in the read at index 16407, even though in the meantime 3 subsequent writes had succeeded (16245, 16356, 16361).
Affected version(s)
Though it won't hurt to mention that I have successfully reproduced this bug under riak 2.2.3 as well.
Steps to reproduce
Clone aforementioned repo.
Start a test run with:
Wait for it to finish cleanly.
Note that tearing down a node with riak stop on 2.9.1 may become stuck for some reason at the end of a run; a simple ^C is fine then.
Look at the analysis report in store/latest/results.edn.
Suspected cause
Essentially this is caused by the fact that the put FSM does not account for the possibility of multiple forwards to another coordinator. I'll try to illustrate:
retry_put_coordinator_failure is true, which is the default per the cuttlefish schema.
This is why I believe there are occasional unrecognized messages in the logs, which are extraneous responses to original requests retried unwarrantedly:
One more piece of evidence which strengthens my suspicion is that turning off retry_put_coordinator_failure makes this buggy behaviour go away with quite high probability, if not completely, given the number of test runs I've performed so far.
Additional observations
Moreover, it's not clear to me why the acknowledgement is sent in the execute state of the FSM rather than in the validate state, since there's a possibility that execute will be skipped altogether, in the event of a parameter violation for example.
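This last concern can be sketched as a toy FSM walk (an assumed shape for illustration, not the real riak_kv_put_fsm): if the ack only happens in execute, a request rejected during validate never acks the forwarding node, which will then retry even though nothing was going to run.

```python
# Toy put-FSM state walk: the forwarding ack lives in `execute`, so a
# request rejected in `validate` (e.g. a parameter violation) skips
# `execute` entirely and the forwarding node never receives an ack.

def run_put_fsm(params_ok, acks):
    # validate state
    if not params_ok:
        return "rejected"            # execute is skipped altogether...
    # execute state
    acks.append("ack_to_forwarder")  # ...so this ack is never sent
    return "coordinated"

acks = []
print(run_put_fsm(params_ok=False, acks=acks))  # rejected
print(acks)  # [] -- no ack ever reached the forwarding node
```

Sending the ack from validate instead would cover the rejection path as well, which is presumably the point of the observation above.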