Closed keynslug closed 4 years ago
Many thanks for the high quality investigation, and excellent write-up.
I have another bug I'm focused on at the moment, but will try and look at this as soon as possible. @martincox would you, or someone in your team, be able to look too?
My initial reaction if your hunch is correct, is that we should just fail the PUT if the PUT is forwarded to a node that doesn't then find itself in the preflist. But this requires some thought.
Thanks @keynslug. Yep, I'll take a look at this tomorrow, @martinsumner.
In this case the problem occurs straight after a node leave, which makes me suspicious about problems we've seen in the shutdown process for leaving nodes.
We discovered an issue in develop-3.0 testing of riak_ensemble: when a node leaves, before it shuts down it starts a fresh ring (so that it believes it is in a cluster of 1, disconnected, should it restart). This fresh ring is created immediately before shutdown is triggered, but shutdown occurs in an orderly fashion, so services may still be up on the node (with this fresh ring active) whilst the node waits for other services to shut down in order.
What was found with riak_ensemble is that it wasn't guaranteed that the node was disconnected at this point, and some services were still to shut down. Those services might see this fresh ring and react to it: riak_ensemble was taking ownership of stuff on the node that was shutting down, and ensembles would then go unavailable when the riak_ensemble service itself shut down on that node.
The riak_core service is the last to shut down, so it will happily provide the incorrect ring to any service whose shutdown has not yet been triggered.
I don't know if it is possible for a PUT to be forwarded to a node (or a GET to be received by a node) in this state; because such a node sees all vnodes as local due to the fresh ring, it would think that PR or PW can be satisfied locally.
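To make that concern concrete, here is a toy Python sketch (an illustration only, not Riak's actual code) of why a fresh single-node ring lets a PW check appear locally satisfiable: every primary in the preflist is the local node, so the "reachable primaries" count trivially meets any PW.

```python
# Toy model (not riak_core/riak_kv code): a node that has just created a
# fresh single-node ring computes a preflist in which every partition is
# owned by itself, so any PW check against primaries passes locally.

def preflist(ring_owners, n_val):
    """First n_val owners in this toy ring are the primaries for a key."""
    return ring_owners[:n_val]

def pw_satisfied(preflist_nodes, local_node, pw):
    # The shutting-down node believes it can coordinate alone: count how
    # many primaries are reachable (here, only itself is reachable).
    reachable = [n for n in preflist_nodes if n == local_node]
    return len(reachable) >= pw

# Healthy cluster view: 3 distinct primaries, so PW=2 needs a remote ack.
healthy_ring = ["node_a", "node_b", "node_c"]
print(pw_satisfied(preflist(healthy_ring, 3), "node_a", 2))  # False

# Fresh "cluster of 1" ring on the leaving node: all primaries local.
fresh_ring = ["node_a", "node_a", "node_a"]
print(pw_satisfied(preflist(fresh_ring, 3), "node_a", 2))  # True
```

The point of the sketch is only that a PW/PR check computed against a fresh ring is vacuous, since the ring itself claims everything is local.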
To get round this, I introduced (on 3.0 only) the concept of a last-gasp-ring (that is the fresh ring created for the node as it shuts down after leaving):
https://github.com/basho/riak_core/commit/8f6cb273b9001e3afaeb41816dfe3fe22502759e
Then some changes to stop services reacting to a last gasp ring:
e.g. https://github.com/basho/riak_core/commit/450b55869743b1867f5a30f78b300e807d266f0a
One way of checking this would be to use node_confirms as well as PW/PR, as this would stop a disconnected node from still processing work. node_confirms is implemented on writes in 2.9.1 (it is a bucket property like PW, and can be set to 2), and will be added to reads in 2.9.2.
If there is a last-gasp-ring issue, then it might be worth back-porting the last-gasp-ring changes, and then getting the PUT_FSM and GET_FSM to check that any ring they're making decisions on isn't a last-gasp one.
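The node_confirms idea can be sketched as a toy check (assumed semantics based on the description above, not the actual riak_kv implementation): even if enough vnode acks arrive, the write only succeeds when the acks span enough distinct physical nodes.

```python
# Toy sketch of the node_confirms idea: require that PUT acks come from
# at least `node_confirms` *distinct* physical nodes, so a disconnected
# node whose fresh ring makes all vnodes local cannot satisfy the write
# entirely on its own.

def write_ok(acks, pw, node_confirms):
    """acks: list of (vnode_index, node) pairs that acknowledged the PUT."""
    distinct_nodes = {node for _, node in acks}
    return len(acks) >= pw and len(distinct_nodes) >= node_confirms

# All three acks from one node (fresh single-node ring): PW=2 would
# pass, but node_confirms=2 correctly fails the write.
local_only = [(0, "node_a"), (1, "node_a"), (2, "node_a")]
print(write_ok(local_only, pw=2, node_confirms=2))  # False

# Acks spread over two nodes: both conditions hold.
spread = [(0, "node_a"), (1, "node_b"), (2, "node_a")]
print(write_ok(spread, pw=2, node_confirms=2))  # True
```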
Thank you for looking into this @martinsumner.
I've given some thought to your hypothesis and skimmed through the linked commits and related work. However, dare I say, I believe there are a few contradictions to it. Let me explain.
In this case the problem occurs straight after a node leave, which makes me suspicious about problems we've seen in the shutdown process for leaving nodes.
As for this specific Jepsen test report I've provided, it seems that the anomalous read was observed right after those two extra nodes had just started leaving the cluster, long before they actually shut down. You can see it in jepsen.log, for example. Here I cherry-picked a few lines out of it and annotated them a bit for clarity:
// === jepsen decided it's time extra nodes to leave the cluster
2020-03-25 19:21:43,774{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop nil
...
// === last of those nodes acked cluster changes
2020-03-25 19:21:52,428{GMT} INFO [clojure-agent-send-off-pool-15] jepsenized.riak.db: <<< Cluster changes committed
...
// === client read 20
2020-03-25 19:21:55,665{GMT} INFO [jepsen worker 25] jepsen.util: 25 :ok :read ["jepsen638479067-6" 20]
...
// === client read 31, anomalous read
2020-03-25 19:21:55,816{GMT} INFO [jepsen worker 25] jepsen.util: 25 :ok :read ["jepsen638479067-6" 31]
...
// === nodes left cleanly and shut down (after more than 2 minutes)
2020-03-25 19:24:10,609{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop ["extra1" "extra2"]
Moreover, in our production setup we started to experience occasional lost writes, similar to those described in the report, after we joined a fresh node to a cluster of 6 healthy nodes, with no leaving nodes at all. Sure, there may be more moving parts I did not consider, but I still think it's worth mentioning.
I don't know if it is possible for a PUT to be forwarded to a node (or a GET to be received by a node) in this state, so that because it sees all vnodes as local due to the fresh ring, it will think that PR or PW can be satisfied locally.
Yeah, I certainly can imagine that, but I still fail to see how a node with a fresh ring could respond to a read op with a stale value rather than a not-found. Or do you envision a situation where both the write of value 31 and the subsequent read of 31 (which are, by the way, almost exactly three seconds apart) ended up on the same coordinating node?
Anyway, I will certainly try to find some time to make a few runs with the scenario modified to exclude node leaves altogether, to see if I can still reproduce the issue.
On the other hand I might as well come up with a minimal PR fixing just what I still regard as my primary suspect, to better illustrate my point, since there's a body of evidence which is hard to explain otherwise:
Do you accept and review external PRs?
As always I may be wrong in my assumptions, please feel free to point it out.
Ah, I had missed the fact that Jepsen had logged the leave completing some time after; I had assumed that, given the small size of the data set, the leave would have been much quicker.
I was thinking along the lines of the two theories combining. When riak_kv stops it brings the API down first; perhaps the fact that we forward PUTs allows the node to be hit in this interim state, when without forwarding it could not be hit by a normal request as the local API is down.
But I can't satisfactorily explain how the reads get the old value without the node being resurrected in some way. Nor does it explain why there should be problems on joins, not leaves. So treat my theory as a diversion for now.
Couple of questions though ...
The errors in the logs you included in the report were for jepsen638479067-8, not jepsen638479067-6 - were there also similar errors for jepsen638479067-6?
Is there a consistent time gap between the PUT and its anomalous read?
In terms of a PR, at some stage we need to run through the full suite of riak_test tests. If you could help with confirming those have passed before submitting the PR, that would be useful.
Ideally changes should have their own riak_test test - but in this case I wouldn't expect it to be reproducible in riak_test, so that step can be skipped.
Everything gets done on a best-endeavours basis in terms of getting the PR into a release; I can't promise much more than that we will do our best.
The errors in the logs you included in the report were for jepsen638479067-8, not jepsen638479067-6 - were there also similar errors for jepsen638479067-6?
Gosh, sorry for that confusion. Here are all of them, out of n1/error.log:
2020-03-25T19:19:00.386369Z [error] Unrecognized message {15919439,{ok,{r_object,<<"registers">>,<<"jepsen638479067-8">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"X-Riak-VTag">>,50,113,86,90,100,101,73,71,104,90,49,76,106,117,113,77,121,106,86,52,98,70]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1585,163940,383362}]],[],[]}}},<<"37">>}],[{<<"...">>,{13,63752383125}},{<<188,169,128,137,39,255,43,155>>,{5,63752383126}},{<<188,169,128,137,39,255,44,215,0,0,0,1>>,{28,63752383133}},{<<91,17,173,142,39,242,0,107>>,{1,63752383139}},{<<187,196,68,144,40,0,3,163>>,{8,63752383140}},{<<188,169,128,137,39,255,44,166>>,{7,63752383140}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined}}}
2020-03-25T19:19:02.354180Z [error] Unrecognized message {53908694,{ok,{r_object,<<"registers">>,<<"jepsen638479067-8">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"X-Riak-VTag">>,49,85,73,73,113,78,109,51,122,50,116,66,102,119,90,103,55,81,100,57,76,49]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1585,163942,349490}]],[],[]}}},<<"17">>}],[{<<"...">>,{13,63752383125}},{<<188,169,128,137,39,255,43,155>>,{5,63752383126}},{<<188,169,128,137,39,255,44,215,0,0,0,1>>,{28,63752383133}},{<<91,17,173,142,39,242,0,107>>,{1,63752383139}},{<<188,169,128,137,39,255,44,166>>,{7,63752383140}},{<<187,196,68,144,40,0,3,163>>,{11,63752383142}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined}}}
2020-03-25T19:19:03.369876Z [error] Unrecognized message {16597173,{ok,{r_object,<<"registers">>,<<"jepsen638479067-8">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"X-Riak-VTag">>,54,98,51,82,75,78,108,76,89,53,81,75,117,86,86,99,56,121,98,112,107,116]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1585,163943,368193}]],[],[]}}},<<"36">>}],[{<<"...">>,{13,63752383125}},{<<188,169,128,137,39,255,43,155>>,{5,63752383126}},{<<188,169,128,137,39,255,44,215,0,0,0,1>>,{28,63752383133}},{<<188,169,128,137,39,255,44,166>>,{7,63752383140}},{<<91,17,173,142,39,242,0,107>>,{2,63752383143}},{<<187,196,68,144,40,0,3,163>>,{12,63752383143}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined}}}
2020-03-25T19:20:57.343565Z [error] Unrecognized message {57936010,{ok,{r_object,<<"registers">>,<<"jepsen638479067-2">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"X-Riak-VTag">>,53,121,102,114,88,102,119,83,115,81,88,102,118,103,117,48,115,57,101,121,65,50]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1585,164057,339923}]],[],[]}}},<<"37">>}],[{<<"...">>,{3,63752383124}},{<<37,160,209,51,39,247,98,136,0,0,0,1>>,{21,63752383146}},{<<188,169,128,137,40,2,115,135>>,{57,63752383147}},{<<91,17,173,142,39,244,115,226>>,{79,63752383238}},{<<203,174,176,31,44,186,107,84>>,{18,63752383255}},{<<188,169,128,137,40,2,109,198>>,{68,63752383256}},{<<187,196,68,144,40,3,42,144>>,{53,63752383257}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined}}}
2020-03-25T19:20:58.611196Z [error] Unrecognized message {91293840,{ok,{r_object,<<"registers">>,<<"jepsen638479067-2">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"X-Riak-VTag">>,52,104,122,48,109,108,100,56,111,69,85,118,72,70,68,102,81,67,53,68,82,49]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1585,164058,603732}]],[],[]}}},<<"4">>}],[{<<"...">>,{3,63752383124}},{<<37,160,209,51,39,247,98,136,0,0,0,1>>,{21,63752383146}},{<<188,169,128,137,40,2,115,135>>,{57,63752383147}},{<<91,17,173,142,39,244,115,226>>,{79,63752383238}},{<<188,169,128,137,40,2,109,198>>,{68,63752383256}},{<<187,196,68,144,40,3,42,144>>,{56,63752383258}},{<<203,174,176,31,44,186,107,84>>,{19,63752383258}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined}}}
2020-03-25T19:21:55.782260Z [error] Unrecognized message {34444012,{ok,{r_object,<<"registers">>,<<"jepsen638479067-6">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"X-Riak-VTag">>,50,83,86,110,69,50,118,118,102,57,72,69,109,107,76,77,114,110,107,109,78,84]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1585,164115,780297}]],[],[]}}},<<"31">>}],[{<<"...">>,{14,63752383122}},{<<188,169,128,137,39,255,44,99,0,0,0,1>>,{15,63752383125}},{<<"...">>,{41,63752383198}},{<<187,196,68,144,40,0,5,86>>,{65,63752383225}},{<<204,195,116,6,39,255,166,181>>,{66,63752383226}},{<<203,174,176,31,44,182,162,128>>,{39,63752383310}},{<<44,22,157,24,48,210,138,85>>,{58,63752383315}},{<<91,17,173,142,53,198,69,42>>,{4,63752383315}},{<<188,169,128,137,40,5,178,33>>,{85,63752383315}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined}}}
Notably, there's only one occurrence of jepsen638479067-6 here, but it happens to have exactly the value 31, the same as observed in the anomalous read.
Is there a consistent time gap between the PUT and its anomalous read?
IIRC, in all of my successfully reproduced cases (including this report) the anomalous read followed its write by roughly three seconds.
This is how I actually started my investigation; there are not many other places where a 3-second interval is used as a default:
5 results - 3 files
src/riak_kv_eraser.erl:
88: DeleteMode = app_helper:get_env(riak_kv, delete_mode, 3000),
96: DeleteMode = app_helper:get_env(riak_kv, delete_mode, 3000),
src/riak_kv_put_fsm.erl:
179: set_put_coordinator_failure_timeout(3000).
182: app_helper:get_env(riak_kv, put_coordinator_failure_timeout, 3000).
src/riak_kv_vnode.erl:
724: DeleteMode = app_helper:get_env(riak_kv, delete_mode, 3000),
In terms of a PR, at some stage we need to run through the full suite of riak_test tests. If you could help with confirming those have passed before submitting the PR that would be useful. Ideally changes should have their own riak_test test - but in this case I wouldn't expect it to be reproducible in riak_test so that step can be skipped.
Everything gets done on a best endeavours basis in terms of getting the PR into a release, I can't promise much more than that we will do our best.
Quite fair I'd say. I'll take a shot then if I'm able to secure some time for it.
Sorry, I might not be understanding the Jepsen log entry, could you help me please:
{:type :invoke, :f :write, :value ["jepsen638479067-6" 31], :process 27, :time 205105199455, :index 16232}
{:type :ok, :f :write, :value ["jepsen638479067-6" 31], :process 27, :time 205110025753, :vclock ["a85hYGDgzGDKBVIcqgsuGqt//nQggymRL4+VYZLWh+t8ECmePSsbOtX/6yQD2YxAaX6g9FSENETnz0NBQClNoNQ9JKndR1wmaDCwhgGlHIFSP5GkzhwuYVP/v2wrUMoJKPULSUpHbK6EwaWuUKCUBVDKVxshdXrdBnmdbYsagFLqQCk/JKlowbV9psdctYBSzECpACQpkAc0WDcpAqWCYVJZAA==" "a85hYGDgzGDKBVIcqgsuGqt//nQggymRL4+VYZLWh+t8ECmePSsbOtX/6yQD2YxAaX6g9FSENETnz0NBQClNoNQ9JKndR1wmaDCwhgGlHIFSP5GkzhwuYVP/v2wrUMoJKPULSUpHbK6EwaWuUKCUBVDKVxshdXrdBnmdbYsagFLqQCk/JKlowbV9psdctYBSzECpACQpkAc0WDcpAqVCgFKBIKksAA=="], :index 16234}
Does :type :ok mean that the client has at that time received an OK response for the PUT? What are the two vector clocks that it reports at index 16234?
Does :type :ok mean that the client has at that time received an OK response for the PUT?
That's correct.
What are the two vector clocks that it reports at index 16234?
The first one is the pre-request vclock which the client used to make the request; the second one is the post-request vclock received along with the response.
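As a toy illustration of the relationship between these two clocks (using dict-based version vectors rather than Riak's base64-encoded binary vclock format), the post-request vclock of a successful PUT should descend from the pre-request vclock the client supplied:

```python
# Toy version vectors (not Riak's binary vclock encoding): after a
# successful PUT, the post-request vclock should dominate (descend
# from) the pre-request vclock the client sent.

def descends(post, pre):
    """True if `post` has seen at least every event recorded in `pre`."""
    return all(post.get(actor, 0) >= counter for actor, counter in pre.items())

pre_vclock  = {"actor1": 3, "actor2": 7}
post_vclock = {"actor1": 4, "actor2": 7, "actor3": 1}  # one more event
print(descends(post_vclock, pre_vclock))  # True

# If a coordinator acted on a stale view of the object, the returned
# clock might NOT descend from what the client sent:
stale = {"actor1": 3, "actor2": 5}
print(descends(stale, pre_vclock))  # False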
Sorry, I got distracted with other events.
So is it possible that:
... Other PUTs occur lifting value to y then z ...
The loss of the ack from the coordinator (on B) back to the original PUT_FSM (on A) would therefore have the potential for data loss with allow_mult=false.
This doesn't explain why this would occur whilst a node leave or add is occurring - why this event should cause such a message loss. But regardless of node leave/join issues - it might suggest that retry_put_coordinator is not safe with allow_mult=false.
Yes, I too believe that this is more or less what is happening.
The loss of the ack from the coordinator (on B) back to the original PUT_FSM (on A) would therefore have the potential for data loss with allow_mult=false.
This doesn't explain why this would occur whilst a node leave or add is occurring - why this event should cause such a message loss. But regardless of node leave/join issues - it might suggest that retry_put_coordinator is not safe with allow_mult=false.
My best bet is this. AFAIK, ring state cannot be strongly consistent across a cluster, only eventually consistent, given that its changes are gossiped. This means that in the presence of constant ownership transfers, some two nodes may disagree on the ring state and consequently have different sets of primaries for some key.
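That disagreement can be sketched with a toy ring (hypothetical partition layout and node names, not Riak's actual ring structure): two nodes holding different gossip-delayed views of an in-flight ownership transfer compute different primaries for the same key.

```python
# Toy illustration of gossiped ring state: two nodes with different
# (eventually-consistent) ring views compute different primaries for
# the same key while an ownership transfer is in flight.

def primaries(h, ring, n_val=3):
    """ring: list of (partition_end, owner) sorted by partition_end;
    h: the key's hash position on the toy 0..63 ring."""
    start = next(i for i, (end, _) in enumerate(ring) if h <= end)
    return [ring[(start + i) % len(ring)][1] for i in range(n_val)]

# Node A's view: the first partition is still owned by "old".
ring_a = [(15, "old"), (31, "n2"), (47, "n3"), (63, "n4")]
# Node B's view: ownership has already transferred to "new".
ring_b = [(15, "new"), (31, "n2"), (47, "n3"), (63, "n4")]

# A key hashing into the transferred partition (h = 10):
print(primaries(10, ring_a))  # ['old', 'n2', 'n3']
print(primaries(10, ring_b))  # ['new', 'n2', 'n3']
```

The two preflists differ in exactly the transferred partition, so a PUT forwarded according to one view can land on a node that, by its own view, does not consider itself primary.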
- ack of forwarding is not returned to Node A (e.g. message lost), but Node B still processes and completes the PUT (to value x)
...This is why I believe there was no message loss (since we would observe it somewhere in the operation history, or at least in the logs). Presumably, the ack was not returned to Node A because Node B had forwarded once again and did not care to ack Node A afterwards.
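The data-loss potential with allow_mult=false discussed above can be sketched as follows (a toy last-write-wins model under assumed semantics, not riak_kv's actual sibling resolution code): a delayed retry of an old value arrives with a newer last-modified timestamp and silently clobbers the writes that happened in between.

```python
import itertools

# Toy sketch of why a late coordinator retry is dangerous with
# allow_mult=false: siblings are collapsed by keeping the value with
# the newest last-modified timestamp, so a delayed retry of an old
# value can clobber intervening writes.

clock = itertools.count(1)

def put(store, key, value):
    store[key] = (value, next(clock))  # (value, last-modified "timestamp")

def merge_allow_mult_false(a, b):
    # allow_mult=false: keep whichever sibling was written last.
    return a if a[1] >= b[1] else b

store = {}
put(store, "k", "x")   # the original PUT writes value x
put(store, "k", "y")   # later PUTs lift the value to y...
put(store, "k", "z")   # ...and then to z

# A retried coordination of the original PUT finally lands: it writes
# value x again, but with a *newer* timestamp than z.
late_retry = ("x", next(clock))
store["k"] = merge_allow_mult_false(store["k"], late_retry)
print(store["k"][0])  # 'x' -- the writes y and z are silently lost
```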
I made an attempt to fix it with #1755. I must confess it's not ready for review yet, but I think it at least may serve as an illustration of sorts.
Fix now merged and released in both 2.9 and 3.0.
Riak KV may lose recent writes in the presence of heavy handoff activity.
I've made a test with the help of the Jepsen framework specifically to investigate this kind of buggy behaviour, because it has hit us several times already in our production environment. To be honest, this test observes an occurrence of this bug on almost every run. An ordinary test run follows roughly these steps:
I have attached a Jepsen report archive of one such test run. The most notable observations are in results.edn; here's an excerpt, redacted for brevity:
Basically this all means that client 25, while operating on key jepsen638479067-6, read value 20 and subsequently read value 31 quite unexpectedly, all in the absence of concurrent writes under this key.
Skimming through history.edn, here's the relevant history fragment, again redacted a bit for brevity:
It can be seen clearly that the value 31 written at index 16232 surfaced in the read at index 16407, even though in the meantime 3 subsequent writes had succeeded (16245, 16356, 16361).
Affected version(s)
Though it won't hurt to mention that I have successfully reproduced this bug under riak 2.2.3 as well.
Steps to reproduce
Clone aforementioned repo.
Start a test run with:
Wait for it to finish cleanly.
Note that tearing down a node with riak stop on 2.9.1 may become stuck for some reason at the end of a run; a simple ^C is fine then.
Look at the analysis report in store/latest/results.edn.
Suspected cause
Essentially this is caused by the fact that the put FSM does not account for the possibility of multiple forwards to another coordinator. I'll try to illustrate:
retry_put_coordinator_failure is true, which is the default per the cuttlefish schema.
This is why I believe there are occasional unrecognized messages in the logs, which are extraneous responses to original requests retried unwarrantedly:
One more piece of evidence which strengthens my suspicion is that turning off retry_put_coordinator_failure makes this buggy behaviour go away with quite high probability, if not completely, given the number of test runs I've performed so far.
Additional observations
Moreover, it's not clear to me why the acknowledgement is sent in the execute state of the FSM rather than in the validate state, since there's a possibility that execute will be skipped altogether, in the event of a parameter violation for example.
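This last concern can be sketched as a toy FSM walk (an assumed shape for illustration, not the real riak_kv_put_fsm): if the ack only happens in execute, a request rejected during validate never acks the forwarding node, which will then retry even though nothing was going to run.

```python
# Toy put-FSM state walk: the forwarding ack lives in `execute`, so a
# request rejected in `validate` (e.g. a parameter violation) skips
# `execute` entirely and the forwarding node never receives an ack.

def run_put_fsm(params_ok, acks):
    # validate state
    if not params_ok:
        return "rejected"            # execute is skipped altogether...
    # execute state
    acks.append("ack_to_forwarder")  # ...so this ack is never sent
    return "coordinated"

acks = []
print(run_put_fsm(params_ok=False, acks=acks))  # rejected
print(acks)  # [] -- no ack ever reached the forwarding node
```

Sending the ack from validate instead would cover the rejection path as well, which is presumably the point of the observation above.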