Yeah. That captures it @engelsanchez. I think this is a problem we introduced recently by treating errors on local read as `notfound`. And I think the answer is as simple as creating a new `vnodeid` as soon as local corruption is detected.
If for whatever reason a coordinating PUT to a vnode fails to read an existing K/V out of the backend (we have a catch for it here: https://github.com/basho/riak_kv/blob/develop/src/riak_kv_vnode.erl#L1258) then we have this problem.
@engelsanchez pointed out we have the same problem if a user removes the backend data and not the vnodeid, or removes the backend data while the vnode is running. At this point I'd say all bets are off and there is nothing we can do.
In the case of corruption where we do get an error attempting to read the data, we can do something (get the data from another vnode, generate a new vnodeid.) Ignoring the error and returning a `notfound` is not correct though.
Imagine putting to Key K with an empty vclock; the item has been written by vnode A before. Vnode A coordinates the write for this PUT. Vnode A's backend is corrupt due to cosmic rays. Vnode A attempts to read key K to get a local vector clock. Vnode A gets an error but we silently parlay that to a `notfound`. Vnode A creates a vclock entry of `{A, 1}` for the PUT and stores the item (even though the item on disk (that we can't read) has entry `{A, 3}`).
K's data is returned as a frontier object (not!) to the FSM which sends it downstream to Vnode B and C for local merge. Vnode B and C see that their local vclocks for K dominate the one they just received and drop the write and ack the FSM. The FSMs ack the user.
The user tries to read K. A's replica is dominated by B's and C's, read repair kicks in and A's data is replaced. We lost the data.
If on a read error on a co-ordinating PUT we increment the vnodeId at A, the write would be a sibling of the data at B and C and survive (even if it is a false sibling, it is better than a dropped write.)
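To make the dominance argument concrete, here is a toy sketch using the `[{Actor, {Counter, Timestamp}}]` clock shape from vclock.erl; `descends/2` below is a simplified stand-in for illustration, not the real vclock module:

```erlang
%% Toy illustration of why parlaying a read error into notfound loses the write.
%% The clock shape follows vclock.erl, but descends/2 here is a simplified stand-in.
-module(dominance_sketch).
-export([demo/0]).

%% A descends B if A has seen at least everything B has seen.
descends(A, B) ->
    lists:all(fun({Node, {CtrB, _TsB}}) ->
                      case lists:keyfind(Node, 1, A) of
                          {Node, {CtrA, _TsA}} -> CtrA >= CtrB;
                          false                -> false
                      end
              end, B).

demo() ->
    AtReplicasBandC = [{a, {3, ts}}],  %% B and C hold the clock {A, 3}
    NewWriteFromA   = [{a, {1, ts}}],  %% A can't read its copy, so starts at {A, 1}
    true  = descends(AtReplicasBandC, NewWriteFromA),  %% B/C dominate: write dropped
    false = descends(NewWriteFromA, AtReplicasBandC),
    write_dropped.
```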
@russelldb I don't believe that changing/incrementing the vnodeid is feasible. Even if the backend could tell us 100% of the time that there was corruption, we run the risk of it telling us that it noticed corruption 1000 times/second (to pull an arbitrary but not totally silly figure out of the air). There would be a different kind of actor explosion happening then, wouldn't there?
Is this totally crazy?
--- src/vclock.erl	2013-09-27 18:38:57.000000000 +0900
+++ /tmp/vclock.erl	2013-10-09 16:23:11.000000000 +0900
@@ -123,12 +123,15 @@
 increment(Node, IncTs, VClock) ->
     {{_Ctr, _TS}=C1,NewV} = case lists:keytake(Node, 1, VClock) of
         false ->
-            {{1, IncTs}, VClock};
+            {{big_starting_point(), IncTs}, VClock};
         {value, {_N, {C, _T}}, ModV} ->
             {{C + 1, IncTs}, ModV}
     end,
     [{Node,C1}|NewV].

+big_starting_point() ->
+    {X, Y, Z} = os:timestamp(),
+    (((X * 1000000) + Y) * 1000000) + Z. % Microseconds, lots & lots of 'em
+
 % @doc Return the list of all nodes that have ever incremented VClock.
 -spec all_nodes(VClock :: vclock()) -> [vclock_node()].
cc: @jtuple @jonmeredith Thoughts?
Just to summarize a long discussion in HipChat:

- Per vnode counter. Increment it on every co-ordinated write.
- Paranoidly persist it to disk every $THRESHOLD writes (with the vnodeid).
- Read it on start up and add $THRESHOLD to it (what if $THRESHOLD changes between start and stop (from conf, say?)).
- Remove it (like we do with vnodeid) when handoff completes.
- (Suffice to say the counter is just like the vnodeid is now as far as persistence goes, except it is persisted more often.)
- When a local `get` on a coordinating `put` returns `notfound`, use the vnodeid counter + 1 as the initial counter value in the vector clock (thus ensuring a frontier counter for the coordinating actor).
(Is that it @jtuple, @slfritchie, @jonmeredith ?)
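A minimal sketch of that counter scheme, just to pin down the moving parts. The module name, record, and file format below are invented for illustration; this is not the riak_kv implementation:

```erlang
%% Sketch only: a per-vnode counter that is bumped on every co-ordinated write,
%% persisted every ?THRESHOLD writes alongside the vnodeid, and leapt forward
%% by ?THRESHOLD on start up so it never falls behind a value already handed out.
-module(vnode_counter_sketch).
-export([new/2, bump/1, load/2]).

-define(THRESHOLD, 1000).

-record(cnt, {vnodeid, value = 0, flushed = 0, path}).

%% Fresh counter for a new vnodeid; persist the starting point immediately.
new(VnodeId, Path) ->
    C = #cnt{vnodeid = VnodeId, path = Path},
    ok = persist(C),
    C.

%% Called on every co-ordinated write; only touches disk every ?THRESHOLD bumps.
bump(C = #cnt{value = V, flushed = F}) ->
    C2 = C#cnt{value = V + 1},
    case (V + 1) - F >= ?THRESHOLD of
        true  -> ok = persist(C2),
                 C2#cnt{flushed = V + 1};
        false -> C2
    end.

%% On start up: read the last persisted value and add ?THRESHOLD, guaranteeing a
%% frontier count even if some bumps were never flushed before a crash.
load(VnodeId, Path) ->
    {ok, Bin} = file:read_file(Path),
    {VnodeId, Persisted} = binary_to_term(Bin),
    #cnt{vnodeid = VnodeId, value = Persisted + ?THRESHOLD,
         flushed = Persisted + ?THRESHOLD, path = Path}.

persist(#cnt{vnodeid = Id, value = V, path = Path}) ->
    file:write_file(Path, term_to_binary({Id, V})).
```

On a coordinating put that gets a local `notfound`, the current counter value + 1 would then seed the vclock entry instead of the constant 1.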
The counter's starting value should be really big, e.g. billions? Otherwise, you're in the same bad place as using the constant `1` like today?
@slfritchie I don't understand why? If there is no counter on disk to read, but there is a vnodeid, then create new vnodeid, otherwise the counter value on disk + threshold means we get a frontier count. Why the billions?
@slfritchie I am starting work on this now, wondering if you could articulate your objection to starting at `1`, or are you cool with it?
The plan above is no good. See riak_kv#726 and https://github.com/basho/riak_test/compare/bug;rdb;gh679 for details.
I think there are a number of possible sources of this behaviour and we should probably partition them a little better.
One question: why store the vnodeid on disk separately from the backend? If the vnodeid was in the backend then one source of the issue (backend data deleted, vnodeid data is not) would be resolved. I imagine this leads to issues for the memory backend, though.
We could also revert the change that treats an error on local read as a `notfound`; the user / client / system can then retry and hit another coordinator.
Then we are left with the case where there is undetectable corruption (only a/some key/value is affected but the vnodeid is still present in the backend). I have no idea how to solve this case at the moment.
I also need to do a little analysis to figure out if there is different behaviour when the PUT has a vclock and when it does not.
If we stick with the solution in riak_kv#726 the problem is manifested as a concurrent write causally dominating a write on disk. If we stick with what we have a write concurrent with data at a replica is accepted and subsequently dropped as it is dominated by what is on disk. Which is worse? Is there another way? I'm wondering if I'm just over thinking a tiny edge case?
cc @jonmeredith @engelsanchez @slfritchie @jtuple
OK, I had a think and hacked together a branch that partially solves the problem.
Since we never compare vclocks across, creating a new vnodeId (based on the vnode Id ++ some epoch) as the actor whenever there is a `notfound` on a coordinating put solves the issue.
It raises issues too…but…discuss.
This[1] works* for the case where the local `get` is a `notfound` and no vclock is provided on the put.
[1] https://github.com/basho/riak_kv/compare/bug;rdb;gh679-crazy-ivan
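For illustration only, "vnodeid ++ some epoch" could be as simple as suffixing the existing id with a timestamp; the function below is a guess at the shape, not the code in the branch:

```erlang
%% Sketch: derive a fresh actor id from the existing vnodeid plus an epoch, so a
%% coordinating put that sees a local notfound writes with a brand new actor.
%% Clock entries under the old vnodeid are then concurrent with the new write
%% rather than dominating it, so it survives as a (possibly false) sibling.
-module(epoch_actor_sketch).
-export([new_epoch_actor/1]).

new_epoch_actor(VnodeId) when is_binary(VnodeId) ->
    {Mega, Sec, Micro} = os:timestamp(),
    Epoch = (Mega * 1000000 + Sec) * 1000000 + Micro,
    <<VnodeId/binary, Epoch:64/integer>>.
```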
Hi, Russell. I've a couple of worries, ignoring the fact that I still like the `big_starting_point()` scheme better. :-)
- The `riak_kv_vnode:write_vnode_status/2` function isn't sufficiently robust. It's outside of the scope of this PR so far, but I recommend that it be fixed. IMO `file:write_file/2` isn't sufficient because it doesn't trigger an `fsync(2)` system call, so it's possible for a system crash to replace the old file with a truncated new file (probably all the way to zero bytes).
- `file:write_file/2` also funnels the write through the `file_server_2` process. The latency added by that serialization may be something to worry about.

One way around the latency hit would be to do the vnode status update asynchronously: have another process do the write and then send a `yo_i_am_done_updating_counter` message back to the vnode, which completes the update when it handles the `yo_...` message.
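The write-then-fsync-then-rename pattern being suggested, sketched out (this is not the current `write_vnode_status/2` code, and the module and function names are invented):

```erlang
%% Sketch: write the status to a temp file using the raw driver (bypassing the
%% file_server_2 process), fsync it, then atomically rename it over the old file,
%% so a crash leaves either the complete old file or the complete new one.
-module(status_write_sketch).
-export([write_status_durably/2]).

write_status_durably(Path, Status) ->
    TmpPath = Path ++ ".tmp",
    {ok, Fd} = file:open(TmpPath, [write, raw, binary]),
    ok = file:write(Fd, term_to_binary(Status)),
    ok = file:sync(Fd),
    ok = file:close(Fd),
    ok = file:rename(TmpPath, Path).
```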
Sounds like we're overloading the meaning of `notfound` and should have an `{error, repair_this_key_please}` return value from backends.
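Roughly what that could look like at the vnode's local get, assuming the usual `Mod:get(Bucket, Key, ModState)` backend shape; `repair_this_key_please` and the decision atoms are invented for illustration:

```erlang
%% Hypothetical: keep detected corruption distinct from a genuine notfound so
%% the coordinator never pretends the key has no history.
local_get(Mod, Bucket, Key, ModState) ->
    case Mod:get(Bucket, Key, ModState) of
        {ok, Value, ModState1} ->
            {use_local_vclock, Value, ModState1};
        {error, not_found, ModState1} ->
            %% genuinely absent: coordinating from an empty clock is fine
            {coordinate_fresh, ModState1};
        {error, repair_this_key_please, ModState1} ->
            %% corruption detected: repair / retry elsewhere instead of writing {A, 1}
            {abort_and_repair, ModState1}
    end.
```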
Sure, that deals with one case, but there are others: when you remove the data at the backend, for instance, this issue is still an issue.
Adding wood to the fire:
A scenario related to this discussion painfully surfaced recently and was discussed by the Core/KV cabal. It is possible for hinted handoff to finish much, much later than necessary. See basho/riak_kv#847. That, combined with deletes with `delete_mode` not set to `keep`, new writes to the deleted key, followed by an old fallback node finally completing hinted handoff due to a node restart, can lead to old data coming back to life and silently replacing the new data on read repair. Not writing to deleted keys or keeping their tombstones forever would prevent that problem. Alas, that is not the default mode and customers don't like the idea of an ever increasing number of tombstones.
As @jrwest rightly points out, delete mode never (i.e. don't reap tombstones) is a workaround for the doomstone data loss flavour of this bug. Given that is the most likely cause, and the least "byzantine", maybe that is enough for 2.1, with more comprehensive fixes for later versions.
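For reference, the tombstone-keeping workaround corresponds to the `delete_mode` setting in the `riak_kv` application config; a sketch of the relevant fragment (check the docs for your release for the exact placement):

```erlang
%% app.config / advanced.config fragment: never reap tombstones, closing the
%% doomstone-resurrection window at the cost of a tombstone per deleted key.
{riak_kv, [
    {delete_mode, keep}   %% other values: immediate, or a delay in milliseconds
]}
```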
Erm, I think we could have closed this by now, pretty sure it's dealt with
Agreed
Russell brought this up today, and perhaps he can elaborate on this:
/cc @jtuple @jrwest @evanmcc @Vagabond @russelldb