Yeah. That captures it @engelsanchez. I think this is a problem we introduced recently by treating errors on local read as `notfound`. And I think the answer is as simple as creating a new `vnodeid` as soon as local corruption is detected.
If for whatever reason a coordinating PUT to a vnode fails to read an existing K/V out of the backend (we have a catch for it here: https://github.com/basho/riak_kv/blob/develop/src/riak_kv_vnode.erl#L1258) then we have this problem.
@engelsanchez pointed out we have the same problem if a user removes the backend data and not the vnodeid, or removes the backend data while the vnode is running. At this point I'd say all bets are off and there is nothing we can do.
In the case of corruption where we do get an error attempting to read the data, we can do something (get the data from another vnode, generate a new vnodeid.) Ignoring the error and returning a `notfound` is not correct though.
Imagine putting to Key K with an empty vclock; the item has been written by vnode A before. Vnode A coordinates the write for this PUT. Vnode A's backend is corrupt due to cosmic rays. Vnode A attempts to read key K to get a local vector clock. Vnode A gets an error but we silently parlay that to a `notfound`. Vnode A creates a vclock entry of `{A, 1}` for the PUT and stores the item (even though the item on disk (that we can't read) has entry `{A, 3}`).
K's data is returned as a frontier object (not!) to the FSM which sends it downstream to Vnode B and C for local merge. Vnode B and C see that their local vclocks for K dominate the one they just received and drop the write and ack the FSM. The FSMs ack the user.
The user tries to read K. A's replica is dominated by B's and C's, read repair kicks in and A's data is replaced. We lost the data.
If on a read error on a co-ordinating PUT we increment the vnodeId at A, the write would be a sibling of the data at B and C and survive (even if it is a false sibling, it is better than a dropped write.)
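To make the dominance argument concrete, here is a toy sketch using the `[{Actor, {Counter, Timestamp}}]` clock shape from vclock.erl; `descends/2` below is a simplified stand-in for illustration, not the real vclock module:

```erlang
%% Toy illustration of why parlaying a read error into notfound loses the write.
%% The clock shape follows vclock.erl, but descends/2 here is a simplified stand-in.
-module(dominance_sketch).
-export([demo/0]).

%% A descends B if A has seen at least everything B has seen.
descends(A, B) ->
    lists:all(fun({Node, {CtrB, _TsB}}) ->
                      case lists:keyfind(Node, 1, A) of
                          {Node, {CtrA, _TsA}} -> CtrA >= CtrB;
                          false                -> false
                      end
              end, B).

demo() ->
    AtReplicasBandC = [{a, {3, ts}}],  %% B and C hold the clock {A, 3}
    NewWriteFromA   = [{a, {1, ts}}],  %% A can't read its copy, so starts at {A, 1}
    true  = descends(AtReplicasBandC, NewWriteFromA),  %% B/C dominate: write dropped
    false = descends(NewWriteFromA, AtReplicasBandC),
    write_dropped.
```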
@russelldb I don't believe that changing/incrementing the vnodeid is feasible. Even if the backend could tell us 100% of the time that there was corruption, we run the risk of it telling us that it noticed corruption 1000 times/second (to pull an arbitrary but not totally silly figure out of the air). There would be a different kind of actor explosion happening then, wouldn't there?
Is this totally crazy?
--- src/vclock.erl	2013-09-27 18:38:57.000000000 +0900
+++ /tmp/vclock.erl	2013-10-09 16:23:11.000000000 +0900
@@ -123,12 +123,15 @@
 increment(Node, IncTs, VClock) ->
     {{_Ctr, _TS}=C1,NewV} = case lists:keytake(Node, 1, VClock) of
         false ->
-            {{1, IncTs}, VClock};
+            {{big_starting_point(), IncTs}, VClock};
         {value, {_N, {C, _T}}, ModV} ->
             {{C + 1, IncTs}, ModV}
     end,
     [{Node,C1}|NewV].

+big_starting_point() ->
+    {X, Y, Z} = os:timestamp(),
+    (((X * 1000000) + Y) * 1000000) + Z. % Microseconds, lots & lots of 'em
+
 % @doc Return the list of all nodes that have ever incremented VClock.
 -spec all_nodes(VClock :: vclock()) -> [vclock_node()].
cc: @jtuple @jonmeredith Thoughts?
Just to summarize a long discussion in HipChat:

- Per vnode counter. Increment it on every co-ordinated write.
- Paranoidly persist it to disk every $THRESHOLD writes (with the vnodeid).
- Read it on start up and add $THRESHOLD to it (what if $THRESHOLD changes between start and stop (from conf, say?)).
- Remove it (like we do with vnodeid) when handoff completes.
- (Suffice to say the counter is just like the vnodeid is now as far as persistence goes, except it is persisted more often.)
- When a local `get` on a coordinating `put` returns `notfound`, use the vnodeid counter + 1 as the initial counter value in the vector clock (thus ensuring a frontier counter for the coordinating actor).
(Is that it @jtuple, @slfritchie, @jonmeredith ?)
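A minimal sketch of that counter scheme, just to pin down the moving parts. The module name, record, and file format below are invented for illustration; this is not the riak_kv implementation:

```erlang
%% Sketch only: a per-vnode counter that is bumped on every co-ordinated write,
%% persisted every ?THRESHOLD writes alongside the vnodeid, and leapt forward
%% by ?THRESHOLD on start up so it never falls behind a value already handed out.
-module(vnode_counter_sketch).
-export([new/2, bump/1, load/2]).

-define(THRESHOLD, 1000).

-record(cnt, {vnodeid, value = 0, flushed = 0, path}).

%% Fresh counter for a new vnodeid; persist the starting point immediately.
new(VnodeId, Path) ->
    C = #cnt{vnodeid = VnodeId, path = Path},
    ok = persist(C),
    C.

%% Called on every co-ordinated write; only touches disk every ?THRESHOLD bumps.
bump(C = #cnt{value = V, flushed = F}) ->
    C2 = C#cnt{value = V + 1},
    case (V + 1) - F >= ?THRESHOLD of
        true  -> ok = persist(C2),
                 C2#cnt{flushed = V + 1};
        false -> C2
    end.

%% On start up: read the last persisted value and add ?THRESHOLD, guaranteeing a
%% frontier count even if some bumps were never flushed before a crash.
load(VnodeId, Path) ->
    {ok, Bin} = file:read_file(Path),
    {VnodeId, Persisted} = binary_to_term(Bin),
    #cnt{vnodeid = VnodeId, value = Persisted + ?THRESHOLD,
         flushed = Persisted + ?THRESHOLD, path = Path}.

persist(#cnt{vnodeid = Id, value = V, path = Path}) ->
    file:write_file(Path, term_to_binary({Id, V})).
```

On a coordinating put that gets a local `notfound`, the current counter value + 1 would then seed the vclock entry instead of the constant 1.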
The counter's starting value should be really big, e.g. billions? Otherwise, you're in the same bad place as using the constant `1` like today?
@slfritchie I don't understand why? If there is no counter on disk to read, but there is a vnodeid, then create new vnodeid, otherwise the counter value on disk + threshold means we get a frontier count. Why the billions?
@slfritchie I am starting work on this now, wondering if you could articulate your objection to starting at `1`, or are you cool with it?
The plan above is no good. See riak_kv#726 and https://github.com/basho/riak_test/compare/bug;rdb;gh679 for details.
I think there are a number of possible sources of this behaviour and we should probably partition them a little better.
One question: why store the vnodeid on disk separately from the backend? If the vnodeid was in the backend then one source of the issue (backend data deleted, vnodeid data is not) would be resolved. I imagine this leads to issues for the memory backend, though.
We could also revert the change that treats an error on local read as a `notfound`; the user / client / system can then retry and hit another coordinator.
Then we are left with the case where there is undetectable corruption (only a/some key/value is affected but the vnodeid is still present in the backend). I have no idea how to solve this case at the moment.
I also need to do a little analysis to figure out if there is different behaviour when the PUT has a vclock and when it does not.
If we stick with the solution in riak_kv#726 the problem is manifested as a concurrent write causally dominating a write on disk. If we stick with what we have a write concurrent with data at a replica is accepted and subsequently dropped as it is dominated by what is on disk. Which is worse? Is there another way? I'm wondering if I'm just over thinking a tiny edge case?
cc @jonmeredith @engelsanchez @slfritchie @jtuple
OK, I had a think and hacked together a branch that partially solves the problem.
Since we never compare vclocks across, creating a new vnodeId (based on the vnode Id ++ some epoch) as the actor whenever there is a `notfound` on a coordinating put solves the issue.
It raises issues too…but…discuss.
This[1] works* for the case where the local `get` is a `notfound` and no vclock is provided on the put.
[1] https://github.com/basho/riak_kv/compare/bug;rdb;gh679-crazy-ivan
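For illustration only, "vnodeid ++ some epoch" could be as simple as suffixing the existing id with a timestamp; the function below is a guess at the shape, not the code in the branch:

```erlang
%% Sketch: derive a fresh actor id from the existing vnodeid plus an epoch, so a
%% coordinating put that sees a local notfound writes with a brand new actor.
%% Clock entries under the old vnodeid are then concurrent with the new write
%% rather than dominating it, so it survives as a (possibly false) sibling.
-module(epoch_actor_sketch).
-export([new_epoch_actor/1]).

new_epoch_actor(VnodeId) when is_binary(VnodeId) ->
    {Mega, Sec, Micro} = os:timestamp(),
    Epoch = (Mega * 1000000 + Sec) * 1000000 + Micro,
    <<VnodeId/binary, Epoch:64/integer>>.
```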
Hi, Russell. I've a couple of worries, ignoring the fact that I still like the `big_starting_point()` scheme better. :-)
- The `riak_kv_vnode:write_vnode_status/2` function isn't sufficiently robust. It's outside of the scope of this PR so far, but I recommend that it be fixed. IMO `file:write_file/2` isn't sufficient because it doesn't trigger an `fsync(2)` system call, so it's possible for a system crash to replace the old file with a truncated new file (probably all the way to zero bytes).
- `file:write_file/2` also funnels the write through the `file_server_2` process. The latency added by that serialization may be something to worry about.

One way around the latency hit would be to do the vnode status update asynchronously: have another process do the write and then send a `yo_i_am_done_updating_counter` message back to the vnode, which completes the update when it handles the `yo_...` message.
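The write-then-fsync-then-rename pattern being suggested, sketched out (this is not the current `write_vnode_status/2` code, and the module and function names are invented):

```erlang
%% Sketch: write the status to a temp file using the raw driver (bypassing the
%% file_server_2 process), fsync it, then atomically rename it over the old file,
%% so a crash leaves either the complete old file or the complete new one.
-module(status_write_sketch).
-export([write_status_durably/2]).

write_status_durably(Path, Status) ->
    TmpPath = Path ++ ".tmp",
    {ok, Fd} = file:open(TmpPath, [write, raw, binary]),
    ok = file:write(Fd, term_to_binary(Status)),
    ok = file:sync(Fd),
    ok = file:close(Fd),
    ok = file:rename(TmpPath, Path).
```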
Sounds like we're overloading the meaning of `notfound` and should have an `{error, repair_this_key_please}` return value from backends.
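Roughly what that could look like at the vnode's local get, assuming the usual `Mod:get(Bucket, Key, ModState)` backend shape; `repair_this_key_please` and the decision atoms are invented for illustration:

```erlang
%% Hypothetical: keep detected corruption distinct from a genuine notfound so
%% the coordinator never pretends the key has no history.
local_get(Mod, Bucket, Key, ModState) ->
    case Mod:get(Bucket, Key, ModState) of
        {ok, Value, ModState1} ->
            {use_local_vclock, Value, ModState1};
        {error, not_found, ModState1} ->
            %% genuinely absent: coordinating from an empty clock is fine
            {coordinate_fresh, ModState1};
        {error, repair_this_key_please, ModState1} ->
            %% corruption detected: repair / retry elsewhere instead of writing {A, 1}
            {abort_and_repair, ModState1}
    end.
```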
Sure, that deals with one case, but there are others: when you remove the data at the backend, for instance, this issue is still an issue.
Adding wood to the fire:
A scenario related to this discussion painfully surfaced recently and was discussed by the Core/KV cabal. It is possible for hinted handoff to finish much, much later than necessary. See basho/riak_kv#847. That, combined with deletes with `delete_mode` not set to `keep`, new writes to the deleted key, followed by an old fallback node finally completing hinted handoff due to a node restart, can lead to old data coming back to life and silently replacing the new data on read repair. Not writing to deleted keys or keeping their tombstones forever would prevent that problem. Alas, that is not the default mode and customers don't like the idea of an ever increasing number of tombstones.
As @jrwest rightly points out, delete mode never (i.e. don't reap tombstones) is a workaround for the doomstone data loss flavour of this bug. Given that is the most likely cause, and the least "byzantine", maybe that is enough for 2.1, with more comprehensive fixes for later versions.
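For reference, the tombstone-keeping workaround corresponds to the `delete_mode` setting in the `riak_kv` application config; a sketch of the relevant fragment (check the docs for your release for the exact placement):

```erlang
%% app.config / advanced.config fragment: never reap tombstones, closing the
%% doomstone-resurrection window at the cost of a tombstone per deleted key.
{riak_kv, [
    {delete_mode, keep}   %% other values: immediate, or a delay in milliseconds
]}
```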
Erm, I think we could have closed this by now, pretty sure it's dealt with
Agreed
Russell brought this up today, and perhaps he can elaborate on this:
/cc @jtuple @jrwest @evanmcc @Vagabond @russelldb