Open russelldb opened 10 years ago
https://github.com/basho/riak_core/blob/develop/src/vclock.erl#L235 is probably incorrect, since it leaves the youngest entry at the head of the list.
See https://github.com/basho/riak_kv/blob/develop/src/riak_kv_vnode.erl#L1180 and https://github.com/basho/riak_kv/blob/develop/src/riak_kv_vnode.erl#L1338 — these mean pruning is never called on read repair.
So, pruning is only ever called on a coordinated put. The PruneTime is the StartTime from the PUT fsm, and this time is the time of the youngest entry in the clock, since the coord_put just incremented said clock. All of which leaves a vclock entry with a ts of just about riak_core_util:moment() at the head of the list.
Then this: https://github.com/basho/riak_core/blob/develop/src/vclock.erl#L243. Now, that Now and that HeadTime… aren't they always going to be the same time? So isn't the result of (Now - HeadTime) always going to be 0? And so we return the clock untouched. If somehow this is not true (how, anyone?), pruning will proceed to trim the newest entry first.
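To make the timing claim concrete, here is a small Python sketch (all function names here are illustrative stand-ins, not the riak_core API) of why PruneTime and the head entry's timestamp coincide on a coordinated put:

```python
# Illustrative model only: entries are (actor, (counter, timestamp)).

def moment():
    # Stand-in for riak_core_util:moment(): wall-clock seconds.
    return 63573819372

def coord_put_increment(clock, actor, now):
    # A coordinated put bumps the coordinator's counter and stamps it
    # with the PUT fsm's start time, leaving it as the youngest entry.
    counter = dict(clock).get(actor, (0, 0))[0] + 1
    rest = [e for e in clock if e[0] != actor]
    return [(actor, (counter, now))] + rest

start_time = moment()                      # the PUT fsm's StartTime
clock = coord_put_increment([], b"vnode-1", start_time)
head_time = clock[0][1][1]                 # youngest entry's timestamp

# PruneTime is that same StartTime, so the delta is (almost always) 0.
assert start_time - head_time == 0
```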
From talking to @bsparrow435 it seems pruning is never really seen at client sites, as they never have vclocks as long as 50 entries, but I still think it is worth fixing this bug, and better documenting the code too.
I agree with @russelldb. The sort function is not working as it should. Example:
13> V4.
[{<<210,103,72,101,83,39,134,252>>,{1,63573819319}},
{<<153,84,36,117,83,39,134,197>>,{1,63573819333}},
{<<238,83,20,227,83,39,134,238>>,{1,63573819372}}]
14> SortV1 = lists:sort(fun({N1,{_,T1}},{N2,{_,T2}}) -> {T1,N1} < {T2,N2} end, V4).
[{<<210,103,72,101,83,39,134,252>>,{1,63573819319}},
{<<153,84,36,117,83,39,134,197>>,{1,63573819333}},
{<<238,83,20,227,83,39,134,238>>,{1,63573819372}}]
15> SortV2 = lists:sort(fun({N1,{_,T1}},{N2,{_,T2}}) -> {T1,N1} > {T2,N2} end, V4).
[{<<238,83,20,227,83,39,134,238>>,{1,63573819372}},
{<<153,84,36,117,83,39,134,197>>,{1,63573819333}},
{<<210,103,72,101,83,39,134,252>>,{1,63573819319}}]
The first sort function's result has the newest entry first, which will result in the behavior @russelldb described. The function needs to be reversed so it returns the oldest entry as the head of the list.
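As a cross-check of what each comparator actually produces, here is a quick Python equivalent of the two sorts above (tuple comparison stands in for Erlang term order, which holds for these fixed-shape entries):

```python
# Python equivalent of the two Erlang sorts; the sort key mirrors the
# Erlang funs, which compare {Timestamp, Actor} tuples.
v4 = [
    (bytes([210, 103, 72, 101, 83, 39, 134, 252]), (1, 63573819319)),
    (bytes([153, 84, 36, 117, 83, 39, 134, 197]), (1, 63573819333)),
    (bytes([238, 83, 20, 227, 83, 39, 134, 238]), (1, 63573819372)),
]

sort_v1 = sorted(v4, key=lambda e: (e[1][1], e[0]))                # the '<' fun
sort_v2 = sorted(v4, key=lambda e: (e[1][1], e[0]), reverse=True)  # the '>' fun

# The '<' sort leaves the smallest (oldest) timestamp at the head;
# the '>' sort leaves the largest (newest) timestamp there.
assert sort_v1[0][1][1] == 63573819319
assert sort_v2[0][1][1] == 63573819372
```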
fwiw, I remember reading through the code about a week ago (after seeing this issue) and drawing the same conclusions. We should either fix it or remove pruning altogether if we aren't even regularly hitting the default threshold (iirc pruning was meant to address bloat with client-side vclocks).
Plan is
But none of it is high priority (like a 2.0 blocker) since it may never actually be run IRL.
Sound fair?
DISCLAIMER: I'm basing the following comments off vague memories from over 2 years ago.
EDIT: I meant to say that client-side ids caused actor explosion, and thus we needed pruning to cull the actor list. This pruning could thus cause superfluous siblings when combined with things like MDC. Also see my next comment.
Pruning probably doesn't matter as much as it did back in the 0.14.x days. Client-side ids would cause massive sibling explosion for some use cases and thus we needed some way to cull the list. But I don't need to tell either of you that :). I made a change to set the small/large size to be the same (50 IIRC) because pruning can cause bugs with MDC. Something like tombstone resurrection and I think perhaps data loss. This goes back to the days when Ian Plosker, who found the bug, was a DA! My change was an attempt to avoid pruning as much as possible without actually removing the feature because we still had rare cases of sibling explosion (e.g. the internal tech talk where the DVV creators showed us sibling explosion even with vnode vclocks). AFAIK, pruning, even if implemented correctly, can still cause issues with MDC. So if there are ways to completely avoid it that would be great. Anyways, just wanted to add some vague historical context in case it helps.
Interesting. I wasn't aware that pruning and sibling explosion were related; pruning would actually make siblings more likely. I thought pruning came from actor explosion from client-side ids. Brian, above, suggested that it rarely if ever occurs now we have vnode ids. Some CSEs, in some situations, have set young_vclock to -1 to ensure no pruning ever happens. I imagine this is for some MDC issue, and I'll do some more research on that.
Thanks very much for adding some context. I think this is why adding stats around pruning is essential, so CSEs/Customers can be aware if it even happens.
I wasn't aware that pruning and sibling explosion were related. Pruning would actually make siblings more likely. I thought pruning came from actor explosion from client side ids.
Crap, I meant actor explosion, not sibling, sigh.
Here is my small_vclock commit. You can tell how old it is by the fact that I reference bugzilla, haha. In that commit I state that this is to prevent superfluous siblings in MDC, which is true, but I also could swear that Ian found even more insidious behavior than that in some cases. But, I'm working with very fuzzy memory here. Happy hunting :).
https://github.com/basho/riak_kv/commit/cdd6448557e74248d1f3ad8c146de30400e131a6
heh, thanks again @rzezeski. We're hiring you know :trollface:
This hit a CS customer over the past weekend and resulted in handoff creating siblings in an infinite loop. First time I've seen it, but not surprising with CS, as the manifest objects are often updated simultaneously, and the customer was restarting nodes across the cluster repeatedly during a leave operation.
Is pruning really needed at all since riak dropped client-side id's? Or is this only a recent change, can't remember exactly.
It is code from the age of client-side ids, but to hit it, I suspect they had at least 50 actor ids in there, so something is going on.
Jon Meredith Chief Architect Basho Technologies, Inc. jmeredith@basho.com
My experience is that once you hit pruning, the system is likely to go berserk in siblings; most likely because you end up activating pruning on a write-write conflict.
So to avoid the problem, we'd have to mark VCs as pruned - with a "prune stone" (some magic actor id), or recognise them as such because of their length - and in that case do a special filtering on both VCs before comparison by removing entries that are in the (wallclock-wise) old end of the VCs which are not occurring in both VCs being compared - or some such.
Without such special-case (heuristic / scientifically incorrect) recognition and logic for pruned VCs, you'd always end up with conflicting writes on all following PUTs. If the given object is a particularly write-contended object, then you drive head-first right into an explosion.
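The filtering Kresten describes could be sketched like this in Python (every name here is invented for illustration; this is one possible reading of the proposal, not anything in riak_core): before comparing two vclocks, drop entries in the wallclock-old end that occur in only one of them, so a pruned clock does not spuriously conflict with an unpruned sibling.

```python
# Hypothetical sketch of pre-comparison filtering for pruned vclocks.
# Entries are (actor, (counter, timestamp)).

def strip_unshared_old(vc_a, vc_b, cutoff):
    # Keep an entry if its actor appears in both clocks, or if it is
    # young enough (timestamp >= cutoff) that it can't have been pruned.
    shared = {a for a, _ in vc_a} & {a for a, _ in vc_b}
    keep = lambda vc: [(a, (c, t)) for a, (c, t) in vc
                       if a in shared or t >= cutoff]
    return keep(vc_a), keep(vc_b)

full   = [(b"old1", (3, 100)), (b"old2", (2, 110)), (b"x", (1, 500))]
pruned = [(b"x", (1, 500))]          # the old entries were pruned away

a, b = strip_unshared_old(full, pruned, cutoff=200)
# After filtering, the clocks agree instead of looking concurrent.
assert a == b == [(b"x", (1, 500))]
```

As the comment notes, this is heuristic: choosing the cutoff (and knowing when a clock has been pruned at all) is exactly the unsolved part.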
@krestenkrab this is your experience even with DVV? I thought with DVV we had a handle on sibling explosion.
I think I see what you're advocating: having a pruned vclock prune other clocks (like an epidemic) before it compares with them… but how do you unset the "prune stone" (nice term!)? How do you know when you've seen all vclocks for key X and they've all been pruned? Or do you just prune clocks to the same length all the time? Whose length? The longest? The shortest? Some heuristic (max 25?) Oh, I think also: never prune an entry for which you have a dotted value.
If the pruning algorithm weren't broken (i.e. actually removed the oldest, not the newest, entry), and DVV was in use, I don't think you'd see the issues you describe. But that is just my intuition. I also think we can leverage more than timestamps: with the new key epoch stuff coming as fixes for kv679, we could prune old actor epochs, maybe.
I think a customer issue does bump this up in priority though.
Vclock from one of the objects with 50 entries:
[{<<3,70,245,231>>,{3,63548129411}},
{<<7,252,201,94>>,{3,63548129481}},
{<<3,141,48,21>>,{6,63548129489}},
{<<3,93,161,119>>,{3,63548129536}},
{<<4,161,224,140>>,{3,63548129543}},
{<<4,145,11,132>>,{3,63548129546}},
{<<5,227,241,143>>,{9,63548129551}},
{<<0,157,249,56>>,{3,63548129559}},
{<<6,184,33,39>>,{3,63548129572}},
{<<4,252,150,54>>,{3,63548129626}},
{<<3,87,106,23>>,{3,63548130844}},
{<<5,240,111,19>>,{3,63548131603}},
{<<193,92,245,25,82,82,166,161>>,{2,63548367525}},
{<<193,92,245,25,82,82,166,169>>,{7,63548367629}},
{<<193,92,245,25,82,83,3,242>>,{1,63548391429}},
{<<193,92,245,25,82,83,4,166>>,{21,63548391636}},
{<<198,49,49,0,82,78,219,179>>,{458816,63551014762}},
{<<47,82,148,53,82,31,146,82>>,{459611,63551014770}},
{<<177,54,1,150,82,31,146,155>>,{461865,63551273842}},
{<<193,92,245,25,82,125,49,139>>,{2934,63551275950}},
{<<157,118,150,76,82,124,251,213>>,{3028,63551449793}},
{<<198,49,49,0,82,129,54,128>>,{1470,63551449806}},
{<<193,92,245,25,82,129,58,117>>,{3437,63551452235}},
{<<198,49,49,0,82,130,94,215>>,{1426,63551656231}},
{<<157,118,150,76,82,130,94,234>>,{1393,63551656233}},
{<<193,92,245,25,82,135,242,139>>,{10,63551860460}},
{<<47,82,148,53,82,131,156,131>>,{3276,63551878459}},
{<<198,49,49,0,82,134,53,205>>,{2385,63552060436}},
{<<198,49,49,0,82,139,174,198>>,{7744,63553169809}},
{<<177,54,1,150,82,133,25,99>>,{24900,63553349120}},
{<<157,118,150,76,82,159,114,19>>,{7971,63554040182}},
{<<177,54,1,150,82,169,234,66>>,{5585,63554482380}},
{<<193,92,245,25,82,138,137,38>>,{36003,63556196399}},
{<<198,49,49,0,82,158,74,88>>,{27052,63556371129}},
{<<177,54,1,150,82,204,72,15>>,{979,63556371147}},
{<<157,118,150,76,82,176,89,226>>,{25664,63556383264}},
{<<157,118,150,76,82,205,138,112>>,{2103,63556459007}},
{<<47,82,148,53,82,208,86,245>>,{2912,63556721113}},
{<<47,82,148,53,82,212,21,175>>,{1506,63556892140}},
{<<157,118,150,76,82,206,248,25>>,{5768,63557667023}},
{<<177,54,1,150,82,225,31,178>>,{76883,63573801491}},
{<<"sx÷`RÍ¿V">>,{146680,63580358731}},
{<<198,49,49,0,82,205,185,85>>,{104950,63580731492}},
{<<198,49,49,0,84,64,123,57>>,{1,63580731562}},
{<<"sx÷`TI;ð">>,{10,63581304846}},
{<<"sx÷`TI<Á">>,{18,63581305123}},
{<<115,120,247,96,84,73,61,141>>,{8,63581305252}},
{<<88,85,164,163,84,38,38,147>>,{33533,63581310328}},
{<<47,82,148,53,84,64,123,101>>,{16033,63581310331}},
{<<177,54,1,150,84,59,178,208>>,{25079,63581311071}}]
It was pretty-printed, so some actor ids were obfuscated to ASCII, but you can clearly see 50 unique actor ids. ring_size=256
Hmmm, there's some old school client-side actor ids in there - the 4 byte ones at the start.
Would be interesting to do something like
[{Actor, calendar:gregorian_seconds_to_datetime(When)} || {Actor, {_Count, When}} <- AllThatCrap].
and see the distribution of ages… Is this just an older cluster where ownership has changed enough, or which has had significant enough fallback activity, that there are a large number of actor ids?
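For reference, those Gregorian-second timestamps can be decoded outside Erlang too; a small Python sketch (62167219200 is the Gregorian-seconds value of the Unix epoch, i.e. calendar:datetime_to_gregorian_seconds({{1970,1,1},{0,0,0}})):

```python
from datetime import datetime, timezone

# Offset between Erlang's year-0 Gregorian seconds and the Unix epoch.
GREGORIAN_UNIX_OFFSET = 62167219200

def gregorian_seconds_to_datetime(secs):
    # Same result as calendar:gregorian_seconds_to_datetime/1 (UTC).
    return datetime.fromtimestamp(secs - GREGORIAN_UNIX_OFFSET, tz=timezone.utc)

# First entry from the dump above:
dt = gregorian_seconds_to_datetime(63548129411)
assert (dt.year, dt.month, dt.day) == (2013, 10, 4)
assert (dt.hour, dt.minute, dt.second) == (18, 10, 11)
```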
The earliest version of this cluster was 1.4.2. Now on 1.4.10.
[{<<3,70,245,231>>,{{2013,10,4},{18,10,11}}},
{<<7,252,201,94>>,{{2013,10,4},{18,11,21}}},
{<<3,141,48,21>>,{{2013,10,4},{18,11,29}}},
{<<3,93,161,119>>,{{2013,10,4},{18,12,16}}},
{<<4,161,224,140>>,{{2013,10,4},{18,12,23}}},
{<<4,145,11,132>>,{{2013,10,4},{18,12,26}}},
{<<5,227,241,143>>,{{2013,10,4},{18,12,31}}},
{<<0,157,249,56>>,{{2013,10,4},{18,12,39}}},
{<<6,184,33,39>>,{{2013,10,4},{18,12,52}}},
{<<4,252,150,54>>,{{2013,10,4},{18,13,46}}},
{<<3,87,106,23>>,{{2013,10,4},{18,34,4}}},
{<<5,240,111,19>>,{{2013,10,4},{18,46,43}}},
{<<193,92,245,25,82,82,166,161>>,{{2013,10,7},{12,18,45}}},
{<<193,92,245,25,82,82,166,169>>,{{2013,10,7},{12,20,29}}},
{<<193,92,245,25,82,83,3,242>>,{{2013,10,7},{18,57,9}}},
{<<193,92,245,25,82,83,4,166>>,{{2013,10,7},{19,0,36}}},
{<<198,49,49,0,82,78,219,179>>,{{2013,11,7},{3,39,22}}},
{<<47,82,148,53,82,31,146,82>>,{{2013,11,7},{3,39,30}}},
{<<177,54,1,150,82,31,146,155>>,{{2013,11,10},{3,37,22}}},
{<<193,92,245,25,82,125,49,139>>,{{2013,11,10},{4,12,30}}},
{<<157,118,150,76,82,124,251,213>>,{{2013,11,12},{4,29,53}}},
{<<198,49,49,0,82,129,54,128>>,{{2013,11,12},{4,30,6}}},
{<<193,92,245,25,82,129,58,117>>,{{2013,11,12},{5,10,35}}},
{<<198,49,49,0,82,130,94,215>>,{{2013,11,14},{13,50,31}}},
{<<157,118,150,76,82,130,94,234>>,{{2013,11,14},{13,50,33}}},
{<<193,92,245,25,82,135,242,139>>,{{2013,11,16},{22,34,20}}},
{<<47,82,148,53,82,131,156,131>>,{{2013,11,17},{3,34,19}}},
{<<198,49,49,0,82,134,53,205>>,{{2013,11,19},{6,7,16}}},
{<<198,49,49,0,82,139,174,198>>,{{2013,12,2},{2,16,49}}},
{<<177,54,1,150,82,133,25,99>>,{{2013,12,4},{4,5,20}}},
{<<157,118,150,76,82,159,114,19>>,{{2013,12,12},{4,3,2}}},
{<<177,54,1,150,82,169,234,66>>,{{2013,12,17},{6,53,0}}},
{<<193,92,245,25,82,138,137,38>>,{{2014,1,6},{2,59,59}}},
{<<198,49,49,0,82,158,74,88>>,{{2014,1,8},{3,32,9}}},
{<<177,54,1,150,82,204,72,15>>,{{2014,1,8},{3,32,27}}},
{<<157,118,150,76,82,176,89,226>>,{{2014,1,8},{6,54,24}}},
{<<157,118,150,76,82,205,138,112>>,{{2014,1,9},{3,56,47}}},
{<<47,82,148,53,82,208,86,245>>,{{2014,1,12},{4,45,13}}},
{<<47,82,148,53,82,212,21,175>>,{{2014,1,14},{4,15,40}}},
{<<157,118,150,76,82,206,248,25>>,{{2014,1,23},{3,30,23}}},
{<<177,54,1,150,82,225,31,178>>,{{2014,7,28},{21,18,11}}},
{<<"sx÷`RÍ¿V">>,{{2014,10,12},{18,45,31}}},
{<<198,49,49,0,82,205,185,85>>,{{2014,10,17},{2,18,12}}},
{<<198,49,49,0,84,64,123,57>>,{{2014,10,17},{2,19,22}}},
{<<"sx÷`TI;ð">>,{{2014,10,23},{17,34,6}}},
{<<"sx÷`TI<Á">>,{{2014,10,23},{17,38,43}}},
{<<115,120,247,96,84,73,61,141>>,{{2014,10,23},{17,40,52}}},
{<<88,85,164,163,84,38,38,147>>,{{2014,10,23},{19,5,28}}},
{<<47,82,148,53,84,64,123,101>>,{{2014,10,23},{19,5,31}}},
{<<177,54,1,150,84,59,178,208>>,{{2014,10,23},{19,17,51}}}]
This cluster has been active since mid 2013. It has gone through a couple node additions but has been restarted many times as well.
What's the n-val? That's a lot of actors. Are they on a cloud provider?
n_val = 3 RiakCS cluster. Dedicated hardware in their own datacenter. Current cluster size is 6 nodes.
They have a streaming video use case so they update a small number of objects once every 2 seconds, another small group of objects once every 10 seconds, and then all other objects are written once and then read many times.
This vclock is from one of the 2 second update objects.
Then this: https://github.com/basho/riak_core/blob/develop/src/vclock.erl#L243. Now, that Now and that HeadTime… aren't they always going to be the same time? So isn't the result of (Now - HeadTime) always going to be 0? And so we return the clock untouched. If somehow this is not true (how, anyone?), pruning will proceed to trim the newest entry first.
I was playing around with a 3-node devrel cluster and a bucket with small_vclock set to 1 and big_vclock set to 2, and I definitely see cases where Now - HeadTime is not zero and is enough to trigger pruning. After that, things are a mess as far as sibling count when concurrent writes are happening.
Pretty sure this is a non-issue after all. Maybe some small investigation task is needed to confirm, but I think the original report was a "brain fart", and in fact, the oldest are dropped first (since the oldest timestamps are the smallest.)
_[posted via JIRA by Russell Brown]_
See https://github.com/basho/riak_core/blob/develop/src/vclock.erl#L251
The logic appears to be: sort the vclock so that the youngest entry is first. If the clock is longer than small_vclock, and its youngest entry is older than young_vclock, and it's longer than big_vclock, and its youngest entry is older than old_vclock, we remove the youngest entry from the vclock and start over again.

This seems wrong; shouldn't we remove the oldest entry?