basho / riak_core

Distributed systems infrastructure used by Riak.

Vclock pruning should probably drop the oldest entry first #613

Open · russelldb opened 10 years ago

russelldb commented 10 years ago

See https://github.com/basho/riak_core/blob/develop/src/vclock.erl#L251

The logic appears to be:

Sort the vclock so that the youngest entry is first. 

If the clock is longer than small_vclock, and its youngest entry is older than young_vclock, and it's longer than big_vclock, and its youngest entry is older than old_vclock, then we remove the youngest entry from the vclock and start over again.

This seems wrong; shouldn't we remove the oldest entry?
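
For reference while reading the rest of the thread, here is a rough sketch of the control flow in the linked prune code. It is a paraphrase, not a verbatim excerpt: the module and function names are made up, and proplists:get_value/2 stands in for however vclock.erl actually looks up the bucket properties.

-module(vclock_prune_sketch).
-export([prune/3]).

%% Entries are {ActorId, {Counter, Timestamp}}.
prune(Clock, Now, BProps) ->
    %% Deterministic sort on {Timestamp, ActorId}; the entry with the smallest
    %% {Timestamp, ActorId} pair ends up at the head of the list.
    Sorted = lists:sort(fun({N1, {_, T1}}, {N2, {_, T2}}) -> {T1, N1} < {T2, N2} end, Clock),
    prune_loop(Sorted, Now, BProps).

prune_loop(Clock, Now, BProps) ->
    case length(Clock) =< proplists:get_value(small_vclock, BProps) of
        true -> Clock;                                   %% short enough: keep untouched
        false ->
            {_, {_, HeadTime}} = hd(Clock),
            case (Now - HeadTime) < proplists:get_value(young_vclock, BProps) of
                true -> Clock;                           %% head entry still young: keep untouched
                false -> maybe_drop_head(Clock, Now, BProps, HeadTime)
            end
    end.

maybe_drop_head(Clock, Now, BProps, HeadTime) ->
    case length(Clock) > proplists:get_value(big_vclock, BProps)
         orelse (Now - HeadTime) > proplists:get_value(old_vclock, BProps) of
        true -> prune_loop(tl(Clock), Now, BProps);      %% drop the head entry and re-check
        false -> Clock
    end.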

russelldb commented 10 years ago

More comments

https://github.com/basho/riak_core/blob/develop/src/vclock.erl#L235 is probably incorrect, since it leaves the youngest entry at the head of the list.

See https://github.com/basho/riak_kv/blob/develop/src/riak_kv_vnode.erl#L1180 and https://github.com/basho/riak_kv/blob/develop/src/riak_kv_vnode.erl#L1338: these mean pruning is never called on read repair.

So, pruning is only ever called on a coordinated put. The PruneTime is the StartTime from the PUT fsm. And this time is the time of the youngest entry in the clock, since the coord_put just incremented said clock. All of which leaves a vclock entry with a ts of just about riak_core_util:moment() at the head of the list.

Then this https://github.com/basho/riak_core/blob/develop/src/vclock.erl#L243. Now, that Now and that HeadTime… aren't they always going to be the same time? So isn't the result of (Now - HeadTime) always going to be 0? And so we return the clock untouched. If somehow this is not true (how, anyone?), pruning will proceed to trim the newest entry first.
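
As a minimal sketch of that timing (assuming vclock:fresh/0, vclock:increment(Actor, Timestamp, Clock), and the {Actor, {Counter, Timestamp}} entry shape; the actor id here is just a placeholder):

StartTime = riak_core_util:moment(),
Clock = vclock:increment(<<"coord">>, StartTime, vclock:fresh()),
%% The entry the coordinator just wrote carries StartTime, so when prune/3 is
%% later called with PruneTime = StartTime, (Now - HeadTime) for it is exactly 0.
{_, {_, StartTime}} = hd(Clock).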

From talking to @bsparrow435 it seems pruning is never really seen on client sites, as they never have vclocks as long as 50, but I still think it is worth fixing this bug, and better documenting the code too.

bsparrow435 commented 10 years ago

I agree with @russelldb. The sort function is not working as it should. Example:

13> V4.
[{<<210,103,72,101,83,39,134,252>>,{1,63573819319}},
 {<<153,84,36,117,83,39,134,197>>,{1,63573819333}},
 {<<238,83,20,227,83,39,134,238>>,{1,63573819372}}]
14> SortV1 = lists:sort(fun({N1,{_,T1}},{N2,{_,T2}}) -> {T1,N1} < {T2,N2} end, V4).
[{<<210,103,72,101,83,39,134,252>>,{1,63573819319}},
 {<<153,84,36,117,83,39,134,197>>,{1,63573819333}},
 {<<238,83,20,227,83,39,134,238>>,{1,63573819372}}]
15> SortV2 = lists:sort(fun({N1,{_,T1}},{N2,{_,T2}}) -> {T1,N1} > {T2,N2} end, V4).
[{<<238,83,20,227,83,39,134,238>>,{1,63573819372}},
 {<<153,84,36,117,83,39,134,197>>,{1,63573819333}},
 {<<210,103,72,101,83,39,134,252>>,{1,63573819319}}]

The first sort function's result has the newest entry first, which will result in the behavior @russelldb described. The function needs to be reversed so that it returns the oldest entry as the head of the list.

jrwest commented 10 years ago

fwiw, I remember reading through the code about a week ago (after seeing this issue) and drawing the same conclusions. We should either fix it or remove pruning altogether if we aren't even regularly hitting the default threshold (IIRC pruning was meant to address bloat with client-side vclocks).

russelldb commented 10 years ago

Plan is

  1. Add stats counter so we can see if we ever prune IRL
  2. Fix code to be correct (as per docs)

But none of it is high priority (i.e. not a 2.0 blocker), since it may never actually run IRL.

Sound fair?

rzezeski commented 10 years ago

DISCLAIMER: I'm basing the following comments off vague memories from over 2 years ago.

EDIT: I meant to say that client-side ids caused actor explosion, and thus we needed pruning to cull the actor list. This pruning could thus cause superfluous siblings when combined with things like MDC. Also see my next comment.

Pruning probably doesn't matter as much as it did back in the 0.14.x days. Client-side ids would cause massive sibling explosion for some use cases and thus we needed some way to cull the list. But I don't need to tell either of you that :). I made a change to set the small/large size to be the same (50 IIRC) because pruning can cause bugs with MDC. Something like tombstone resurrection and I think perhaps data loss. This goes back to the days when Ian Plosker, who found the bug, was a DA! My change was an attempt to avoid pruning as much as possible without actually removing the feature because we still had rare cases of sibling explosion (e.g. the internal tech talk where the DVV creators showed us sibling explosion even with vnode vclocks). AFAIK, pruning, even if implemented correctly, can still cause issues with MDC. So if there are ways to completely avoid it that would be great. Anyways, just wanted to add some vague historical context in case it helps.

russelldb commented 10 years ago

Interesting. I wasn't aware that pruning and sibling explosion were related. Pruning would actually make siblings more likely. I thought pruning came from actor explosion from client side ids. Brian, above, suggested that it rarely if ever occurs now that we have vnode ids. Some CSEs, in some situations, have set young_vclock to -1 to ensure no pruning ever. I imagine this is for some MDC issue, and I'll do some more research on that.

Thanks very much for adding some context. I think this is why adding stats around pruning is essential, so CSEs/Customers can be aware if it even happens.

rzezeski commented 10 years ago

I wasn't aware that pruning and sibling explosion were related. Pruning would actually make siblings more likely. I thought pruning came from actor explosion from client side ids.

Crap, I meant actor explosion, not sibling, sigh.

Here is my small_vclock commit. You can tell how old it is by the fact that I reference bugzilla, haha. In that commit I state that this is to prevent superfluous siblings in MDC, which is true, but I also could swear that Ian found even more insidious behavior than that in some cases. But, I'm working with very fuzzy memory here. Happy hunting :).

https://github.com/basho/riak_kv/commit/cdd6448557e74248d1f3ad8c146de30400e131a6

russelldb commented 10 years ago

heh, thanks again @rzezeski. We're hiring, you know :trollface:

bsparrow435 commented 10 years ago

This hit a CS customer over the past weekend and resulted in handoff creating siblings in an infinite loop. First time I've seen it, but not surprising with CS, as the manifest objects are often updated simultaneously and the customer was restarting nodes across the cluster repeatedly during a leave operation.

krestenkrab commented 10 years ago

Is pruning really needed at all since Riak dropped client-side ids? Or is this only a recent change? Can't remember exactly.

jonmeredith commented 10 years ago

It is code from the age of client-side ids, but to hit it, I suspect they had at least 50 actor ids in there, so something is going on.

krestenkrab commented 10 years ago

My experience is that once you hit pruning, the system is likely to go berserk in siblings; most likely because you end up activating pruning on a write-write conflict.

So to avoid the problem, we'd have to mark VCs as pruned - with a "prune stone" (some magic actor id), or recognise them as such because of their length - and in that case do a special filtering on both VCs before comparison, removing entries that are in the (wallclock-wise) old end of the VCs and that do not occur in both VCs being compared - or some such.

Without such special-case (heuristic / scientifically incorrect) recognition and logic for pruned VCs, you'd always end up with conflicting writes on all following PUTs. If the given object is a particularly write-contended object, then you drive head-first right into an explosion.
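
A hypothetical sketch of that "filter before compare" idea (nothing like this exists in riak_core today; OldTs is a wall-clock cutoff in gregorian seconds below which an entry counts as old, and vclock:descends/2 is the existing comparison):

prune_aware_descends(ClockA, ClockB, OldTs) ->
    %% Keep an entry if it is recent, or if the same actor also appears in the
    %% other clock; drop entries that are both old and one-sided.
    Keep = fun(Clock, Other) ->
               lists:filter(fun({Actor, {_, Ts}}) ->
                                Ts >= OldTs orelse lists:keymember(Actor, 1, Other)
                            end, Clock)
           end,
    vclock:descends(Keep(ClockA, ClockB), Keep(ClockB, ClockA)).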

russelldb commented 10 years ago

@krestenkrab this is your experience even with DVV? I thought with DVV we had a handle on sibling explosion.

I think I see what you are advocating: having a pruned vclock prune other clocks (like an epidemic) before it compares with them… but how do you unset the "prune stone" (nice term!)? How do you know when you've seen all vclocks for key X and they've all been pruned? Or do you just prune clocks to the same length all the time? Whose length? The longest? The shortest? Some heuristic (max 25?) Oh, I think also: never prune an entry for which you have a dotted value.

If the pruning algorithm wasn't broken (i.e. actually removed the oldest, not the newest, entry), and DVV was in use, I don't think you'd see the issues you describe. But that is just my intuition. I also think we can leverage more than timestamps: with the new key epoch stuff coming as fixes for kv679, we could maybe prune old actor epochs.

I think a customer issue does bump this up in priority though.

bsparrow435 commented 10 years ago

Vclock from one of the objects with 50 entries:

[{<<3,70,245,231>>,{3,63548129411}},
{<<7,252,201,94>>,{3,63548129481}},
{<<3,141,48,21>>,{6,63548129489}},
{<<3,93,161,119>>,{3,63548129536}},
{<<4,161,224,140>>,{3,63548129543}},
{<<4,145,11,132>>,{3,63548129546}},
{<<5,227,241,143>>,{9,63548129551}},
{<<0,157,249,56>>,{3,63548129559}},
{<<6,184,33,39>>,{3,63548129572}},
{<<4,252,150,54>>,{3,63548129626}},
{<<3,87,106,23>>,{3,63548130844}},
{<<5,240,111,19>>,{3,63548131603}},
{<<193,92,245,25,82,82,166,161>>,{2,63548367525}},
{<<193,92,245,25,82,82,166,169>>,{7,63548367629}},
{<<193,92,245,25,82,83,3,242>>,{1,63548391429}},
{<<193,92,245,25,82,83,4,166>>,{21,63548391636}},
{<<198,49,49,0,82,78,219,179>>,{458816,63551014762}},
{<<47,82,148,53,82,31,146,82>>,{459611,63551014770}},
{<<177,54,1,150,82,31,146,155>>,{461865,63551273842}},
{<<193,92,245,25,82,125,49,139>>,{2934,63551275950}},
{<<157,118,150,76,82,124,251,213>>,{3028,63551449793}},
{<<198,49,49,0,82,129,54,128>>,{1470,63551449806}},
{<<193,92,245,25,82,129,58,117>>,{3437,63551452235}},
{<<198,49,49,0,82,130,94,215>>,{1426,63551656231}},
{<<157,118,150,76,82,130,94,234>>,{1393,63551656233}},
{<<193,92,245,25,82,135,242,139>>,{10,63551860460}},
{<<47,82,148,53,82,131,156,131>>,{3276,63551878459}},
{<<198,49,49,0,82,134,53,205>>,{2385,63552060436}},
{<<198,49,49,0,82,139,174,198>>,{7744,63553169809}},
{<<177,54,1,150,82,133,25,99>>,{24900,63553349120}},
{<<157,118,150,76,82,159,114,19>>,{7971,63554040182}},
{<<177,54,1,150,82,169,234,66>>,{5585,63554482380}},
{<<193,92,245,25,82,138,137,38>>,{36003,63556196399}},
{<<198,49,49,0,82,158,74,88>>,{27052,63556371129}},
{<<177,54,1,150,82,204,72,15>>,{979,63556371147}},
{<<157,118,150,76,82,176,89,226>>,{25664,63556383264}},
{<<157,118,150,76,82,205,138,112>>,{2103,63556459007}},
{<<47,82,148,53,82,208,86,245>>,{2912,63556721113}},
{<<47,82,148,53,82,212,21,175>>,{1506,63556892140}},
{<<157,118,150,76,82,206,248,25>>,{5768,63557667023}},
{<<177,54,1,150,82,225,31,178>>,{76883,63573801491}},
{<<"sx÷`RÍ¿V">>,{146680,63580358731}},
{<<198,49,49,0,82,205,185,85>>,{104950,63580731492}},
{<<198,49,49,0,84,64,123,57>>,{1,63580731562}},
{<<"sx÷`TI;ð">>,{10,63581304846}},
{<<"sx÷`TI<Á">>,{18,63581305123}},
{<<115,120,247,96,84,73,61,141>>,{8,63581305252}},
{<<88,85,164,163,84,38,38,147>>,{33533,63581310328}},
{<<47,82,148,53,84,64,123,101>>,{16033,63581310331}},
{<<177,54,1,150,84,59,178,208>>,{25079,63581311071}}]

It was pretty printed, so some of it was obfuscated to ASCII, but you can clearly see 50 unique actor ids. ring_size=256

jonmeredith commented 10 years ago

Hmmm, there are some old-school client-side actor ids in there: the 4-byte ones at the start.

Would be interesting to do something like

[{Actor, calendar:gregorian_seconds_to_datetime(When)} || {Actor, {_Count, When}} <- AllThatCrap].

and see the distribution of ages… Is this just an older cluster where ownership has changed enough, or that has had significant enough fallback activity, that there are a large number of actor ids?

bsparrow435 commented 10 years ago

The earliest version of this cluster was 1.4.2. Now on 1.4.10.

bsparrow435 commented 10 years ago

[{<<3,70,245,231>>,{{2013,10,4},{18,10,11}}},
 {<<7,252,201,94>>,{{2013,10,4},{18,11,21}}},
 {<<3,141,48,21>>,{{2013,10,4},{18,11,29}}},
 {<<3,93,161,119>>,{{2013,10,4},{18,12,16}}},
 {<<4,161,224,140>>,{{2013,10,4},{18,12,23}}},
 {<<4,145,11,132>>,{{2013,10,4},{18,12,26}}},
 {<<5,227,241,143>>,{{2013,10,4},{18,12,31}}},
 {<<0,157,249,56>>,{{2013,10,4},{18,12,39}}},
 {<<6,184,33,39>>,{{2013,10,4},{18,12,52}}},
 {<<4,252,150,54>>,{{2013,10,4},{18,13,46}}},
 {<<3,87,106,23>>,{{2013,10,4},{18,34,4}}},
 {<<5,240,111,19>>,{{2013,10,4},{18,46,43}}},
 {<<193,92,245,25,82,82,166,161>>,{{2013,10,7},{12,18,45}}},
 {<<193,92,245,25,82,82,166,169>>,{{2013,10,7},{12,20,29}}},
 {<<193,92,245,25,82,83,3,242>>,{{2013,10,7},{18,57,9}}},
 {<<193,92,245,25,82,83,4,166>>,{{2013,10,7},{19,0,36}}},
 {<<198,49,49,0,82,78,219,179>>,{{2013,11,7},{3,39,22}}},
 {<<47,82,148,53,82,31,146,82>>,{{2013,11,7},{3,39,30}}},
 {<<177,54,1,150,82,31,146,155>>,{{2013,11,10},{3,37,22}}},
 {<<193,92,245,25,82,125,49,139>>,{{2013,11,10},{4,12,30}}},
 {<<157,118,150,76,82,124,251,213>>,{{2013,11,12},{4,29,53}}},
 {<<198,49,49,0,82,129,54,128>>,{{2013,11,12},{4,30,6}}},
 {<<193,92,245,25,82,129,58,117>>,{{2013,11,12},{5,10,35}}},
 {<<198,49,49,0,82,130,94,215>>,{{2013,11,14},{13,50,31}}},
 {<<157,118,150,76,82,130,94,234>>,{{2013,11,14},{13,50,33}}},
 {<<193,92,245,25,82,135,242,139>>,{{2013,11,16},{22,34,20}}},
 {<<47,82,148,53,82,131,156,131>>,{{2013,11,17},{3,34,19}}},
 {<<198,49,49,0,82,134,53,205>>,{{2013,11,19},{6,7,16}}},
 {<<198,49,49,0,82,139,174,198>>,{{2013,12,2},{2,16,49}}},
 {<<177,54,1,150,82,133,25,99>>,{{2013,12,4},{4,5,20}}},
 {<<157,118,150,76,82,159,114,19>>,{{2013,12,12},{4,3,2}}},
 {<<177,54,1,150,82,169,234,66>>,{{2013,12,17},{6,53,0}}},
 {<<193,92,245,25,82,138,137,38>>,{{2014,1,6},{2,59,59}}},
 {<<198,49,49,0,82,158,74,88>>,{{2014,1,8},{3,32,9}}},
 {<<177,54,1,150,82,204,72,15>>,{{2014,1,8},{3,32,27}}},
 {<<157,118,150,76,82,176,89,226>>,{{2014,1,8},{6,54,24}}},
 {<<157,118,150,76,82,205,138,112>>,{{2014,1,9},{3,56,47}}},
 {<<47,82,148,53,82,208,86,245>>,{{2014,1,12},{4,45,13}}},
 {<<47,82,148,53,82,212,21,175>>,{{2014,1,14},{4,15,40}}},
 {<<157,118,150,76,82,206,248,25>>,{{2014,1,23},{3,30,23}}},
 {<<177,54,1,150,82,225,31,178>>,{{2014,7,28},{21,18,11}}},
 {<<"sx÷`RÍ¿V">>,{{2014,10,12},{18,45,31}}},
 {<<198,49,49,0,82,205,185,85>>,{{2014,10,17},{2,18,12}}},
 {<<198,49,49,0,84,64,123,57>>,{{2014,10,17},{2,19,22}}},
 {<<"sx÷`TI;ð">>,{{2014,10,23},{17,34,6}}},
 {<<"sx÷`TI<Á">>,{{2014,10,23},{17,38,43}}},
 {<<115,120,247,96,84,73,61,141>>,{{2014,10,23},{17,40,52}}},
 {<<88,85,164,163,84,38,38,147>>,{{2014,10,23},{19,5,28}}},
 {<<47,82,148,53,84,64,123,101>>,{{2014,10,23},{19,5,31}}},
 {<<177,54,1,150,84,59,178,208>>,{{2014,10,23},{19,17,51}}}]

This cluster has been active since mid-2013. It has gone through a couple of node additions but has been restarted many times as well.

russelldb commented 10 years ago

What's the n-val? That's a lot of actors. Are they on a cloud provider?

bsparrow435 commented 10 years ago

n_val = 3. RiakCS cluster, dedicated hardware in their own datacenter. Current cluster size is 6 nodes.

They have a streaming video use case so they update a small number of objects once every 2 seconds, another small group of objects once every 10 seconds, and then all other objects are written once and then read many times.

This vclock is from one of the 2 second update objects.

kellymclaughlin commented 10 years ago

Then this https://github.com/basho/riak_core/blob/develop/src/vclock.erl#L243. Now, that Now and that HeadTime… aren't they always going to be the same time? So isn't the result of (Now - HeadTime) always going to be 0? And so we return the clock untouched. If somehow this is not true (how, anyone?), pruning will proceed to trim the newest entry first.

I was playing around with a 3-node devrel cluster and a bucket with small_vclock set to 1 and big_vclock set to 2, and I definitely see cases where Now - HeadTime is not zero and is enough to trigger pruning. After that, things are a mess as far as sibling count goes when concurrent writes are happening.
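
For anyone wanting to poke at the same thing from an attached console, shrinking the thresholds on a throwaway bucket should be enough; something like the following, assuming riak_core_bucket:set_bucket/2 (the bucket name is a placeholder):

riak_core_bucket:set_bucket(<<"prune_test">>, [{small_vclock, 1}, {big_vclock, 2}]).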

Basho-JIRA commented 9 years ago

Pretty sure this is a non-issue after all. Maybe some small investigation task is needed to confirm, but I think the original report was a "brain fart" and, in fact, the oldest entries are dropped first (since the oldest timestamps are the smallest).

_[posted via JIRA by Russell Brown]_