basho / riak_dt

Convergent replicated datatypes in Erlang
Apache License 2.0
353 stars 70 forks source link

Fix bugs in Map, Set, and OD Flag #82

Closed russelldb closed 10 years ago

russelldb commented 10 years ago

Dear Reviewer,

Let me know if you want me to break this into separate PRs for Map, Set, Flag.

The common theme here is that the merge functions on all three of these types were broken in a similar way. The Set had implementation bugs around common dots. The Flag and Map were broken by design (though the broken design's impl had the same bugs as the Set.)


The Set changes are all about stopping the following bug (or variations on it):

add x to a
add x to b
rep a to c
rem x from a
rep b to a
rem x from b
merge (b,merge(a,c))

'x' should not be in the set, but the logic used to hold that is 'X' was on both sides of the merge it was in the set. Clearly a value can be on both sides of a merge, but actually be removed, the patch fixes that.


The Flag was broken by design. You need more than a vector to track the current value. The result is a lot like the faulty assumption with the set (if both sides are 'true' then the result is 'true') The test that shows this is incorrect is called disable_test. The re-write of the Flag is pretty self-explanatory: if there are no dots left in the OD Flag's value, it is a false, otherwise it is a true.


The Map problem is rather more complex. As well has having the other bugs above, there is an issue with merging CRDTs during the merge. Looking at the model we used for testing, it only merges CRDTs to a new value on update. It creates a new event that supersedes past events. The problem with our merge function on the Map was that it created this new value, without an event to provoke it. It mixed seen and unseen changes together. Then, when a remove occurred, the values to be removed where bound up with other values and associativity of the merge was broken. Depending on merge order a value could either be in, or out if it got merged in with some data that was concurrent with a remove. The downside to requiring an update event to merge divergent fields, is that you store the field siblings until such an event. So larger maps. I'm yet to quantify exactly how much larger, since the compressed to/from binary seems to take care of all the deduplication. I ran the eqc binary round trip test, hacking it to compare size of the old and new maps, and the new map was usually a little smaller. Go figure. We need to do more testing and benchmarking around this.

I guess I should write a lot more about the new Map design. One thing is for sure, the merge function is a damn sight simpler. And there are still no tombstones.


I wrote new EQCs for the Map and Set. They caught a bunch of bugs that old statem test did not. We need to re-write all the CRDTs to use this smaller-state-space style of test. We also need to factor out the commonality between theses tests. I think at this point just raising issues is enough.

Let me know if you need more information. This is last minute, big change stuff, so tread carefully. But be aware, the Map as it is in develop is unshippable, so we need either this design, or something new.

Good luck

Love

RDB

cmeiklejohn commented 10 years ago

I can help out on this review on Monday.

seancribbs commented 10 years ago

I'm happy to look at it as well, this is critical work.

russelldb commented 10 years ago

Yes, I think all the eyes, please.

seancribbs commented 10 years ago
seancribbs commented 10 years ago

verify_dt_converge hangs on check_3a maps. Can you try it and report back?

russelldb commented 10 years ago

To be fair, this test hangs on develop without my changes, which suggests some other regression / issue. But yeah, let me look into that.

russelldb commented 10 years ago

Actually, cancel that, it only happens with this branch. Oh bum.

seancribbs commented 10 years ago

If the new merge function on map can be factored out a bit for clarity and the riak_test fixed, I'm generally in favor of this change and finished reviewing for now. @cmeiklejohn should probably put his eyeballs on it as well.

russelldb commented 10 years ago

So, the riak test is not going to be fixed without some serious work around operations with contexts. At the same time, what does a riak_kv test have to do with the riak_dt library?

My current view is merge this, get it in the next pre, and advise that folk use include_context=false until we fix that riak_kv issue.

seancribbs commented 10 years ago

Works for me. @cmeiklejohn?

russelldb commented 10 years ago

OK, I dealt with all the review comments, and update the riak_test to use include_context=false for the time being. I also rebased off develop to get Sam's tag work.

Do you want me to a. force push or b. push to a new branch, open a new PR and reference and close one? (cc @cmeiklejohn again, 'cos why not?!)

russelldb commented 10 years ago

See https://github.com/basho/riak_test/pull/550 for riak_test workaround.

seancribbs commented 10 years ago

You can force-push to this branch, if you feel comfortable with it.

russelldb commented 10 years ago

Done.

seancribbs commented 10 years ago

The updated rt is good and works with this branch. You have my :+1: