Implement implicit merge policies

aboodman commented 9 years ago

After #147 is done, datastore will enforce that all writes be serialized. Boo :(.

To regain some concurrency, we should implement simple automatic merge policies for our datastructures. What I'm imagining is something where we take a pair of concurrent mutations of the tree, along with the previous version of the tree, an consider the three of them recursively, applying some simple rules like:

If any of the three nodes are different types, conflict:
If we are dealing with a set:
- If the same object is both removed and inserted wrt parent: conflict
If we are dealing with a map:
- If the same key is both removed and inserted wrt parent: conflict
- If the same key is inserted wrt parent, but with different values: conflict
If we are dealing with a struct:
- same as map
If we are dealing with a list:
- Same as map but substitute "index" for "key"
- Note that this means that things like "slice" expensive in terms of concurrency

... otherwise, the concurrent modification is allowed.

Another way to look at this (I think) is that the operations we will automatically merge effectively make our datastructures into CRDTs (https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type).

cmasone-attic commented 9 years ago

It seems like mostly what people will do for a list is append, and simultaneous appends would fail according to these rules. Problem is I can see arguments for a lot of different policies...maybe it's fine to just append everything if different values are added wrt parent, but what if the same value is appended by both mutations?

aboodman commented 9 years ago

Yeah I think list is the hardest one.

@rafael-atticlabs said he did some magical stuff that was better than this policy when he was working on a similar thing for Blink. Maybe he can say more about that. I could sort of see extending the policy for the special-case of append to make multiple concurrent appends OK. But I bet raf's thing did that and was more general at same time.

jaseg commented 8 years ago

Since you already had a look at CRDTs: Shapiro et al. wrote down a number of CRDTs with their merge strategies in their "Comprehensive Study of Convergent and Commutative Replicated Data Types". This might contain some inspiration.

I think generally it would make sense to allow the user to specify a merge strategy for a node when creating said node. In some cases, one might want a conflict to be raised on concurrent modification, in other cases e.g. LWW using the committing device's local wall-clock time would be appropriate. LWW would e.g. be a sensible choice when synchronizing multiple devices that all belong to the same user, since it can be assumed that this user is a) somewhat aware of what modifications she did on each device and b) only using one device at a time.

AFAICT adopting this kind of automated merge policy would make noms data structures into state-based CRDTs, with the difference that noms always stores all history. AFAICT there would be two things that a tighter integration of CRDT semantics could bring:

Depending on the data type one might be able to use knowledge about the merge procedure to reduce the amount of data transferred during sync
The automated merge procedure might be used to compute efficient deltas from full states in order to reduce history storage size (effectively transforming historical data from a state-based "CvRDT" to a operation-based "CmRDT").

aboodman commented 8 years ago

@jaseg, thanks for chiming in. I agree with both your points.

I was thinking originally that CRDT research could somehow help us merge set and map mutations better. But I'm not seeing it.

But maybe it could help with list mutations? I want to re-read the section about sequences in that paper.

jaseg commented 8 years ago

What using the CRDTs specified in that paper gives you is at least some correctness guarantee. Conflicts (concurrent modification, concurrent add) must be resolved by some application-specific means, CRDTs cannot help there. In the end you need to consider the specific use case and allow for some application-level merge policy. I thing last-write-wins will be useful in many cases.

As for sequence types, there is the "Replicated Growable Array" specified in section 3.5.1 (p. 35ff) of the "Comprehensive Study". Their model e.g. solves the non-obvious problem that when synchronizing a list across multiple hosts, an integer is not sufficient to describe an element's position in the list.

cmasone-attic commented 8 years ago

After reading some survey papers about CRDTs, I think the semantics of these data structures are generally unfamiliar to most users; e.g. a Set in which elements can only be added, removed, and then never added again (barring some synchronization-requiring “garbage collection” step). We want to provide traditional data structures that programmers are familiar with, even if that means conflict resolution must be done manually in some cases. We think we can provide a few options of conflict resolution policies for each core data structure kind in noms that should meet most needs.

That said, we're still interested in perhaps using some form of sequence crdt for merging lists -- this is still under investigation. The Replicated Growable Array data structure you mention relies on timestamps, which means that it's not robust in the face of clock skew. Without having something like GPS on all nodes, getting reliable timestamps in a distributed system is another whole hard problem to tackle :-) We're still looking at the "continuum"-based sequences in that paper, though.

aboodman commented 8 years ago

Here is a new strawman for list merge. This thinks of the edits people are making to lists as semantically splices, not edits.

This isn't well-specified, but I hope the general idea makes sense.

Given three commits, X, Y, and P (shared parent):

Decide which commit to apply first. We arbitrarily choose the one with the smaller hash. The only thing that matters is that it is deterministic. Call this one C1 and the other one C2.
Derive splices that represent the change from P->C1 and P->C2. Call these S1 and S2.
- Semantically, think of these splices as "delete {num-elements} and insert {new-elements} before element {reference-point}"
- There would have to be a special sentinel reference-point for "end-of-list"
Apply all of S1 to P
Update the reference-points in all of S2 to reflect the changes S1 made, so that they still refer to the correct elements
- e.g., if S1 contains "insert {foo, bar} at 3" and S2 contains "insert {baz} at 4", then S2 would be updated to contain "insert {baz} at 6", because the element at 4 in P is at 6 after S1 is applied.
- Note that this might result in empty splices in the case where, e.g., both commits removed the same elements. That's fine.
- If any of the insertions in S2 have reference points which were removed by S1, conflict.
Apply all the splices in S2.

Note that the way I describe it here is by applying all of S1, then S2, but in real life with large lists, you'd want to do these in parallel. I haven't worked out the algorithm to do that.

Also, for the record, I think you could extend this to be a true CRDT by picking an arbitrary resolution in the case of conflict (e.g., adjusting the position where elements are inserted), but I'm not sure that is desirable.

ghost commented 8 years ago

Right. @aboodman's model of the world matches mine pretty well. I would propose approaching the basic merge algorithm like this:

Given Conflicting Commits, X & Y, let P be the most recent common ancestor Commit.
Let Lx and Ly be respective lists within X and Y that require merging relative to a common ancestor list Lp
Let Sx & Sy be lists of splices which result from processing the list diff from Lp -> Lx and Lp -> Ly, respectively.
Let Lm be Lp
While Sx and Sy are both non-empty
- Let s1 be the first splice in Sx
- Let s2 be the first splice in Sy
- If Overlap(s1, s2)
- If CanMerge(s1, d2)
  - Let s3 be result of Merge(s1, s2)
  - Apply s3 to Lm and adjust remaining splices in Lx and Ly
- Else, Conflict
- Else
- Apply whichever of s1 and s2 has a lower splice index to Lm and adjust remaining splices in Lx and Ly.
Apply remaining splices in either Sx and Sy to Lm

The basic idea here is that list diff will produce an ordered set of splices which transform one list into another. We take two such diffs which transform one initial state into two different final states and process a kind of ordered merge of the splices. By processing the two "front" splices of each diff we can kind of define the problem recursively and reason about merge policies with examples of simple cases of 1 vs 1 splices.

The merge policy we implement will be a function of defining Overlap, CanMerge and Merge.

Note that "adjusting remaining splices" is always the same. The basic idea is that all remaining splice indexes are offset by the net change in length which resulted from applying a splice. For example. If we apply a splice which removed 2 elements and added 3, we would add 1 to the index position of splices remaining to be processed.

Also note that the implementation of this would probably want to process over all the splices to see if any conflicts will arise and only then start producing the merged list, knowing that no conflicts will arise.

ghost commented 8 years ago

Now for the merge policy. I think there is some room for debate here. I propose the following two cases which I hope are uncontreversial

Case 1: 
s: a b c d e // start list
l: a 1 c d e [1, 1, 1] // "left" change, [SpAt, SpRemoved, SpAdded, SpFrom] (see edit_distance.go)
r: a b c 2 e [3, 1, 2] // "right" change
m: a 1 c 3 e // merged list

I.e. I think this is the simplest example of a merge which should probably be allowed in any merge policy

Case 2: 
s: a b c
l: a b 1 c [2, 0, 1]
r: a b 2 c [2, 0, 2]
=> conflict // is the result [a b 1 2 c] or [a b 2 1 c]?

i.e. I think this is the simplest example of a merge conflict. The only way to make the implied "operations" commutative would be to use "external" information, such as commit time (i.e. last write wins) -- which we probably don't want to do with our default merge policy because it should be attempting to merge operations which aren't logical conflicts.

From here it gets more fuzzy. I'm tempted to feel like we should take one of two basic paths (maybe eventually implementing both and allowing for switchable behavior). One path is conservative and the other is liberal.

In order to reason about the policies, here are some cases to consider:

Case 3 (conflicting appends)
a b c
a b c 1 [3, 0, 1]
a b c 2 [3, 0, 2]

Case 4 (delete overlapping, different index)
a b c d
a b c  [3, 1]
a b    [2, 2]

Case 5 (delete same, insert prefix)
a b c
a b 1   [2, 1, 1]
a b 1 2 [2, 1, 1, 2]

Case 6:
a b c d
a b 1 2 [2, 2, 1, 2]
a b c 1 [3, 1, 1]

Case 7 (delete same, no inserts)
a b c
a c [1, 1]
a c [1, 1]

Case 8 (delete same, identical inserts)
a b c
a 1 2 c [1, 1, 1, 2]
a 1 2 c [1, 1, 1, 2]

Case 9
a b c
a 1 2 c [1, 1, 1, 2]
a b 3   [2, 1, 3]

Case 10
a b c
a c     [1, 1]
a 1 b c [1, 0, 1]

ghost commented 8 years ago

@cmasone-attic, @aboodman Why don't you think about these cases some and we can discuss more in person on Thursday.

aboodman commented 8 years ago

Case 2: 
s: a b c
l: a b 1 c [2, 0, 1, 2]
r: a b 2 c [2, 0, 1, 2]
=> conflict // is the result [a b 1 2 c] or [a b 2 1 c]?
i.e. I think this is the simplest example of a merge conflict. The only way to make the implied "operations" commutative would be to use "external" information, such as commit time (i.e. last write wins) -- which we probably don't want to do with our default merge policy because it should be attempting to merge operations which aren't logical conflicts.

I think this should not be a conflict, and the correct answer is: [a b 1 2 c].

The reason why is because:

> l = new noms.List(['a', 'b', 1, 'c'])
> r = new noms.List(['a', 'b', 2, 'c'])
> l.hash.toString()
'2k1hc5s5q0he6uir0a9q4d5817cccb7n'
> r.hash.toString()
'e8v4btj0rb0do1thvi3alh82rpkq1r4v'
> l.hash.toString() < r.hash.toString()
true

... the external information is the hash of the list. Because Noms is awesome, we can calculate the hash efficiently no matter the size of the data. We choose the lower hash arbitrarily.

Case 3 (conflicting appends)
a b c
a b c 1 [3, 0, 1, 3]
a b c 2 [3, 0, 1, 3]

Correct answer is: [a b c 2 1] because new noms.List(['a', 'b', 'c', 1]).hash.toString() < new noms.List(['a', 'b', 'c', 2]) => false.

Case 4:

Let us say l has the lower hash. Splice of r should be adjusted to [2, 1, 0, 0]. Otherwise if r has the lower hash, then splice of l is adjusted to [2, 0, 0, 0] (no-op).

Case 5:

[a b c 1 1 2] :-(

ghost commented 8 years ago

I've updated the comment above to use the more familiar JS syntax for splices (index, removeCount, ...insertItems). I was using the struct version that noms uses internally, but the js syntax is probably easier to discuss.

ghost commented 8 years ago

WRT Case 2 (& Case 3): I see that your proposed semantics allows these cases to be commutative, but it doesn't seem useful to me. In particular it seems to violate the basic semantic of being a list which is that ordering is significant. Picking which element comes first based on the hash of the final states of the lists makes it impossible to reason about how various cases will resolve.

I can see that if you're thinking about importing CSVs into lists and two inserts of new lines at the same position, it's tempting to allow both, but this seems like Set semantics to me, not List.

Or looking at it another way, I think we should probably not risk incorrectly merging two versions of a list whose order is managed by an explicit precedence function. I.e.

s: [{ name: 'bob', age: 20}, {name: 'alice', age: 30}]
l: [{ name: 'bob', age: 20}, {name: 'sam', age: 29}, {name: 'alice', age: 30}]
r: [{ name: 'bob', age: 20}, {name: 'jill', age: 22}, {name: 'alice', age: 30}]

cmasone-attic commented 8 years ago

I'm a strong believer in the principle of least-surprise, which makes me agree with Raf about using the hashes of l and r to decide whose splices take precedent in a merge. As a user of Noms, I'd feel like some opaque thing was foisting unpredictable merge results on me.

I wonder if we should keep the policies simple and, like @arv suggested, provide a resolution callback so that developers can use whatever additional information they want to resolve conflicts.

FWIW, case 7 doesn't seem to me like it needs to be a conflict, but all the others kinda do. Depending on what you're using the list for, I think you'd want to make different merge decisions. Consider @rafael-atticlabs' example above with csvs and Set vs List semantics...I think that there are definitely times when I'd want de-duping like a Set provides, but still be able to enumerate elements in the order they were added, so I'd choose to use a List. Some kind of distributed event-monitoring system, for example.

aboodman commented 8 years ago

I see your points, and I'm fine with starting simpler.

That said I think that for Raf's example of people sorted by age, it would be better to use a map since then noms can do all the merges automatically.

What my proposal basically models is a partially ordered sequence. Something like a distributed event log seems like it would be a good match for that (all we know is that two events happened at "the same" time, not which one really happened first). But I don't have a strong use case for this, so happy to start simple and dumb (I actually don't have many use cases for lists, so even happier to start simple and dumb).

On Thu, Aug 25, 2016 at 9:17 AM, cmasone-attic notifications@github.com wrote:

I'm a strong believer in the principle of least-surprise, which makes me agree with Raf about using the hashes of l and r to decide whose splices take precedent in a merge. As a user of Noms, I'd feel like some opaque thing was foisting unpredictable merge results on me.

I wonder if we should keep the policies simple and, like @arv https://github.com/arv suggested, provide a resolution callback so that developers can use whatever additional information they want to resolve conflicts.

FWIW, case 7 doesn't seem to me like it needs to be a conflict, but all the others kinda do. Depending on what you're using the list for, I think you'd want to make different merge decisions. Consider @rafael-atticlabs https://github.com/rafael-atticlabs' example above with csvs and Set vs List semantics...I think that there are definitely times when I'd want de-duping like a Set provides, but still be able to enumerate elements in the order they were added, so I'd choose to use a List. Some kind of distributed event-monitoring system, for example.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/attic-labs/noms/issues/148#issuecomment-242448058, or mute the thread https://github.com/notifications/unsubscribe-auth/AAE6BBUUHcoatCXwQmP9m0bzPTE02jLDks5qjcAQgaJpZM4FiTyN .

cmasone-attic commented 8 years ago

Discussed at length with @rafael-atticlabs today, and we think we can have a simple algorithm that covers a lot of cases using the framework laid out above. In prose, the plan is to declare a conflict for any splices that overlap, EXCEPT those that are precisely identical -- that is, they start at exactly the same index and add/remove exactly the same items. For non-overlapping splices, apply the one that starts earlier and move on.

let Overlap(s1, s2) be

Overlap(s1, s2) bool {
  earlier, later := s1, s2
  if s2.index < s1.index {
    earlier, later = s2, s1
  }
  return s1.index == s2.index || earlier.index + earlier.removeCount > later.index
}

let CanMerge(s1, s2) be

CanMerge(s1, s2) bool {
  return s1 == s2
}

let Merge(s1, s2) be

Merge(s1, s2) splice {
  return s1
}

ghost commented 8 years ago

To enumerate:

Case 1: a 1 c 3 e
Case 2: => conflict
Case 3: => conflict
Case 4: => conflict
Case 5: => conflict
Case 6: => conflict
Case 7: a c
Case 8: a 1 2 c
Case 9: a 1 2 3
Case 10: => conflict

ghost commented 8 years ago

We can start with this and iterate on including the more exotic cases that kind of "look" mergeable as cases arise.

cmasone-attic commented 8 years ago

Remaining stuff to do before declaring this closed:

[x] Go: Auto-merge concurrent changes to different datasets #2524
[x] JS: Auto-merge concurrent changes to different datasets #2524
[x] Add merge options to Dataset.CommitOptions #2534
[x] Move noms-merge code into noms commit tool #2535
[x] Document final merge semantics somewhere
[x] Remove hacky trivial merge we used to have #2720

aboodman commented 8 years ago

I realize this has bounced around a bit, but I think we should finish it. @cmasone-attic can you find a time sometime soon to close it out?

cmasone-attic commented 8 years ago

oh...right. Now I remember. We haven't implemented ANY merge logic in Javascript. And the merge algorithm is built on Diff...which we also haven't implemented in JS :-)

So...I can implement issue #2534 for Go, and then go finish #2535 no problem. But JS...that's a big chunk of work

aboodman commented 8 years ago

I think it is fine to call this done without auto-merge in js.

ghost commented 8 years ago

Do we have a bug open for implementing all of this in JS? If not, please open one and mark it "helpwanted"

On Wed, Oct 26, 2016 at 11:13 AM, Aaron Boodman notifications@github.com wrote:

I think it is fine to call this done without auto-merge in js.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/attic-labs/noms/issues/148#issuecomment-256432305, or mute the thread https://github.com/notifications/unsubscribe-auth/AMQOHUsR-_LCpkHqlIgASCqjGdZs3kETks5q35hjgaJpZM4FiTyN .

cmasone-attic commented 8 years ago

Issues filed (#2764 & #2765)

cmasone-attic commented 8 years ago

'shbouya

augustoteixeira commented 7 years ago

This issue is stated as "not implemented at all" in the FAQ: https://github.com/attic-labs/noms/blob/master/doc/faq.md

What is the current state of things?

aboodman commented 7 years ago

We need to update the FAQ and the homepage and put together a demo, but it's landed now.

attic-labs / noms

Implement implicit merge policies #148