automerge / automerge-classic

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.
http://automerge.org/
MIT License
14.75k stars 466 forks source link

Integration tests for sync protocol #352

Closed HerbCaudill closed 4 months ago

HerbCaudill commented 3 years ago

In order to understand the sync protocol I made some more tests, using the protocol as an application might, rather than exercising its internals.

All of these tests pass except for the last one, which involves three peers, each making concurrent changes while disconnected. I get into a loop in which two of the peers keep sending each other their version without converging. (see https://github.com/automerge/automerge/pull/339#issuecomment-832681772 ) I'm not sure if the problem is with my helper code or with the sync protocol itself.

HerbCaudill commented 3 years ago

At Daniel Weaver's suggestion I've removed the try/catches that swallow Automerge's "outdated doc" error, on the hunch that they are symptoms of the issue I'm having; so now a couple more of the 3-peer tests fail.

ept commented 3 years ago

I have started working through this, but I must admit I find it quite hard to follow what is going on with all the EventEmitters, as they result in very indirect control flow. I wonder if it might be easier to understand the code if one peer sending a message just directly calls a function on the peer that is receiving.

pvh commented 3 years ago

I should note that I have had >2 peers sync in my demo code, though I haven't checked it in a while.

On Wed, May 5, 2021 at 3:34 PM Speros Kokenes @.***> wrote:

@ept https://github.com/ept playing devil's advocate here, I agree async code can be confusing but it does help the community to see examples of how they would actually use the syncing protocol. The synchronous tests to date don't actually reflect how anyone would use the library and make it a little confusing to extrapolate from.

That being said, maybe that doesn't matter for a simple test where you are just trying to confirm that 3 peers can sync

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/automerge/automerge/pull/352#issuecomment-833093095, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAAWQALHXSLWBZCOAGTYKTTMHBYFANCNFSM44FIWS2A .

-- Peter van Hardenberg "Everything was beautiful, and nothing hurt."—Kurt Vonnegut

dan-weaver commented 3 years ago

Not 100% sure what the issue is but I moved some stuff around in https://github.com/automerge/automerge/pull/353 but kept the tests the same and they pass.. Sorry linter is failing. Feel free to take anything that’s useful and close it

I think it has to do with this bit here: https://github.com/automerge/automerge/pull/352/files#diff-7095874604758fffdc7bf502d9864c58e56a5a49730e50e6bcafa595409db0e7R182-R191

Following that code path it generates a new message instead of receiving it first?

ept commented 3 years ago

I had a go at an alternative implementation, which seems to work fine with three peers all connected to each other: https://gist.github.com/ept/fe4f772bcc9294c953728b2407f46c77

My version does not deliver messages immediately, but rather enqueues them (there is a separate queue for every pair of peers), and then delivers one message per time step. At the moment it just picks the first message in the first nonempty queue to deliver, but you could also imagine a different scheduler (e.g. round robin or randomised). I think this is a better simulation of how a real network behaves. In your implementation you get deep call stacks as A sends a message to B and then B sends a message back to A and then A sends another message to B... all synchronously. That makes it really hard to reason about what is happening when.

HerbCaudill commented 3 years ago

Thanks everyone. I've incorporated @dan-weaver 's structure from #353, which simplifies things a lot by having just one document that gets updated, rather than separately tracking one document per connection.

Martin @ept is also right that having things run synchronously doesn't match reality and is hard to reason about. I've added a 1-ms timeout before the channel emits the message, which yields so other things can happen before anyone reacts to the message. With that change everything works with as many as 5 peers.

Since this design doesn't give me a way to tell when all peers have converged, I put in a hack that just pauses for a sufficient (on my box!) delay before checking for convergence. I'll adapt Martin's queue design to do that correctly - just wanted to call off the hounds in case anyone was still looking into this (@pvh plz go back offline! 😀)

HerbCaudill commented 3 years ago

OK I've adapted @ept's Network and Peer helper classes and everything works now without any hacks. I've tested this with up to 7 peers connected in various configurations.

echarles commented 3 years ago

Does this PR complement https://github.com/automerge/automerge/pull/353? I am planning to use the integration tests to better understand what the sync protocol is really.

If this PR has other coverage than https://github.com/automerge/automerge/pull/353 it would be good to get it merged.

HerbCaudill commented 3 years ago

353 was just @dan-weaver showing me how to get my tests working — he closed his PR after I incorporated his changes into this one.

I wrote this just to understand the sync protocol myself, but I do think it would be useful to include something like this in the codebase, as it exercises the protocol at a higher level than the existing tests. I also think it's useful as documentation, as a working example of the sync process.