bitwalker / swarm

Easy clustering, registration, and distribution of worker processes for Erlang/Elixir
MIT License

Ensure sync occurs upon initial cluster join #64

Closed: bitwalker closed this 6 years ago

bitwalker commented 6 years ago

When a node joins a cluster after initial startup, the tracker will already be in the tracking state, and it needs an opportunity to sync with the cluster when it joins. Previously, this synchronization only happened during startup or during anti-entropy passes. This commit ensures that if the initial join occurs during tracking, it is caught and handled like the cluster_wait -> cluster_join transition, so a sync happens right away.
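
For illustration only (this is not the code from the commit), the intended behaviour can be sketched as a pure function over the tracker state. The module name `JoinSketch`, the function `next_step/3`, and the "list of known peers" argument are all hypothetical; the state names are taken from the description and the tracker logs. What happens on a nodeup when peers are already known is an assumption here.

```elixir
defmodule JoinSketch do
  @moduledoc """
  Hypothetical sketch, not part of Swarm: decides what a tracker should do
  when it sees a `nodeup` event, given its current state and known peers.
  """

  @spec next_step(atom(), [node()], node()) :: atom()

  # Normal startup path: a waiting tracker joins the cluster and syncs.
  def next_step(:cluster_wait, _known_peers, _new_node), do: :cluster_join

  # The case described above: the tracker is already tracking, but this is
  # the first peer it has ever seen, so treat it like the initial join.
  def next_step(:tracking, [], _new_node), do: :cluster_join

  # Assumption: with peers already known, stay in tracking and leave
  # reconciliation to the regular anti-entropy passes.
  def next_step(:tracking, _known_peers, _new_node), do: :tracking
end

# Usage: a lone tracker that sees its first peer should sync right away.
JoinSketch.next_step(:tracking, [], :bb@plasma)
```

The point of the sketch is only that the "first peer while tracking" case gets the same outcome as the cluster_wait case, rather than waiting for the next anti-entropy pass.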

See #62

@slashdotdash Can you review this and give me your thoughts? If you could do some testing, that would help, as I'm pretty swamped right now.

@pragdave If you can and want to, could you pull this branch (cluster_forms_while_tracking) and try to replicate #62? If this fixes the problem, I will merge it and push a new release.

pragdave commented 6 years ago

On it

pragdave commented 6 years ago

It works! But with a wrinkle. The node connection successfully syncs the previously registered name.

But I get an unrecognized cast:

iex(aa@plasma)1> Node.connect :bb@plasma
true
iex(aa@plasma)2>
18:25:29.449 [info]  [swarm on aa@plasma] [tracker:ensure_swarm_started_on_remote_node] nodeup bb@plasma

18:25:29.449 [info]  [swarm on aa@plasma] [tracker:cluster_wait] joining cluster..

18:25:29.449 [info]  [swarm on aa@plasma] [tracker:cluster_wait] found connected nodes: [:bb@plasma]

18:25:29.449 [info]  [swarm on aa@plasma] [tracker:cluster_wait] selected sync node: bb@plasma

18:25:29.470 [info]  [swarm on aa@plasma] [tracker:syncing] there is a tie between syncing nodes, breaking with die roll (13)..

18:25:29.470 [info]  [swarm on aa@plasma] [tracker:syncing] there is a tie between syncing nodes, breaking with die roll (9)..

18:25:29.470 [info]  [swarm on aa@plasma] [tracker:syncing] we won the die roll (9 vs 2), sending registry..

18:25:29.471 [info]  [swarm on aa@plasma] [tracker:awaiting_sync_ack] received sync acknowledgement from bb@plasma

18:25:29.471 [info]  [swarm on aa@plasma] [tracker:resolve_pending_sync_requests] pending sync requests cleared

18:25:29.477 [warn]  [swarm on aa@plasma] [tracker:handle_cast] unrecognized cast: {:sync_end_tiebreaker, #PID<18024.226.0>, 13, 7}

bitwalker commented 6 years ago

Definitely an ignorable message for the moment, since we work with the first die roll which breaks a tie, and the transition out of the syncing state is why that cast is unhandled. However, we probably should be choosing one node or the other deterministically during the tiebreaking process so we only roll once, avoiding that second roll.
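
One way to make the choice deterministic, sketched here under assumptions rather than taken from Swarm itself, is to compare the two node names using Erlang term ordering. Both nodes then reach the same answer independently and no die roll is needed. The module `TiebreakSketch` and the atoms `:send_registry` and `:await_registry` are hypothetical names chosen for the example.

```elixir
defmodule TiebreakSketch do
  @moduledoc """
  Hypothetical helper, not part of Swarm: picks a sync winner without dice.
  """

  # Node names are atoms, and Erlang term ordering defines a total order on
  # atoms, so any two distinct nodes agree on which name sorts first.
  @spec winner(node(), node()) :: node()
  def winner(a, b) when is_atom(a) and is_atom(b), do: min(a, b)

  # The winner sends its registry; the other side waits to receive it.
  @spec role(node(), node()) :: :send_registry | :await_registry
  def role(self_node, peer) do
    if winner(self_node, peer) == self_node do
      :send_registry
    else
      :await_registry
    end
  end
end

# Usage: each node evaluates its own role against the peer it tied with.
TiebreakSketch.role(:aa@plasma, :bb@plasma)
```

Because the result depends only on the two node names, a second tiebreak message has nothing new to decide, which would avoid the extra roll seen in the log above.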

bitwalker commented 6 years ago

I'll merge this for now, and make a note to address that in a follow-up PR.

bitwalker commented 6 years ago

@pragdave When you get a chance, could you send me the log from bb@plasma from the above test you ran? I'd like to trace back where that extra die roll is being triggered.

pragdave commented 6 years ago

[image: Inline image 1]

On Wed, Jan 31, 2018 at 6:50 PM, Paul Schoenfelder <notifications@github.com> wrote:

> @pragdave When you get a chance, could you send me the log from bb@plasma from the above test you ran? I'd like to trace back where that extra die roll is being triggered.

bitwalker commented 6 years ago

@pragdave For some reason the image isn't showing for me :(