bitwalker / swarm

Easy clustering, registration, and distribution of worker processes for Erlang/Elixir
MIT License

Deadlock on simultaneous nodeup #91

Open kzemek opened 6 years ago

kzemek commented 6 years ago

I'm having an issue similar to #60, which I can reproduce very often when I bring up containers running the app at roughly the same time. It looks like each node is waiting for another one, and they're perpetually stuck in the :syncing state. Here are the :sys.get_status(Swarm.Tracker) results from my 5 nodes: https://pastebin.com/EYLg6YNE . No custom options are set, everything is at its defaults; clustering is done with the libcluster gossip strategy.
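For reference, a stock libcluster gossip setup with no custom options is roughly the following sketch; the topology key, app module, and supervisor name here are illustrative, not taken from the reproduction repo:

```elixir
# config/config.exs -- minimal libcluster Gossip topology, all strategy
# options left at their defaults (names below are illustrative).
config :libcluster,
  topologies: [
    gossip: [strategy: Cluster.Strategy.Gossip]
  ]

# lib/repro_app/application.ex, inside start/2 -- run the cluster supervisor
# with that topology so nodes discover each other over UDP multicast.
children = [
  {Cluster.Supervisor,
   [Application.get_env(:libcluster, :topologies), [name: ReproApp.ClusterSupervisor]]}
]
```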

kzemek commented 6 years ago

Please see https://github.com/kzemek/swarm-deadlock-repro for reliable reproduction of the issue.

kzemek commented 6 years ago

These are the logs produced with debug: true: https://gist.github.com/kzemek/f5067111dcc5b1c40803b12966cd1a1f#file-gistfile1-txt There are no more debug logs after that point.

kzemek commented 6 years ago

I've also tried manipulating the choice of sync node in hopes that it would solve the lock: https://github.com/kzemek/swarm/commit/28516d93413fa41a54281ee0c3bb0f7a92a4058e

But instead, the states of the Swarm.Tracker processes got stranger: https://gist.github.com/kzemek/f5067111dcc5b1c40803b12966cd1a1f#file-nodes_sync_to_smallest-txt

All nodes tried to sync to repro_2 (the "smallest" node), except repro_2 itself, which synced to repro_3. repro_3 synced successfully and was put into the :tracking state, while at the same time repro_2 was put into :awaiting_sync_ack and sent the cast {sync_recv,<16250.182.0>,{{0,1},0},[]} to repro_3. But the sync_recv cast is not handled in the :tracking state, so repro_2 got stuck, and so did the other nodes that tried to sync to it.

kzemek commented 6 years ago

This particular issue does not occur when reverting to commit c305633 (before https://github.com/bitwalker/swarm/commit/412bad990c69748dac300bd69e6a26b988e71b0); the nodes all go into the :tracking state almost instantly.

joxford531 commented 6 years ago

Seeing this issue as well. When I revert to version 3.1 I don't see any problems with deadlocking on startup.

malmovich commented 5 years ago

We've been having this issue as well, and I'm pretty sure we also had it in 3.3.1.

In our case we observed the following scenario. Let's say we have nodes A, B, and C, and the following happens:

- A - :sync -> B
- B - :sync -> C
- C - :sync -> A

All three nodes are now stuck in the :syncing state, each waiting for a :sync_recv message that never arrives.
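A toy sketch of that wait cycle, using plain processes rather than Swarm's tracker (the message names are hypothetical): each process sends :sync to its neighbour and then only waits for :sync_recv, so the neighbour's :sync sits unanswered in its mailbox and the ring never makes progress.

```elixir
defmodule SyncCycleSketch do
  @moduledoc """
  Toy illustration of the A -> B -> C -> A wait cycle; not Swarm code.
  Each process sends :sync to its neighbour, then blocks waiting for a
  :sync_recv that nobody ever sends, so all three report :deadlocked.
  """

  def demo do
    parent = self()

    pids =
      for name <- [:a, :b, :c] do
        spawn(fn ->
          receive do
            {:neighbour, neighbour} ->
              send(neighbour, {:sync, self()})

              # Like a tracker stuck in :syncing, we only wait for :sync_recv
              # here; the neighbour's :sync stays unhandled in our mailbox.
              receive do
                {:sync_recv, _from} -> send(parent, {:synced, name})
              after
                2_000 -> send(parent, {:deadlocked, name})
              end
          end
        end)
      end

    # Wire the processes into a ring: a -> b -> c -> a.
    [a, b, c] = pids
    send(a, {:neighbour, b})
    send(b, {:neighbour, c})
    send(c, {:neighbour, a})

    for _ <- 1..3 do
      receive do
        msg -> IO.inspect(msg)
      end
    end
  end
end
```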

So far we have resolved this with a state timeout in :syncing, which stops the syncing and tries another node. It seems to work fine; however, this approach brought a few complications and made the code a bit more complex. A simpler approach could be to drop the pending_sync_request strategy and just decline the sync request while syncing.
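A rough sketch of that state-timeout idea, written against a bare :gen_statem rather than Swarm's actual Swarm.Tracker module (the module name, messages, registered name :tracker, and the 5-second timeout are all assumptions): arm a state timeout whenever a sync target is picked, and on timeout abandon that target and move on to the next candidate.

```elixir
defmodule SyncTimeoutSketch do
  # Illustrative only; not Swarm's Tracker. Message shapes, names, and the
  # 5s timeout are assumptions used to show the state-timeout mechanism.
  @behaviour :gen_statem

  @sync_timeout 5_000

  def start_link(candidates), do: :gen_statem.start_link(__MODULE__, candidates, [])

  def callback_mode, do: :state_functions

  def init(candidates) do
    {:ok, :syncing, %{candidates: candidates},
     [{:next_event, :internal, :pick_node}]}
  end

  # Pick the next candidate and (re)arm the state timeout for it.
  def syncing(:internal, :pick_node, %{candidates: [node | rest]} = data) do
    send({:tracker, node}, {:sync, self()})
    {:keep_state, %{data | candidates: rest},
     [{:state_timeout, @sync_timeout, :sync_stalled}]}
  end

  # No candidates left: give up on syncing and start tracking on our own.
  def syncing(:internal, :pick_node, %{candidates: []} = data) do
    {:next_state, :tracking, data}
  end

  # The chosen node never answered (e.g. a sync cycle); try another one.
  def syncing(:state_timeout, :sync_stalled, data) do
    {:keep_state, data, [{:next_event, :internal, :pick_node}]}
  end

  # Normal case: the sync target answered, move on to tracking.
  def syncing(:info, {:sync_recv, _from, _clock, _registry}, data) do
    {:next_state, :tracking, data}
  end

  def tracking(_event_type, _event, data), do: {:keep_state, data}
end
```

The alternative mentioned above (declining the sync request instead of queueing it as pending_sync_request) would avoid the extra timer entirely, at the cost of the requester having to pick a different node when it is refused.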