Open kzemek opened 6 years ago
Please see https://github.com/kzemek/swarm-deadlock-repro for reliable reproduction of the issue.
These are the logs produced with `debug: true`: https://gist.github.com/kzemek/f5067111dcc5b1c40803b12966cd1a1f#file-gistfile1-txt
There are no more debug logs after that point.
I've also tried manipulating the choice of sync node in hopes that it would solve the lock: https://github.com/kzemek/swarm/commit/28516d93413fa41a54281ee0c3bb0f7a92a4058e
But instead, the states of the `Swarm.Tracker` processes got stranger: https://gist.github.com/kzemek/f5067111dcc5b1c40803b12966cd1a1f#file-nodes_sync_to_smallest-txt
All nodes tried to sync to `repro_2` (the "smallest" node), except `repro_2` itself, which synced to `repro_3`. `repro_3` synced successfully and was put into the `:tracking` state, while at the same time `repro_2` was put into `:awaiting_sync_ack` and sent the cast `{sync_recv,<16250.182.0>,{{0,1},0},[]}` to `repro_3`. But the `sync_recv` cast is not handled in the `:tracking` state, so `repro_2` got stuck, and so did the other nodes that tried to sync to it.
This particular issue is not there when reverting to commit c305633 (before https://github.com/bitwalker/swarm/commit/412bad990c69748dac300bd69e6a26b988e71b0). The nodes all go into the `:tracking` state almost instantly.
Seeing this issue as well. When I revert to version 3.1 I don't see any problems with deadlocking on startup.
We've been having this issue as well, and I'm pretty sure we also had it in 3.3.1.
In our case we observed the following scenario. Let's say we have nodes A, B and C, and the following happens:
A - :sync -> B
B - :sync -> C
C - :sync -> A
All nodes are now in the `syncing` state, waiting for a `:sync_recv` message.
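The circular wait described above can be checked mechanically. The following is a minimal Python sketch, not Swarm code: it assumes each syncing node's current sync target has been collected into a plain dictionary (for example by reading the tracker state on every node), and the function name `find_sync_cycle` is hypothetical.

```python
def find_sync_cycle(sync_target):
    """Return a list of nodes that wait on each other in a cycle, or None.

    sync_target maps a syncing node to the node it sent :sync to.
    Nodes that are not keys of the mapping are assumed to be idle or tracking.
    """
    for start in sync_target:
        path = []
        seen = set()
        node = start
        # Follow the chain of sync targets until it ends or repeats.
        while node in sync_target and node not in seen:
            seen.add(node)
            path.append(node)
            node = sync_target[node]
        if node in seen:
            # The chain came back to a node already on the path: a cycle.
            return path[path.index(node):]
    return None


# The scenario from the comment: A -> B, B -> C, C -> A.
cycle = find_sync_cycle({"A": "B", "B": "C", "C": "A"})
print(cycle)
```

For the three-node scenario this reports all of A, B and C as part of one cycle, which matches the observation that every node is waiting for a `:sync_recv` that never arrives.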
So far we have resolved this with a state timeout in `syncing`, which stops the syncing and tries another node. It seems to work fine; however, this approach caused a few complications and made the tracker a bit more complex. So a simpler approach could be to drop the `pending_sync_request` strategy and just decline the sync request while syncing.
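To illustrate why declining could break the circular wait, here is a toy Python model that compares the two policies. It is a sketch under stated assumptions, not Swarm's implementation: the function `simulate`, the policy names, and in particular the rule that a node starts tracking on its own once every peer has declined are all hypothetical. The state-timeout approach is not modelled.

```python
from collections import deque


def simulate(first_choice, policy):
    """Model nodes that all start in 'syncing' after sending :sync to one peer.

    policy="queue":   a syncing target parks the request until it is tracking.
    policy="decline": a syncing target refuses at once and the requester
                      moves on to its next peer (assumed behaviour).
    """
    nodes = list(first_choice)
    state = {n: "syncing" for n in nodes}

    if policy == "queue":
        # A parked request is answered only once the target reaches tracking.
        progressed = True
        while progressed:
            progressed = False
            for n in nodes:
                if state[n] == "syncing" and state[first_choice[n]] == "tracking":
                    state[n] = "tracking"
                    progressed = True
        return state

    # Decline policy: first choice first, then the other peers in name order.
    todo = {
        n: deque([first_choice[n]]
                 + sorted(p for p in nodes if p not in (n, first_choice[n])))
        for n in nodes
    }
    pending = deque(nodes)
    while pending:
        n = pending.popleft()
        if not todo[n]:
            # Assumption: with nobody left to ask, the node starts tracking alone.
            state[n] = "tracking"
            continue
        target = todo[n].popleft()
        if state[target] == "tracking":
            state[n] = "tracking"   # sync accepted by a tracking peer
        else:
            pending.append(n)       # declined, try the next peer later
    return state


ring = {"A": "B", "B": "C", "C": "A"}
print(simulate(ring, "queue"))
print(simulate(ring, "decline"))
```

In this model the `queue` policy leaves all three nodes in `syncing` forever, while the `decline` policy always terminates because every refusal shortens a finite list of peers. How a real node should proceed after every peer has declined is a design decision the sketch does not settle.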
I'm having an issue similar to #60, reproducible very often when I bring up containers with the app at roughly the same time. It looks like each node is waiting for another one, and they're perpetually stuck in the `:syncing` state. Here are the `:sys.get_status(Swarm.Tracker)` results from my 5 nodes: https://pastebin.com/EYLg6YNE . No custom options set, all defaults; clustering with the `libcluster` gossip strategy.