bitwalker / swarm

Easy clustering, registration, and distribution of worker processes for Erlang/Elixir
MIT License

Expected callbacks not received after network heals #89

Open markmeeus opened 6 years ago

markmeeus commented 6 years ago


Thanks for the great lib, awesome work!

We are building a system that has to process a stream of messages from a remote system. To be more precise, the stream consists of update messages for specific resources.

So we are using Swarm to manage worker processes for these resources. We create the process when the first message arrives, and let it die from a GenServer timeout after a set idle interval.
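To make the setup concrete, here is a minimal sketch of such a worker, not our actual code. The module names `ResourceWorker` and `ResourceRouter`, the state shape, and the 60 second default are hypothetical. `Swarm.whereis_or_register_name/4` is the only Swarm call used, and linking the worker directly via `start_link` is a simplification (the Swarm README starts workers under a supervisor).

```elixir
defmodule ResourceWorker do
  # Hypothetical worker: one process per resource, stops itself when idle.
  use GenServer

  @idle_timeout 60_000

  def start_link(resource_id, idle_timeout \\ @idle_timeout) do
    GenServer.start_link(__MODULE__, {resource_id, idle_timeout})
  end

  def update(pid, message), do: GenServer.call(pid, {:update, message})

  @impl true
  def init({resource_id, idle_timeout}) do
    state = %{id: resource_id, updates: [], idle_timeout: idle_timeout}
    # The third element arms the GenServer idle timer.
    {:ok, state, idle_timeout}
  end

  @impl true
  def handle_call({:update, message}, _from, state) do
    state = %{state | updates: [message | state.updates]}
    # Returning the timeout again re-arms the idle timer after every message.
    {:reply, :ok, state, state.idle_timeout}
  end

  @impl true
  def handle_info(:timeout, state) do
    # No message arrived within the idle interval: stop normally.
    {:stop, :normal, state}
  end
end

defmodule ResourceRouter do
  # Hypothetical entry point: look the worker up by name, or ask Swarm to
  # start it on whichever node the distribution strategy selects.
  def dispatch(resource_id, message) do
    {:ok, pid} =
      Swarm.whereis_or_register_name(resource_id, ResourceWorker, :start_link, [resource_id])

    ResourceWorker.update(pid, message)
  end
end
```

With this shape, `ResourceRouter.dispatch("RESOURCE400", msg)` would create the worker on the first message and reuse it afterwards, assuming Swarm is running on the node.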

We are now in the process of testing the app for a network split situation, and it looks like something is a bit off, or at least, we are a bit confused :-)

First of all, we are using the Distribution.Ring since we will be deploying on an OpenShift instance (Kubernetes). We are testing this by simply connecting 2 iex nodes running our app with Node.connect/1 and Node.disconnect/1.
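For reference, the strategy is selected through the `:swarm` application environment. A minimal config sketch, assuming a standard `config/config.exs` (on older Elixir versions the first line would be `use Mix.Config` instead):

```elixir
# config/config.exs
import Config

# Select the consistent-hash ring strategy for placing workers on nodes.
config :swarm,
  distribution_strategy: Swarm.Distribution.Ring
```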

What happens is the following:

-> When the 2 nodes are connected, a new message for a new resource creates a process on one of the 2 nodes. Perfect.

-> When the 2 nodes are disconnected, all processes that existed before the disconnect are started (if they were not already running) on both machines. That still makes sense.

-> When the 2 nodes are connected again, the processes keep running on one of the nodes and the processes on the other node receive a {:swarm, :die}. This is strange: we were actually expecting them to be relocated, since the Ring uses a consistent hash?

While the nodes were disconnected, both processes could have processed messages, and they may be in a conflicting state. We have a way to actually resolve these conflicts, and we were counting on the :resolve_conflict message to inform us about this situation.

However, none of the :begin_handoff, :end_handoff or :resolve_conflict messages are received on either node.

Our begin_handoff returns a :resume tuple, but since it is never called ...
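For context, this is roughly the shape of the handlers we are talking about. It is a sketch following the message patterns in the Swarm README, not our production code: the `HandoffWorker` name, the state shape, the `:state` call, and the `merge/2` conflict strategy are all hypothetical.

```elixir
defmodule HandoffWorker do
  # Hypothetical worker showing the four Swarm messages discussed above.
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, %{name: name, updates: []})

  @impl true
  def init(state), do: {:ok, state}

  # Sent to the old process before it is moved. Replying {:resume, term}
  # asks Swarm to pass `term` on to the new process.
  @impl true
  def handle_call({:swarm, :begin_handoff}, _from, state) do
    {:reply, {:resume, state}, state}
  end

  # Illustrative helper (not part of Swarm) to inspect the current state.
  def handle_call(:state, _from, state), do: {:reply, state, state}

  # Sent to the new process with the state handed off by the old one.
  @impl true
  def handle_cast({:swarm, :end_handoff, handed_off}, _state) do
    {:noreply, handed_off}
  end

  # Sent to the surviving process after a netsplit heals, carrying the state
  # of the duplicate from the other side of the split.
  def handle_cast({:swarm, :resolve_conflict, other}, state) do
    {:noreply, merge(state, other)}
  end

  # Sent when Swarm wants this copy of the process to stop.
  @impl true
  def handle_info({:swarm, :die}, state), do: {:stop, :shutdown, state}

  # Assumed conflict strategy: keep the union of both update lists.
  defp merge(state, other) do
    %{state | updates: Enum.uniq(state.updates ++ other.updates)}
  end
end
```

In our test only the `{:swarm, :die}` clause is ever hit; the other three handlers never run.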

Could you help us out?

Here are some debug logs of the disconnect/connect:

Node A "attila" => disconnect

iex(attila@> Node.disconnect :"mark@"
*DBG* 'Elixir.Swarm.Tracker' receive info {'DOWN',#Ref<0.2529274800.2416967684.136847>,process,<22512.560.0>,
             noconnection} in state tracking
iex(attila@> *DBG* 'Elixir.Swarm.Tracker' consume info {'DOWN',#Ref<0.2529274800.2416967684.136847>,process,<22512.560.0>,
             noconnection} in state tracking
*DBG* 'Elixir.Swarm.Tracker' receive info {nodedown,'mark@',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {nodedown,'mark@',[{node_type,visible}]} in state tracking
[debug] [swarm on attila@] [tracker:handle_monitor] lost connection to "RESOURCE401" (#PID<22512.560.0>) on mark@, node is down
[info] [swarm on attila@] [tracker:nodedown] nodedown mark@
[debug] [swarm on attila@] [tracker:handle_topology_change] topology change (nodedown for mark@
[debug] [swarm on attila@] [tracker:handle_topology_change] restarting "RESOURCE400" on attila@
[debug] [swarm on attila@] [tracker:do_track] starting "RESOURCE400" on attila@
[debug] [swarm on attila@] [tracker:do_track] started "RESOURCE400" on attila@
[debug] [swarm on attila@] [tracker:handle_topology_change] restarting "RESOURCE401" on attila@
[debug] [swarm on attila@] [tracker:do_track] starting "RESOURCE401" on attila@
[debug] [swarm on attila@] [tracker:do_track] started "RESOURCE401" on attila@
[info] [swarm on attila@] [tracker:handle_topology_change] topology change complete

Node B "mark" => received disconnect

*DBG* 'Elixir.Swarm.Tracker' receive info {'DOWN',#Ref<0.899263733.806354945.65248>,process,<24381.547.0>,
             noconnection} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {'DOWN',#Ref<0.899263733.806354945.65248>,process,<24381.547.0>,
             noconnection} in state tracking
*DBG* 'Elixir.Swarm.Tracker' receive info {nodedown,'attila@',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {nodedown,'attila@',[{node_type,visible}]} in state tracking
[debug] [swarm on mark@] [tracker:handle_monitor] lost connection to "RESOURCE402" (#PID<24381.547.0>) on attila@, node is down
[info] [swarm on mark@] [tracker:nodedown] nodedown attila@
[debug] [swarm on mark@] [tracker:handle_topology_change] topology change (nodedown for attila@
[debug] [swarm on mark@] [tracker:handle_topology_change] restarting "RESOURCE402" on mark@
[debug] [swarm on mark@] [tracker:do_track] starting "RESOURCE402" on mark@
[debug] [swarm on mark@] [tracker:do_track] started "RESOURCE402" on mark@
[info] [swarm on mark@] [tracker:handle_topology_change] topology change complete

... Some processing going on while nodes are disconnected ...

Node A "attila" => connect with Node B

iex(attila@> Node.connect :"mark@"   
*DBG* 'Elixir.Swarm.Tracker' receive info {nodeup,'mark@',[{node_type,visible}]} in state tracking
iex(attila@> *DBG* 'Elixir.Swarm.Tracker' consume info {nodeup,'mark@',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync,<22512.358.0>,{1,0}} in state syncing
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync,<22512.358.0>,{1,0}} in state syncing
[info] [swarm on attila@] [tracker:ensure_swarm_started_on_remote_node] nodeup mark@
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_recv,<22512.358.0>,
              #{all_workers => true,
                mfa =>
              #{all_workers => true,
                mfa =>
              #{all_workers => true,
                mfa =>
              {{0,{0,2,0}},{{1,0},{0,2,0}}}}]} in state syncing
"{:swarm, :die} RESOURCE402"
"{:swarm, :die} RESOURCE401"
"{:swarm, :die} RESOURCE400"
[info] [swarm on attila@] [tracker:cluster_wait] joining cluster..
[info] [swarm on attila@] [tracker:cluster_wait] found connected nodes: [:"mark@"]
[info] [swarm on attila@] [tracker:cluster_wait] selected sync node: mark@
[info] [swarm on attila@] [tracker:syncing] syncing from mark@ based on node precedence
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_recv,<22512.358.0>,
              #{all_workers => true,
                mfa =>
              #{all_workers => true,
                mfa =>
              #{all_workers => true,
                mfa =>
              {{0,{0,2,0}},{{1,0},{0,2,0}}}}]} in state syncing
[info] [swarm on attila@] [tracker:syncing] received registry from mark@, merging..
[info] [swarm on attila@] [tracker:syncing] local synchronization with mark@ complete!
[info] [swarm on attila@] [tracker:resolve_pending_sync_requests] pending sync requests cleared

Node B "mark" => receive 2nd connect

*DBG* 'Elixir.Swarm.Tracker' receive info {nodeup,'attila@',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {nodeup,'attila@',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync,<24381.349.0>,{1,0}} in state syncing
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync,<24381.349.0>,{1,0}} in state syncing
[info] [swarm on mark@] [tracker:ensure_swarm_started_on_remote_node] nodeup attila@
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<24381.349.0>,{{0,1},{1,1}},{untrack,<24381.547.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' postpone info {event,<24381.349.0>,{{0,1},{1,1}},{untrack,<24381.547.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<24381.349.0>,{{0,2},{1,2}},{untrack,<24381.550.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' postpone info {event,<24381.349.0>,{{0,2},{1,2}},{untrack,<24381.550.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<24381.349.0>,{{0,3},{1,3}},{untrack,<24381.549.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' postpone info {event,<24381.349.0>,{{0,3},{1,3}},{untrack,<24381.549.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_ack,<24381.349.0>,
              #{all_workers => true,
                mfa =>
              #{all_workers => true,
                mfa =>
              #{all_workers => true,
                mfa =>
              {{0,{0,2,0}},{{1,0},{0,2,0}}}}]} in state awaiting_sync_ack
[info] [swarm on mark@] [tracker:cluster_wait] joining cluster..
[info] [swarm on mark@] [tracker:cluster_wait] found connected nodes: [:"attila@"]
[info] [swarm on mark@] [tracker:cluster_wait] selected sync node: attila@
[info] [swarm on mark@] [tracker:syncing] syncing to attila@ based on node precedence
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_ack,<24381.349.0>,
              #{all_workers => true,
                mfa =>
              #{all_workers => true,
                mfa =>
              #{all_workers => true,
                mfa =>
              {{0,{0,2,0}},{{1,0},{0,2,0}}}}]} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' consume info {event,<24381.349.0>,{{0,1},{1,1}},{untrack,<24381.547.0>}} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {event,<24381.349.0>,{{0,2},{1,2}},{untrack,<24381.550.0>}} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {event,<24381.349.0>,{{0,3},{1,3}},{untrack,<24381.549.0>}} in state tracking
[info] [swarm on mark@] [tracker:awaiting_sync_ack] received sync acknowledgement from attila@, syncing with remote registry
[info] [swarm on mark@] [tracker:awaiting_sync_ack] local synchronization with attila@ complete!
[info] [swarm on mark@] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[debug] [swarm on mark@] [tracker:handle_replica_event] replica event: untrack #PID<24381.547.0>
[debug] [swarm on mark@] [tracker:handle_replica_event] replica event: untrack #PID<24381.550.0>
[debug] [swarm on mark@] [tracker:handle_replica_event] replica event: untrack #PID<24381.549.0>

fredr commented 5 years ago

@markmeeus did you ever resolve this? If so, what was the solution?

fredr commented 5 years ago

I've noticed that I have the same problem when running a cluster of 2 nodes. When I run 3 nodes, the :begin_handoff and :resolve_conflict callbacks are called as expected.

x-ji commented 4 years ago

Same here. The processes are untracked but are never restarted on the other node for some reason. None of the events are triggered. It seems from some other issues that you may need to handle graceful shutdowns manually, but they only talk about "handoffs", whereas I just want to do a restart... So it's still confusing.