bitwalker / swarm

Easy clustering, registration, and distribution of worker processes for Erlang/Elixir
MIT License

Expected callbacks not received after network heals #89

Open markmeeus opened 6 years ago

markmeeus commented 6 years ago

Hi,

Thanks for the great lib, awesome work!

We are building a system where we have to process a stream of messages from a remote system. To be more precise, the stream consists of update messages for specific resources.

So we are using Swarm to manage worker processes for these resources. We create a process when the first message for a resource arrives, and let it die from a GenServer timeout after a set idle interval.
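For reference, here is a minimal sketch of how such a worker might look (module, function and field names here are hypothetical, not our exact code): the worker is started through Swarm via a registered MFA and stops itself with a GenServer timeout once it has been idle for a while.

```elixir
defmodule ResourceWorker do
  use GenServer

  @idle_timeout :timer.minutes(5)

  # Started (indirectly) by Swarm via the registered MFA, e.g.
  #   Swarm.register_name({:resource, id}, ResourceWorker, :start_link, [id])
  def start_link(resource_id) do
    GenServer.start_link(__MODULE__, resource_id)
  end

  def init(resource_id) do
    # The third tuple element arms the idle timeout.
    {:ok, %{id: resource_id, updates: []}, @idle_timeout}
  end

  def handle_cast({:update, msg}, state) do
    # Each update resets the idle timeout.
    {:noreply, %{state | updates: [msg | state.updates]}, @idle_timeout}
  end

  def handle_info(:timeout, state) do
    # No messages for @idle_timeout; let the process die normally.
    {:stop, :normal, state}
  end
end
```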

We are now in the process of testing the app for a network split situation, and it looks like something is a bit off, or at least, we are a bit confused :-)

First of all, we are using Distribution.Ring since we will be deploying on an OpenShift instance (Kubernetes). We are testing this by simply connecting and disconnecting 2 iex nodes running our app with Node.connect/1 and Node.disconnect/1.
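For completeness, this is roughly how the ring strategy is selected in config (per Swarm's docs; Ring should also be the default strategy):

```elixir
# config/config.exs
config :swarm,
  distribution_strategy: Swarm.Distribution.Ring
```

The split itself is then simulated from iex with Node.disconnect/1 and healed again with Node.connect/1, as in the logs below.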

What happens is the following:

- When the 2 nodes are connected, a new message for a new resource creates a process on one of the 2 nodes. Perfect.
- When the 2 nodes are disconnected, all processes that existed before the disconnect are started (if they were not already running) on both nodes. Still makes sense.
- When the 2 nodes are connected again, the processes keep running on one of the nodes and the processes on the other node receive a {:swarm, :die}. Strange, we were actually expecting them to be relocated, since the Ring uses a consistent hash?

While the nodes were disconnected, both processes could have processed messages, and they may be in a conflicting state. We have a way to actually resolve these conflicts, and we were counting on the :resolve_conflict message to inform us about this situation.

However, none of the :begin_handoff, :end_handoff or :resolve_conflict messages are received on either node.

Our begin_handoff returns a :resume tuple, but since it is not called ...
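For context, our handlers follow the shape documented in Swarm's README; roughly (a simplified sketch inside the worker GenServer, where merge_states/2 is a placeholder for our conflict-resolution logic):

```elixir
# Expected when Swarm wants to move the process to another node.
def handle_call({:swarm, :begin_handoff}, _from, state) do
  # Hand the current state over so the new process can resume from it.
  {:reply, {:resume, state}, state}
end

# The process on the receiving node gets the handed-off state here.
def handle_cast({:swarm, :end_handoff, handed_off_state}, state) do
  {:noreply, merge_states(state, handed_off_state)}
end

# Expected when both nodes track a process with the same name after a netsplit heals.
def handle_cast({:swarm, :resolve_conflict, other_state}, state) do
  {:noreply, merge_states(state, other_state)}
end

# What we actually observe instead: Swarm tells the duplicate process to die.
def handle_info({:swarm, :die}, state) do
  {:stop, :shutdown, state}
end
```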

Could you help us out?

Here are some debug logs of the disconnect/connect:

Node A "attila" => disconnect

iex(attila@10.68.38.221)2> Node.disconnect :"mark@10.68.38.221"
*DBG* 'Elixir.Swarm.Tracker' receive info {'DOWN',#Ref<0.2529274800.2416967684.136847>,process,<22512.560.0>,
             noconnection} in state tracking
true
iex(attila@10.68.38.221)3> *DBG* 'Elixir.Swarm.Tracker' consume info {'DOWN',#Ref<0.2529274800.2416967684.136847>,process,<22512.560.0>,
             noconnection} in state tracking
*DBG* 'Elixir.Swarm.Tracker' receive info {nodedown,'mark@10.68.38.221',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {nodedown,'mark@10.68.38.221',[{node_type,visible}]} in state tracking
[debug] [swarm on attila@10.68.38.221] [tracker:handle_monitor] lost connection to "RESOURCE401" (#PID<22512.560.0>) on mark@10.68.38.221, node is down
[info] [swarm on attila@10.68.38.221] [tracker:nodedown] nodedown mark@10.68.38.221
[debug] [swarm on attila@10.68.38.221] [tracker:handle_topology_change] topology change (nodedown for mark@10.68.38.221)
[debug] [swarm on attila@10.68.38.221] [tracker:handle_topology_change] restarting "RESOURCE400" on attila@10.68.38.221
[debug] [swarm on attila@10.68.38.221] [tracker:do_track] starting "RESOURCE400" on attila@10.68.38.221
[debug] [swarm on attila@10.68.38.221] [tracker:do_track] started "RESOURCE400" on attila@10.68.38.221
[debug] [swarm on attila@10.68.38.221] [tracker:handle_topology_change] restarting "RESOURCE401" on attila@10.68.38.221
[debug] [swarm on attila@10.68.38.221] [tracker:do_track] starting "RESOURCE401" on attila@10.68.38.221
[debug] [swarm on attila@10.68.38.221] [tracker:do_track] started "RESOURCE401" on attila@10.68.38.221
[info] [swarm on attila@10.68.38.221] [tracker:handle_topology_change] topology change complete

Node B "mark" => received disconnect

*DBG* 'Elixir.Swarm.Tracker' receive info {'DOWN',#Ref<0.899263733.806354945.65248>,process,<24381.547.0>,
             noconnection} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {'DOWN',#Ref<0.899263733.806354945.65248>,process,<24381.547.0>,
             noconnection} in state tracking
*DBG* 'Elixir.Swarm.Tracker' receive info {nodedown,'attila@10.68.38.221',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {nodedown,'attila@10.68.38.221',[{node_type,visible}]} in state tracking
[debug] [swarm on mark@10.68.38.221] [tracker:handle_monitor] lost connection to "RESOURCE402" (#PID<24381.547.0>) on attila@10.68.38.221, node is down
[info] [swarm on mark@10.68.38.221] [tracker:nodedown] nodedown attila@10.68.38.221
[debug] [swarm on mark@10.68.38.221] [tracker:handle_topology_change] topology change (nodedown for attila@10.68.38.221)
[debug] [swarm on mark@10.68.38.221] [tracker:handle_topology_change] restarting "RESOURCE402" on mark@10.68.38.221
[debug] [swarm on mark@10.68.38.221] [tracker:do_track] starting "RESOURCE402" on mark@10.68.38.221
[debug] [swarm on mark@10.68.38.221] [tracker:do_track] started "RESOURCE402" on mark@10.68.38.221
[info] [swarm on mark@10.68.38.221] [tracker:handle_topology_change] topology change complete

... Some processing going on while nodes are disconnected ...

Node A "attila" => connect with Node B

iex(attila@10.68.38.221)3> Node.connect :"mark@10.68.38.221"   
*DBG* 'Elixir.Swarm.Tracker' receive info {nodeup,'mark@10.68.38.221',[{node_type,visible}]} in state tracking
true
iex(attila@10.68.38.221)4> *DBG* 'Elixir.Swarm.Tracker' consume info {nodeup,'mark@10.68.38.221',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync,<22512.358.0>,{1,0}} in state syncing
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync,<22512.358.0>,{1,0}} in state syncing
[info] [swarm on attila@10.68.38.221] [tracker:ensure_swarm_started_on_remote_node] nodeup mark@10.68.38.221
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_recv,<22512.358.0>,
         {{0,1},0},
         [{entry,<<"RESOURCE402">>,<22512.568.0>,
              #Ref<22512.899263733.806354946.64013>,
              #{all_workers => true,
                mfa =>
                    {'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
                        [<<"RESOURCE402">>]}},
              {{0,{0,10,0}},{{1,0},{0,10,0}}}},
          {entry,<<"RESOURCE401">>,<22512.560.0>,
              #Ref<22512.899263733.806354949.65462>,
              #{all_workers => true,
                mfa =>
                    {'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
                        [<<"RESOURCE401">>]}},
              {{0,{0,4,0}},{{1,0},{0,4,0}}}},
          {entry,<<"RESOURCE400">>,<22512.557.0>,
              #Ref<22512.899263733.806354949.64803>,
              #{all_workers => true,
                mfa =>
                    {'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
                        [<<"RESOURCE400">>]}},
              {{0,{0,2,0}},{{1,0},{0,2,0}}}}]} in state syncing
"{:swarm, :die} RESOURCE402"
"{:swarm, :die} RESOURCE401"
"{:swarm, :die} RESOURCE400"
[info] [swarm on attila@10.68.38.221] [tracker:cluster_wait] joining cluster..
[info] [swarm on attila@10.68.38.221] [tracker:cluster_wait] found connected nodes: [:"mark@10.68.38.221"]
[info] [swarm on attila@10.68.38.221] [tracker:cluster_wait] selected sync node: mark@10.68.38.221
[info] [swarm on attila@10.68.38.221] [tracker:syncing] syncing from mark@10.68.38.221 based on node precedence
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_recv,<22512.358.0>,
         {{0,1},0},
         [{entry,<<"RESOURCE402">>,<22512.568.0>,
              #Ref<22512.899263733.806354946.64013>,
              #{all_workers => true,
                mfa =>
                    {'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
                        [<<"RESOURCE402">>]}},
              {{0,{0,10,0}},{{1,0},{0,10,0}}}},
          {entry,<<"RESOURCE401">>,<22512.560.0>,
              #Ref<22512.899263733.806354949.65462>,
              #{all_workers => true,
                mfa =>
                    {'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
                        [<<"RESOURCE401">>]}},
              {{0,{0,4,0}},{{1,0},{0,4,0}}}},
          {entry,<<"RESOURCE400">>,<22512.557.0>,
              #Ref<22512.899263733.806354949.64803>,
              #{all_workers => true,
                mfa =>
                    {'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
                        [<<"RESOURCE400">>]}},
              {{0,{0,2,0}},{{1,0},{0,2,0}}}}]} in state syncing
[info] [swarm on attila@10.68.38.221] [tracker:syncing] received registry from mark@10.68.38.221, merging..
[info] [swarm on attila@10.68.38.221] [tracker:syncing] local synchronization with mark@10.68.38.221 complete!
[info] [swarm on attila@10.68.38.221] [tracker:resolve_pending_sync_requests] pending sync requests cleared

Node B "mark" => receive 2nd connect

*DBG* 'Elixir.Swarm.Tracker' receive info {nodeup,'attila@10.68.38.221',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {nodeup,'attila@10.68.38.221',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync,<24381.349.0>,{1,0}} in state syncing
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync,<24381.349.0>,{1,0}} in state syncing
[info] [swarm on mark@10.68.38.221] [tracker:ensure_swarm_started_on_remote_node] nodeup attila@10.68.38.221
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<24381.349.0>,{{0,1},{1,1}},{untrack,<24381.547.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' postpone info {event,<24381.349.0>,{{0,1},{1,1}},{untrack,<24381.547.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<24381.349.0>,{{0,2},{1,2}},{untrack,<24381.550.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' postpone info {event,<24381.349.0>,{{0,2},{1,2}},{untrack,<24381.550.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<24381.349.0>,{{0,3},{1,3}},{untrack,<24381.549.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' postpone info {event,<24381.349.0>,{{0,3},{1,3}},{untrack,<24381.549.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_ack,<24381.349.0>,
         {1,3},
         [{entry,<<"RESOURCE402">>,<0.568.0>,
              #Ref<24381.2529274800.2416967683.128726>,
              #{all_workers => true,
                mfa =>
                    {'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
                        [<<"RESOURCE402">>]}},
              {{0,{0,10,0}},{{1,0},{0,10,0}}}},
          {entry,<<"RESOURCE401">>,<0.560.0>,
              #Ref<24381.2529274800.2416967683.128729>,
              #{all_workers => true,
                mfa =>
                    {'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
                        [<<"RESOURCE401">>]}},
              {{0,{0,4,0}},{{1,0},{0,4,0}}}},
          {entry,<<"RESOURCE400">>,<0.557.0>,
              #Ref<24381.2529274800.2416967683.128731>,
              #{all_workers => true,
                mfa =>
                    {'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
                        [<<"RESOURCE400">>]}},
              {{0,{0,2,0}},{{1,0},{0,2,0}}}}]} in state awaiting_sync_ack
[info] [swarm on mark@10.68.38.221] [tracker:cluster_wait] joining cluster..
[info] [swarm on mark@10.68.38.221] [tracker:cluster_wait] found connected nodes: [:"attila@10.68.38.221"]
[info] [swarm on mark@10.68.38.221] [tracker:cluster_wait] selected sync node: attila@10.68.38.221
[info] [swarm on mark@10.68.38.221] [tracker:syncing] syncing to attila@10.68.38.221 based on node precedence
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_ack,<24381.349.0>,
         {1,3},
         [{entry,<<"RESOURCE402">>,<0.568.0>,
              #Ref<24381.2529274800.2416967683.128726>,
              #{all_workers => true,
                mfa =>
                    {'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
                        [<<"RESOURCE402">>]}},
              {{0,{0,10,0}},{{1,0},{0,10,0}}}},
          {entry,<<"RESOURCE401">>,<0.560.0>,
              #Ref<24381.2529274800.2416967683.128729>,
              #{all_workers => true,
                mfa =>
                    {'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
                        [<<"RESOURCE401">>]}},
              {{0,{0,4,0}},{{1,0},{0,4,0}}}},
          {entry,<<"RESOURCE400">>,<0.557.0>,
              #Ref<24381.2529274800.2416967683.128731>,
              #{all_workers => true,
                mfa =>
                    {'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
                        [<<"RESOURCE400">>]}},
              {{0,{0,2,0}},{{1,0},{0,2,0}}}}]} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' consume info {event,<24381.349.0>,{{0,1},{1,1}},{untrack,<24381.547.0>}} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {event,<24381.349.0>,{{0,2},{1,2}},{untrack,<24381.550.0>}} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {event,<24381.349.0>,{{0,3},{1,3}},{untrack,<24381.549.0>}} in state tracking
[info] [swarm on mark@10.68.38.221] [tracker:awaiting_sync_ack] received sync acknowledgement from attila@10.68.38.221, syncing with remote registry
[info] [swarm on mark@10.68.38.221] [tracker:awaiting_sync_ack] local synchronization with attila@10.68.38.221 complete!
[info] [swarm on mark@10.68.38.221] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[debug] [swarm on mark@10.68.38.221] [tracker:handle_replica_event] replica event: untrack #PID<24381.547.0>
[debug] [swarm on mark@10.68.38.221] [tracker:handle_replica_event] replica event: untrack #PID<24381.550.0>
[debug] [swarm on mark@10.68.38.221] [tracker:handle_replica_event] replica event: untrack #PID<24381.549.0>
fredr commented 5 years ago

@markmeeus did you ever resolve this? and if so, what was the solution?

fredr commented 5 years ago

I've noticed that I have the same problem when running a cluster of 2 nodes; when I run 3 nodes, the :begin_handoff and :resolve_conflict callbacks are called as expected.

x-ji commented 4 years ago

Same here. The processes are untracked but are never restarted on the other node for some reason, and none of the events are called. It seems from some other issues that you may need to handle graceful shutdowns manually: https://github.com/bitwalker/swarm/pull/83. But they're only talking about "handoffs" even though I just want to do a restart... So it's still confusing.
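In case it helps anyone else: based on Swarm's documented handoff protocol, returning :restart from :begin_handoff (instead of :resume) should ask Swarm to simply restart the process from its registered MFA on the target node rather than hand any state over. A minimal sketch of that clause in the worker:

```elixir
# Ask Swarm for a plain restart on the new node instead of a state handoff.
def handle_call({:swarm, :begin_handoff}, _from, state) do
  {:reply, :restart, state}
end
```

Whether this clause actually fires on a network heal in a 2-node cluster is exactly what this issue is about, though.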