I have some questions about how handoff and conflicts are supposed to work in a netsplit scenario.
Example code
In my examples I use this code, just copy and paste it in iex, and connect the nodes via Node.connect/1
defmodule Example.ProcessSupervisor do
use Supervisor
require Logger
def start_link() do
Supervisor.start_link(__MODULE__, [], name: __MODULE__)
end
def init(_) do
children = [
worker(Example.Process, [], restart: :transient)
]
supervise(children, strategy: :simple_one_for_one)
end
@doc """
Registers a new worker, and creates the worker process
"""
def register() do
{:ok, _pid} = Supervisor.start_child(__MODULE__, [])
end
def add_items(name, items) do
GenServer.cast({:via, :swarm, name}, {:add_items, items})
end
def get_queue(name) do
GenServer.call({:via, :swarm, name}, :get_queue)
end
end
defmodule Example.Process do
require Logger
def start_link() do
GenServer.start_link(__MODULE__, [])
end
def init(_opts) do
{:ok, %{queue: []}}
end
def handle_call(:get_queue, _from, state) do
{:reply, {:ok, state.queue}, state}
end
def handle_call({:swarm, :begin_handoff}, _from, state) do
Logger.info("begin_handoff, state: #{inspect(state)}")
{:reply, {:resume, state}, state}
end
def handle_cast({:add_items, items}, state) do
{:noreply, %{state | queue: state.queue ++ items}}
end
def handle_cast({:swarm, :end_handoff, remote_state}, state) do
Logger.info("end_handoff, state: #{inspect(state)} remote_state: #{inspect(remote_state)}")
{:noreply, %{state | queue: state.queue ++ remote_state.queue}}
end
def handle_cast({:swarm, :resolve_conflict, remote_state}, state) do
Logger.info(
"resolve_conflict, state: #{inspect(state)}, remote_state: #{inspect(remote_state)}"
)
{:noreply, %{state | queue: state.queue ++ remote_state.queue}}
end
def handle_info({:swarm, :die}, state) do
Logger.info("[node #{inspect(node())}] swarm die")
{:stop, :normal, state}
end
end
{:ok, _pid} = Example.ProcessSupervisor.start_link()
To register a process, add items to it, and get the queue:
Cluster: 3 Nodes connected (:a@localhost, :b@localhost, :c@localhost)
Register a process and alter the state on both sides during netsplit
# Create netsplit
# On node a, run:
Node.disconnect(:b@localhost)
Node.disconnect(:c@localhost)
# Register on both sides
# On node a, run:
Swarm.register_name(:p1, Example.ProcessSupervisor, :register, [])
# On node b, run:
Swarm.register_name(:p1, Example.ProcessSupervisor, :register, [])
# Update states on both sides
# On node a, run:
Example.ProcessSupervisor.add_items(:p1, [:a, :b, :c])
# On node b, run:
Example.ProcessSupervisor.add_items(:p1, [:d, :e, :f])
# Reconnect
# On node a, run:
Node.connect(:b@localhost)
Node.connect(:c@localhost)
# begin_handoff and resolve_conflict are called
# On any node, run (and get back all items, :a,:b,:c,:d,:e,:f):
Example.ProcessSupervisor.get_queue(:p1)
Scenario 2
Cluster: 3 Nodes connected (:a@localhost, :b@localhost, :c@localhost)
Register a process before netsplit, and alter the state on both sides during netsplit
# Register process in the cluster
# On any node, run:
Swarm.register_name(:p1, Example.ProcessSupervisor, :register, [])
# Create netsplit
# On node a, run:
Node.disconnect(:b@localhost)
Node.disconnect(:c@localhost)
# Update states on both sides
# On node a, run:
Example.ProcessSupervisor.add_items(:p1, [:a, :b, :c])
# On node b, run:
Example.ProcessSupervisor.add_items(:p1, [:d, :e, :f])
# Reconnect
# On node a, run:
Node.connect(:b@localhost)
Node.connect(:c@localhost)
# begin_handoff and resolve_conflict are NOT called
# On any node, run (and you'll get back either [:a, :b, :c] or [:d, :e, :f], it will be the state of the pid that was first registered)
Example.ProcessSupervisor.get_queue(:p1)
Questions
The first scenario works just as I would expect, but the second scenario does not. Is this behavior by design or is it a bug?
If running a cluster of 2 nodes instead of 3, the behavior changes for the first scenario, then the begin_handoff and resolve_conflict are never called, and one of the states are discarded. Is there any limitation on minimum number of nodes or something, to get the handoff to work properly?
(I believe these questions might be related to issue #89 and/or issue #86 )
I have some questions about how handoff and conflicts are supposed to work in a netsplit scenario.
Example code
In my examples I use this code, just copy and paste it in iex, and connect the nodes via
Node.connect/1
To register a process, add items to it, and get the queue:
Scenario 1
Cluster: 3 Nodes connected (:a@localhost, :b@localhost, :c@localhost) Register a process and alter the state on both sides during netsplit
Scenario 2
Cluster: 3 Nodes connected (:a@localhost, :b@localhost, :c@localhost) Register a process before netsplit, and alter the state on both sides during netsplit
Questions
begin_handoff
andresolve_conflict
are never called, and one of the states are discarded. Is there any limitation on minimum number of nodes or something, to get the handoff to work properly?(I believe these questions might be related to issue #89 and/or issue #86 )