bitwalker / swarm

Easy clustering, registration, and distribution of worker processes for Erlang/Elixir
MIT License
1.2k stars 103 forks source link

Netsplit conflict behavior #124

Closed fredr closed 3 years ago

fredr commented 5 years ago

I have some questions about how handoff and conflicts are supposed to work in a netsplit scenario.

Example code

In my examples I use this code, just copy and paste it in iex, and connect the nodes via Node.connect/1

defmodule Example.ProcessSupervisor do
  use Supervisor
  require Logger

  def start_link() do
    Supervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  def init(_) do
    children = [
      worker(Example.Process, [], restart: :transient)
    ]

    supervise(children, strategy: :simple_one_for_one)
  end

  @doc """
  Registers a new worker, and creates the worker process
  """
  def register() do
    {:ok, _pid} = Supervisor.start_child(__MODULE__, [])
  end

  def add_items(name, items) do
    GenServer.cast({:via, :swarm, name}, {:add_items, items})
  end

  def get_queue(name) do
    GenServer.call({:via, :swarm, name}, :get_queue)
  end
end

defmodule Example.Process do
  require Logger

  def start_link() do
    GenServer.start_link(__MODULE__, [])
  end

  def init(_opts) do
    {:ok, %{queue: []}}
  end

  def handle_call(:get_queue, _from, state) do
    {:reply, {:ok, state.queue}, state}
  end

  def handle_call({:swarm, :begin_handoff}, _from, state) do
    Logger.info("begin_handoff, state: #{inspect(state)}")
    {:reply, {:resume, state}, state}
  end

  def handle_cast({:add_items, items}, state) do
    {:noreply, %{state | queue: state.queue ++ items}}
  end

  def handle_cast({:swarm, :end_handoff, remote_state}, state) do
    Logger.info("end_handoff, state: #{inspect(state)} remote_state: #{inspect(remote_state)}")

    {:noreply, %{state | queue: state.queue ++ remote_state.queue}}
  end

  def handle_cast({:swarm, :resolve_conflict, remote_state}, state) do
    Logger.info(
      "resolve_conflict, state: #{inspect(state)}, remote_state: #{inspect(remote_state)}"
    )

    {:noreply, %{state | queue: state.queue ++ remote_state.queue}}
  end

  def handle_info({:swarm, :die}, state) do
    Logger.info("[node #{inspect(node())}] swarm die")
    {:stop, :normal, state}
  end
end

{:ok, _pid} = Example.ProcessSupervisor.start_link()

To register a process, add items to it, and get the queue:

Swarm.register_name(:p1, Example.ProcessSupervisor, :register, [])
Example.ProcessSupervisor.add_items(:p1, [:a, :b, :c])
Example.ProcessSupervisor.get_queue(:p1)

Scenario 1

Cluster: 3 Nodes connected (:a@localhost, :b@localhost, :c@localhost) Register a process and alter the state on both sides during netsplit

# Create netsplit
# On node a, run:
Node.disconnect(:b@localhost)
Node.disconnect(:c@localhost)

# Register on both sides
# On node a, run:
Swarm.register_name(:p1, Example.ProcessSupervisor, :register, [])
# On node b, run:
Swarm.register_name(:p1, Example.ProcessSupervisor, :register, [])

# Update states on both sides
# On node a, run:
Example.ProcessSupervisor.add_items(:p1, [:a, :b, :c])
# On node b, run:
Example.ProcessSupervisor.add_items(:p1, [:d, :e, :f])

# Reconnect
# On node a, run:
Node.connect(:b@localhost)
Node.connect(:c@localhost)

# begin_handoff and resolve_conflict are called
# On any node, run (and get back all items, :a,:b,:c,:d,:e,:f):
Example.ProcessSupervisor.get_queue(:p1)

Scenario 2

Cluster: 3 Nodes connected (:a@localhost, :b@localhost, :c@localhost) Register a process before netsplit, and alter the state on both sides during netsplit

# Register process in the cluster
# On any node, run:
Swarm.register_name(:p1, Example.ProcessSupervisor, :register, [])

# Create netsplit
# On node a, run:
Node.disconnect(:b@localhost)
Node.disconnect(:c@localhost)

# Update states on both sides
# On node a, run:
Example.ProcessSupervisor.add_items(:p1, [:a, :b, :c])
# On node b, run:
Example.ProcessSupervisor.add_items(:p1, [:d, :e, :f])

# Reconnect
# On node a, run:
Node.connect(:b@localhost)
Node.connect(:c@localhost)

# begin_handoff and resolve_conflict are NOT called
# On any node, run (and you'll get back either [:a, :b, :c] or [:d, :e, :f], it will be the state of the pid that was first registered)
Example.ProcessSupervisor.get_queue(:p1)

Questions

  1. The first scenario works just as I would expect, but the second scenario does not. Is this behavior by design or is it a bug?
  2. If running a cluster of 2 nodes instead of 3, the behavior changes for the first scenario, then the begin_handoff and resolve_conflict are never called, and one of the states are discarded. Is there any limitation on minimum number of nodes or something, to get the handoff to work properly?

(I believe these questions might be related to issue #89 and/or issue #86 )

fredr commented 3 years ago

Closing this stale issue