derekkraan / horde

Horde is a distributed Supervisor and Registry backed by DeltaCrdt
MIT License

Clustered app with Horde.Registry not syncing registered process #225

Closed. brainlid closed this issue 3 years ago.

brainlid commented 3 years ago

This is not a timing issue: the registry never syncs, at least not in the way I'd expect.

Version: 0.8.3
Horde Experience Level: Trying out Horde for the first time
Expectation: A GenServer started through the supervisor is registered in the Horde registry and can be looked up from every node in the cluster.
Behavior Observed: The GenServer is started (sometimes on the other node, not the one where the command was run), but the registry on the non-owning node never gets the entry.

I have a very simple app. Here's the application.ex config.

  def start(_type, _args) do
    env = Application.get_env(:tictac, :env)

    children = [
      # Start the Telemetry supervisor
      TictacWeb.Telemetry,
      # Start the PubSub system
      {Phoenix.PubSub, name: Tictac.PubSub},
      # setup for clustering
      {Cluster.Supervisor, [libcluster(env), [name: Tictac.ClusterSupervisor]]},
      # Start the registry for tracking running games
      {Horde.Registry, [name: Tictac.GameRegistry, keys: :unique]},
      {Horde.DynamicSupervisor, [name: Tictac.DistributedSupervisor, strategy: :one_for_one, members: :auto]},
      TictacWeb.Endpoint
    ]

    # See https://hexdocs.pm/elixir/Supervisor.html
    # for other strategies and supported options
    opts = [strategy: :one_for_one, name: Tictac.Supervisor]
    Supervisor.start_link(children, opts)
  end

  # ...

  defp libcluster(other) do
    Logger.info("Using libcluster(_) mode with #{inspect(other)}. Epmd strategy")

    [
      topologies: [
        strategy: Cluster.Strategy.Epmd,
        config: [hosts: [:"a@127.0.0.1", :"b@127.0.0.1"]]
      ]
    ]
  end

NOTE: members: :auto is set on the Horde.DynamicSupervisor.

The app uses libcluster to form the cluster. I start two local nodes and can verify that they are clustered.

Starting the nodes like this:

iex --name a@127.0.0.1 --cookie asdf -S mix
iex --name b@127.0.0.1 --cookie asdf -S mix

Running Node.list on node b shows that they are clustered together:

Node.list
[:"a@127.0.0.1"]

I have a simple GenServer that follows the example given here.

When I create a new GenServer, it correctly creates the instance and registers it on one of the nodes.

Horde.DynamicSupervisor.start_child(Tictac.DistributedSupervisor, {Tictac.GameServer, [name: "ABCD"]})                         
{:ok, #PID<19367.431.0>}

# ... wait 60+ minutes

Horde.Registry.lookup(Tictac.GameRegistry, "ABCD")                             
[]

Running the same lookup on the remote node (where the process was created) finds it:

Horde.Registry.lookup(Tictac.GameRegistry, "ABCD")  
[{#PID<0.431.0>, nil}]

Notice that the PID returned by start_child (#PID<19367.431.0>) is on the remote node; a non-zero first number marks a PID on another node. The lookup only succeeds on the node where the process is running. Even after waiting many minutes, the node where it was not created never finds the entry, so from that node I'm unable to find or send messages to the GenServer.
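A quick way to see this asymmetry from a single iex session is to run the same lookup on every connected node. This is a minimal sketch, assuming both nodes are running the app with the application.ex shown above; it uses only :rpc.call/4, Node.self/0 and Node.list/0, and the "ABCD" key is the one from the session above.

```elixir
# Run in iex on either node. Assumes Tictac.GameRegistry is started on every
# node in the cluster, as in the application.ex above.
for node <- [Node.self() | Node.list()] do
  # Execute Horde.Registry.lookup/2 on the target node and return its result here.
  result = :rpc.call(node, Horde.Registry, :lookup, [Tictac.GameRegistry, "ABCD"])
  IO.puts("#{node}: #{inspect(result)}")
end
```

With the configuration above, the behavior described in this issue would show up as the hosting node printing a one-element list and the other node printing [].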

Here's the GenServer module.

defmodule Tictac.GameServer do
  @moduledoc """
  A GenServer that manages and models the state for a specific game instance.
  """
  use GenServer
  require Logger

  alias __MODULE__

  # Client

  def child_spec(opts) do
    IO.inspect opts, label: "OPTS"
    name = Keyword.get(opts, :name, GameServer)

    %{
      id: "#{GameServer}_#{name}",
      start: {GameServer, :start_link, [name]},
      shutdown: 10_000,
      restart: :transient
    }
  end

  @doc """
  Start a GameServer with the specified game_code as the name.
  """
  def start_link(name) do
    case GenServer.start_link(GameServer, [], name: via_tuple(name)) do
      {:ok, pid} ->
        {:ok, pid}

      {:error, {:already_started, pid}} ->
        Logger.info(
          "Already started GameServer #{inspect(name)} at #{inspect(pid)}, returning :ignore"
        )

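        :ignore

      # Pass any other start error through instead of raising a CaseClauseError.
      {:error, reason} ->
        {:error, reason}
    end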
  end

  def init(_args) do
    {:ok, nil}
  end

  @doc """
  Return the `:via` tuple for referencing and interacting with a specific
  GameServer.
  """
  def via_tuple(game_code), do: {:via, Horde.Registry, {Tictac.GameRegistry, game_code}}
end
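Because via_tuple/1 returns a {:via, Horde.Registry, ...} name, the process can also be resolved through the standard GenServer API. The snippet below is a hypothetical iex check, not part of the original module. It uses GenServer.whereis/1, which for :via names delegates to Horde.Registry.whereis_name/1, so on a node whose registry lacks the entry it returns nil. That is why messages sent by name from that node can't reach the process.

```elixir
# Hypothetical iex check: resolve the :via name using the current node's
# view of Tictac.GameRegistry.
case GenServer.whereis(Tictac.GameServer.via_tuple("ABCD")) do
  nil ->
    IO.puts("ABCD is not registered in this node's view of the registry")

  pid ->
    # Kernel.node/1 reports which node the process lives on.
    IO.puts("ABCD is running at #{inspect(pid)} on #{node(pid)}")
end
```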

I would expect Horde.Registry.lookup(Tictac.GameRegistry, "ABCD") to return the created GenServer on every node. But even after many minutes, the entry never syncs: the lookup succeeds on the hosting (local) node but not on the other (remote) one.

I did notice that when I kill the node hosting the server, Horde restarts it on the remaining node, and then the lookup works. So clustering itself is working; only the registry lookup is failing for me.

brainlid commented 3 years ago

I found my problem. The application setup wasn't correct.

      {Horde.Registry, [name: Tictac.GameRegistry, keys: :unique, members: :auto]},
      {Horde.DynamicSupervisor, [name: Tictac.DistributedSupervisor, strategy: :one_for_one, members: :auto]},

I was missing members: :auto on Horde.Registry. The documentation I was reading didn't make it clear that the registry needs it too; only when I dug further into the Horde.Registry docs did I realize that it takes the same option.
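To confirm that each Horde process actually sees the other node, its membership can be inspected from iex. This is a minimal sketch. It assumes that Horde.Cluster.members/1 is available in the Horde version in use and that it returns a list of {name, node} tuples; check the Horde.Cluster docs for your version before relying on it.

```elixir
# Run in iex on either node after both nodes have started.
# Assumption: Horde.Cluster.members/1 returns a list of {name, node} tuples.
registry_members = Horde.Cluster.members(Tictac.GameRegistry)
supervisor_members = Horde.Cluster.members(Tictac.DistributedSupervisor)

# With members: :auto on both, each list is expected to name both nodes.
# Without it on the registry, the registry list would be expected to hold
# only the local node, which matches the lookups failing across nodes.
IO.inspect(registry_members, label: "registry members")
IO.inspect(supervisor_members, label: "supervisor members")
```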

So it's working! Yay!

Thank you for the library!