derekkraan / horde

Horde is a distributed Supervisor and Registry backed by DeltaCrdt
MIT License
1.3k stars 102 forks source link

[BUG]: Horde Supervisor behaves one_for_all no matter what #269

Closed gmdorf closed 3 months ago

gmdorf commented 3 months ago

According to Supervisor docs, strategy: :one_for_one with automatic_shutdown: :never (the default, made explicit for the sake of illustration) should mean that if one child process stops, the rest of the children and the supervisor should continue running.

Assume that we have this minimal GenServer module defined:

defmodule BasicGenServer do
  use GenServer

  def start_link(arg) do
    GenServer.start_link(__MODULE__, arg)
  end

  @impl true
  def init(init_arg) do
    {:ok, init_arg}
  end

  @impl true
  def handle_info(:stop_normal, state) do
    {:stop, :normal, state}
  end
end

Running this basic test with DynamicSupervisor succeeds as expected:

{:ok, sup} = DynamicSupervisor.start_link([strategy: :one_for_one, name: TestSup, automatic_shutdown: :never])

{:ok, child1} = DynamicSupervisor.start_child(sup, %{id: :child1, start: {BasicGenServer, :start_link, [0]}})
{:ok, child2} = DynamicSupervisor.start_child(sup, %{id: :child2, start: {BasicGenServer, :start_link, [0]}})
{:ok, child3} = DynamicSupervisor.start_child(sup, %{id: :child3, start: {BasicGenServer, :start_link, [0]}})

:ok = Process.send(child3, :stop_normal, [])
IO.puts("Sent :stop_normal message to child of TestSup")
children = DynamicSupervisor.which_children(sup)

if Enum.count(children, fn {_id, _pid, _type, mod} -> [BasicGenServer] == mod end) != 3 do
  raise "Not 3 BasicGenServers for TestSup"
else
  IO.puts("3 children for DynamicSTestSupupervisor!")
end

But running the exact same test with DynamicSupervisor replaced with Horde.DynamicSupervisor crashes as soon as we send the :stop_normal message (i.e. "Sent :stop_normal message to child of TestHordeSup" is never printed):

{:ok, sup} = Horde.DynamicSupervisor.start_link([strategy: :one_for_one, name: TestHordeSup, automatic_shutdown: :never])

{:ok, child1} = Horde.DynamicSupervisor.start_child(sup, %{id: :child1, start: {BasicGenServer, :start_link, [0]}})
{:ok, child2} = Horde.DynamicSupervisor.start_child(sup, %{id: :child2, start: {BasicGenServer, :start_link, [0]}})
{:ok, child3} = Horde.DynamicSupervisor.start_child(sup, %{id: :child3, start: {BasicGenServer, :start_link, [0]}})

:ok = Process.send(child3, :stop_normal, [])
IO.puts("Sent :stop_normal message to child of TestHordeSup")
children = Horde.DynamicSupervisor.which_children(sup)

if Enum.count(children, fn {_id, _pid, _type, mod} -> [BasicGenServer] == mod end) != 3 do
  raise "Not 3 BasicGenServers for TestHordeSup"
else
  IO.puts("3 children for TestHordeSup!")
end
gmdorf commented 3 months ago

@derekkraan I am sure that I am doing something wrong here, but this issue/my mistake is my blocking my team in production.

gmdorf commented 3 months ago

This could be related maybe? https://github.com/derekkraan/horde/issues/219

derekkraan commented 3 months ago

Hi, can you put your failing test in a test in a branch and link it here?

gmdorf commented 3 months ago

I will put it in a branch within a second.

It seems to be an issue of names vs pids. If change the test to never refer to the pid for the DynamicSupervisor or Horde.DynamicSupervisor, but instead to the name, it passes.

defmodule BasicGenServer do
  use GenServer

  def start_link(arg) do
    GenServer.start_link(__MODULE__, arg)
  end

  @impl true
  def init(init_arg) do
    {:ok, init_arg}
  end

  @impl true
  def handle_info(:stop_normal, state) do
    {:stop, :normal, state}
  end
end

{:ok, _sup} = DynamicSupervisor.start_link([strategy: :one_for_one, name: TestSup, automatic_shutdown: :never])

{:ok, child1} = DynamicSupervisor.start_child(TestSup, %{id: :child1, start: {BasicGenServer, :start_link, [0]}})
{:ok, child2} = DynamicSupervisor.start_child(TestSup, %{id: :child2, start: {BasicGenServer, :start_link, [0]}})
{:ok, child3} = DynamicSupervisor.start_child(TestSup, %{id: :child3, start: {BasicGenServer, :start_link, [0]}})

Process.exit(child1, :kill)
IO.puts("Sent :stop_normal message to child of TestSup")
children = DynamicSupervisor.which_children(TestSup)

if Enum.count(children, fn {_id, _pid, _type, mod} -> [BasicGenServer] == mod end) != 3 do
  raise "Not 3 BasicGenServers for TestSup"
else
  IO.inspect(children)
  IO.puts("3 children for TestSup!")
end

{:ok, _sup} = Horde.DynamicSupervisor.start_link([strategy: :one_for_one, name: HordeTestSup, automatic_shutdown: :never])

{:ok, child1} = Horde.DynamicSupervisor.start_child(HordeTestSup, %{id: :child1, start: {BasicGenServer, :start_link, [0]}})
{:ok, child2} = Horde.DynamicSupervisor.start_child(HordeTestSup, %{id: :child2, start: {BasicGenServer, :start_link, [0]}})
{:ok, child3} = Horde.DynamicSupervisor.start_child(HordeTestSup, %{id: :child3, start: {BasicGenServer, :start_link, [0]}})

Process.exit(child1, :kill)
IO.puts("Sent :stop_normal message to child of TestHordeSup")
children = Horde.DynamicSupervisor.which_children(HordeTestSup)

if Enum.count(children, fn {_id, _pid, _type, mod} -> [BasicGenServer] == mod end) != 3 do
  raise "Not 3 BasicGenServers for TestHordeSup"
else
  IO.puts("3 children for TestHordeSup!")
end

Will add this to branch as well.

gmdorf commented 3 months ago

@derekkraan https://github.com/derekkraan/horde/pull/270

gmdorf commented 3 months ago

Could our discussion on our ElixirForum here be relevant? https://elixirforum.com/t/horde-dynamicsupervisor-terminate-child-is-not-working/61558/3?u=gavid

gmdorf commented 3 months ago

I have added the equivalent tests to the branch. Please let me know if you see anything you need improved or if certain tests are meant to fail e.g. per our discussion linked in previous message

derekkraan commented 3 months ago

Hi, I have looked at your tests and I understand now what the issue is. Horde.DynamicSupervisor only supports being called by name, not by the pid returned by Horde.DynamicSupervisor.start_link/3.

The reason for this is that Horde.DynamicSupervisor.start_link/3 starts a Supervisor with a number of children, one of them being the Horde.DynamicSupervisorImpl, which is where the name is registered to. The pid returned is really only meant to be used by the supervision tree. Usually you will start Horde.DynamicSupervisor in a supervision tree, and never see the pid.

So when you use the pid returned to start a new child, you are not running that child under Horde.DynamicSupervisor, but under a regular supervisor in the local supervision tree.

It's not possible to resolve this inconsistency without a major rewrite of the lib.

I hope this clarifies things. If you have any suggestions on how to improve the docs to make this more clear, I would consider a PR for that.