derekkraan / horde

Horde is a distributed Supervisor and Registry backed by DeltaCrdt
MIT License

Horde.DynamicSupervisor in active process redistribution clashes with Horde.Registry name conflicts #213

Open balena opened 3 years ago

balena commented 3 years ago

I've been using Horde.DynamicSupervisor along with Horde.Registry, where the child id passed to the dynamic supervisor is unique and the process is also registered under a unique name, using a :via tuple.
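For context, the setup looks roughly like this. This is a minimal sketch with placeholder module names (`MyApp.*`), not the actual application code; the real child specs are visible in the crash log below.

```elixir
# Hypothetical illustration of the setup described above.
# Module names and the registered name are placeholders.

# Each process is registered in a Horde.Registry under a unique name:
via_name = {:via, Horde.Registry, {MyApp.Registry, {:fs_prober, "host-064"}}}

# ...and supervised by a Horde.DynamicSupervisor with a unique child id:
child_spec = %{
  id: make_ref(),
  restart: :transient,
  start: {MyApp.Prober, :start_link, [[name: via_name]]}
}

Horde.DynamicSupervisor.start_child(MyApp.Supervisor, child_spec)
```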

The cluster's nodes and membership change from time to time, and I've started to observe what I believe is a race condition between {:shutdown, :process_redistribution} and :name_conflict messages. The sequence looks like this:

17:37:39.768 [info] Starting Horde.RegistryImpl with name UsageTracker.Cluster.Registry
17:37:39.769 [info] Starting Horde.DynamicSupervisorImpl with name UsageTracker.Cluster.Supervisor
...
17:37:49.492 [info] [fs_prober<0.3764.0>] name conflict, shutdown
17:37:49.492 [info] [service_discovery<0.3761.0>] name conflict, shutdown
17:37:49.493 [info] [fs_prober<0.3770.0>] name conflict, shutdown
17:37:49.499 [error] GenServer UsageTracker.Cluster.Supervisor terminating
** (stop) exited in: GenServer.call(UsageTracker.Cluster.Supervisor.ProcessesSupervisor, {:send_exit_signal, #PID<0.3770.0>, {:shutdown, :process_redistribution}}, 5000)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.10.4) lib/gen_server.ex:1013: GenServer.call/3
    (horde 0.8.2) lib/horde/dynamic_supervisor_impl.ex:687: Horde.DynamicSupervisorImpl.handoff_child/2
    (elixir 1.10.4) lib/enum.ex:2111: Enum."-reduce/3-lists^foldl/2-0-"/3
    (horde 0.8.2) lib/horde/dynamic_supervisor_impl.ex:376: Horde.DynamicSupervisorImpl.handle_info/2
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {:crdt_update, [{:add, {:process, 106133835503947574045779105170090656072}, {{UsageTracker.Cluster.Supervisor, :"usage_tracker@10.15.170.33"}, %{id: 106133835503947574045779105170090656072, restart: :transient, start: {UsageTracker.Prober, :start_link, [[event_listeners: [{"CHANNEL_HANGUP_COMPLETE", UsageTracker.Aggregators.MOS}], pubsub: :pubsub, group: :global, target: "10.15.169.4", node: "tel-sy1-ibm-prod-064", datacenter: "sy1-prod", name: {:via, UsageTracker.Cluster, {:fs_prober, "tel-sy1-ibm-prod-064", "sy1-prod", "10.15.169.4"}}]]}}, #PID<38772.5932.0>}}, {:add, {:member_node_info, {UsageTracker.Cluster.Supervisor, :"usage_tracker@10.15.170.33"}}, %Horde.DynamicSupervisor.Member{name: {UsageTracker.Cluster.Supervisor, :"usage_tracker@10.15.170.33"}, status: :alive}}, {:add, {:member, {UsageTracker.Cluster.Supervisor, :"usage_tracker@10.15.170.33"}}, 1}, {:add, {:process, 298939904021508651051355589863370013820}, {{UsageTracker.Cluster.Supervisor, :"usage_tracker@10.15.170.33"}, %{id: 298939904021508651051355589863370013820, restart: :transient, start: {UsageTracker.Prober, :start_link, [[event_listeners: [{"CHANNEL_HANGUP_COMPLETE", UsageTracker.Aggregators.MOS}], pubsub: :pubsub, group: :global, target: "10.15.170.4", node: "tel-sy1-ibm-prod-065", datacenter: "sy1-prod", name: {:via, UsageTracker.Cluster, {:fs_prober, "tel-sy1-ibm-prod-065", "sy1-prod", "10.15.170.4"}}]]}}, #PID<38772.5938.0>}}, {:add, {:process, 158968187617642545362566256197371154665}, {{UsageTracker.Cluster.Supervisor, :"usage_tracker@10.15.170.33"}, %{id: 158968187617642545362566256197371154665, start: {UsageTracker.ServiceDiscovery, :start_link, [[consul: "consul.service.sy1-prod.consul:28500", service: "b2bua", prober_starter: UsageTracker.Prober.Starter, name: {:via, UsageTracker.Cluster, {:service_discovery, "sy1-prod", "consul.service.sy1-prod.consul:28500", "b2bua"}}]]}}, #PID<38772.5929.0>}}]}
17:37:49.501 [error] GenServer #PID<0.3693.0> terminating
** (stop) exited in: GenServer.call(UsageTracker.Cluster.Supervisor, :horde_shutting_down, 5000)
    ** (EXIT) exited in: GenServer.call(UsageTracker.Cluster.Supervisor.ProcessesSupervisor, {:send_exit_signal, #PID<0.3770.0>, {:shutdown, :process_redistribution}}, 5000)
        ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.10.4) lib/gen_server.ex:1023: GenServer.call/3
    (horde 0.8.2) lib/horde/signal_shutdown.ex:21: anonymous fn/1 in Horde.SignalShutdown.terminate/2
    (elixir 1.10.4) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
    (stdlib 3.13) gen_server.erl:718: :gen_server.try_terminate/3
    (stdlib 3.13) gen_server.erl:903: :gen_server.terminate/10
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {:EXIT, #PID<0.3689.0>, :shutdown}
17:37:49.502 [info] Starting Horde.DynamicSupervisorImpl with name UsageTracker.Cluster.Supervisor

In the above sequence, the line name conflict, shutdown is a simple log statement inside handle_info({:EXIT, _pid, {:name_conflict, ...}}, state), which stops the process with {:stop, :shutdown, state}. So the process stops, and immediately afterwards the dynamic supervisor attempts to send an exit signal to a process that no longer exists.
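The handler in question follows this general pattern (a sketch, not the exact application code; the precise shape of the :name_conflict reason tuple may differ from what is shown here):

```elixir
# Sketch of the :name_conflict handler described above. When
# Horde.Registry resolves a duplicate registration, the losing
# process (which traps exits) receives an EXIT signal whose reason
# carries :name_conflict; this handler logs it and stops the process.
def handle_info({:EXIT, _pid, {:name_conflict, {name, _value}, _registry, _winner}}, state) do
  Logger.info("[#{inspect(name)}] name conflict, shutdown")
  {:stop, :shutdown, state}
end
```

Once this returns {:stop, :shutdown, state}, the pid is dead, so the supervisor's subsequent {:send_exit_signal, pid, {:shutdown, :process_redistribution}} call targets a process that has already exited.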

Is this behaviour by design, or is it a bug?

derekkraan commented 3 years ago

This seems like the same issue that caused us to randomize child_ids. I'm not sure this feature (active redistribution from the POV of Horde.DynamicSupervisor) can ever work properly. I might remove the feature.