derekkraan / horde

Horde is a distributed Supervisor and Registry backed by DeltaCrdt
MIT License
1.32k stars 106 forks source link

Distributed supervisor crashing on cluster changes #180

Closed dkulchenko closed 4 years ago

dkulchenko commented 5 years ago

Horde 0.7.1, Elixir 1.9.2, libcluster 3.1.1, in a two-node cluster deployed to ECS.

I've been consistently seeing a weird issue that occurs anytime a node gets removed from my Horde cluster (the terminating node calls :init.stop(), the other nodes don't get advance warning otherwise). On the terminating node, there's this in the logs right before it shuts down:

[2019-11-10T23:37:14-08:00] 07:37:14.025 [error] GenServer #PID<0.4502.0> terminating
[2019-11-10T23:37:14-08:00] ** (stop) exited in: GenServer.call(Kiwi.DistributedSupervisor, :horde_shutting_down, 5000)
[2019-11-10T23:37:14-08:00] ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
[2019-11-10T23:37:14-08:00] (elixir) lib/gen_server.ex:999: GenServer.call/3
[2019-11-10T23:37:14-08:00] (horde) lib/horde/signal_shutdown.ex:21: anonymous fn/1 in Horde.SignalShutdown.terminate/2
[2019-11-10T23:37:14-08:00] (elixir) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
[2019-11-10T23:37:14-08:00] (elixir) lib/enum.ex:783: Enum.each/2
[2019-11-10T23:37:14-08:00] (stdlib) gen_server.erl:673: :gen_server.try_terminate/3
[2019-11-10T23:37:14-08:00] (stdlib) gen_server.erl:858: :gen_server.terminate/10
[2019-11-10T23:37:14-08:00] (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
[2019-11-10T23:37:14-08:00] Last message: {:EXIT, #PID<0.4115.0>, :shutdown}
[2019-11-10T23:37:14-08:00] 07:37:14.025 [info] Starting Horde.DynamicSupervisorImpl with name Kiwi.DistributedSupervisor

and on the other (alive) node in the cluster:

[2019-11-10T23:37:23-08:00] 07:37:23.985 [info] [libcluster:kiwi] disconnected from :"kiwi@10.0.1.252"
[2019-11-10T23:37:24-08:00] 07:37:24.005 [error] GenServer #PID<0.4120.0> terminating
[2019-11-10T23:37:24-08:00] ** (stop) exited in: GenServer.call(Kiwi.DistributedSupervisor, :horde_shutting_down, 5000)
[2019-11-10T23:37:24-08:00] ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
[2019-11-10T23:37:24-08:00] (elixir) lib/gen_server.ex:999: GenServer.call/3
[2019-11-10T23:37:24-08:00]   (horde) lib/horde/signal_shutdown.ex:21: anonymous fn/1 in Horde.SignalShutdown.terminate/2
[2019-11-10T23:37:24-08:00]    (elixir) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
[2019-11-10T23:37:24-08:00]    (elixir) lib/enum.ex:783: Enum.each/2
[2019-11-10T23:37:24-08:00]    (stdlib) gen_server.erl:673: :gen_server.try_terminate/3
[2019-11-10T23:37:24-08:00]    (stdlib) gen_server.erl:858: :gen_server.terminate/10
[2019-11-10T23:37:24-08:00]     (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
[2019-11-10T23:37:24-08:00]   Last message: {:EXIT, #PID<0.4115.0>, :shutdown}
[2019-11-10T23:37:24-08:00]   07:37:24.005 [error] GenServer #PID<0.4122.0> terminating
[2019-11-10T23:37:24-08:00]   ** (stop) exited in: GenServer.call(Kiwi.DistributedSupervisor, {:set_members, [{Kiwi.DistributedSupervisor, :"kiwi@10.0.3.247"}, {Kiwi.DistributedSupervisor, :"kiwi@10.0.1.252"}]}, 5000)
[2019-11-10T23:37:24-08:00]   ** (EXIT) exited in: GenServer.call(Kiwi.DistributedSupervisor.ProcessesSupervisor, {:start_child, {195127511086500842897340379111815543100, {Kiwi.Periodic.CleanSessions, :start_link, []}, :permanent, 5000, :worker, [Kiwi.Periodic.CleanSessions]}}, :infinity)
[2019-11-10T23:37:24-08:00]    ** (EXIT) shutdown
[2019-11-10T23:37:24-08:00]     (elixir) lib/gen_server.ex:1009: GenServer.call/3
[2019-11-10T23:37:24-08:00]     (kiwi) lib/kiwi/distributed_listener.ex:32: Kiwi.DistributedListener.set_members/1
[2019-11-10T23:37:24-08:00]     (kiwi) lib/kiwi/distributed_listener.ex:13: Kiwi.DistributedListener.handle_info/2
[2019-11-10T23:37:24-08:00]    (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
[2019-11-10T23:37:24-08:00]    (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
[2019-11-10T23:37:24-08:00]    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
[2019-11-10T23:37:24-08:00]   Last message: {:nodeup, :"kiwi@10.0.1.252", [node_type: :visible]}
[2019-11-10T23:37:24-08:00]  07:37:24.005 [info] Starting Horde.DynamicSupervisorImpl with name Kiwi.DistributedSupervisor

Kiwi.DistributedSupervisor is just a named DynamicSupervisor and Kiwi.DistributedListener is based on the example in the docs and calls set_members in response to nodeup/nodedown messages.

It seems after this situation everything gets back to a normal state right away, and things recover well on the remaining node, but the errors in the logs are tripping up monitoring alarms, so just wanted to see if there's something I might be doing wrong or if there's a bug.

Thank you!

dkulchenko commented 4 years ago

This was my fault for not reading the docs carefully enough - I needed to properly handle {:error, :already_started} in my GenServers.