Horde 0.7.1, Elixir 1.9.2, libcluster 3.1.1, in a two-node cluster deployed to ECS.
I've been consistently seeing a weird issue that occurs anytime a node gets removed from my Horde cluster (the terminating node calls :init.stop(), the other nodes don't get advance warning otherwise). On the terminating node, there's this in the logs right before it shuts down:
[2019-11-10T23:37:14-08:00] 07:37:14.025 [error] GenServer #PID<0.4502.0> terminating
[2019-11-10T23:37:14-08:00] ** (stop) exited in: GenServer.call(Kiwi.DistributedSupervisor, :horde_shutting_down, 5000)
[2019-11-10T23:37:14-08:00] ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
[2019-11-10T23:37:14-08:00] (elixir) lib/gen_server.ex:999: GenServer.call/3
[2019-11-10T23:37:14-08:00] (horde) lib/horde/signal_shutdown.ex:21: anonymous fn/1 in Horde.SignalShutdown.terminate/2
[2019-11-10T23:37:14-08:00] (elixir) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
[2019-11-10T23:37:14-08:00] (elixir) lib/enum.ex:783: Enum.each/2
[2019-11-10T23:37:14-08:00] (stdlib) gen_server.erl:673: :gen_server.try_terminate/3
[2019-11-10T23:37:14-08:00] (stdlib) gen_server.erl:858: :gen_server.terminate/10
[2019-11-10T23:37:14-08:00] (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
[2019-11-10T23:37:14-08:00] Last message: {:EXIT, #PID<0.4115.0>, :shutdown}
[2019-11-10T23:37:14-08:00] 07:37:14.025 [info] Starting Horde.DynamicSupervisorImpl with name Kiwi.DistributedSupervisor
and on the other (alive) node in the cluster:
[2019-11-10T23:37:23-08:00] 07:37:23.985 [info] [libcluster:kiwi] disconnected from :"kiwi@10.0.1.252"
[2019-11-10T23:37:24-08:00] 07:37:24.005 [error] GenServer #PID<0.4120.0> terminating
[2019-11-10T23:37:24-08:00] ** (stop) exited in: GenServer.call(Kiwi.DistributedSupervisor, :horde_shutting_down, 5000)
[2019-11-10T23:37:24-08:00] ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
[2019-11-10T23:37:24-08:00] (elixir) lib/gen_server.ex:999: GenServer.call/3
[2019-11-10T23:37:24-08:00] (horde) lib/horde/signal_shutdown.ex:21: anonymous fn/1 in Horde.SignalShutdown.terminate/2
[2019-11-10T23:37:24-08:00] (elixir) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
[2019-11-10T23:37:24-08:00] (elixir) lib/enum.ex:783: Enum.each/2
[2019-11-10T23:37:24-08:00] (stdlib) gen_server.erl:673: :gen_server.try_terminate/3
[2019-11-10T23:37:24-08:00] (stdlib) gen_server.erl:858: :gen_server.terminate/10
[2019-11-10T23:37:24-08:00] (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
[2019-11-10T23:37:24-08:00] Last message: {:EXIT, #PID<0.4115.0>, :shutdown}
[2019-11-10T23:37:24-08:00] 07:37:24.005 [error] GenServer #PID<0.4122.0> terminating
[2019-11-10T23:37:24-08:00] ** (stop) exited in: GenServer.call(Kiwi.DistributedSupervisor, {:set_members, [{Kiwi.DistributedSupervisor, :"kiwi@10.0.3.247"}, {Kiwi.DistributedSupervisor, :"kiwi@10.0.1.252"}]}, 5000)
[2019-11-10T23:37:24-08:00] ** (EXIT) exited in: GenServer.call(Kiwi.DistributedSupervisor.ProcessesSupervisor, {:start_child, {195127511086500842897340379111815543100, {Kiwi.Periodic.CleanSessions, :start_link, []}, :permanent, 5000, :worker, [Kiwi.Periodic.CleanSessions]}}, :infinity)
[2019-11-10T23:37:24-08:00] ** (EXIT) shutdown
[2019-11-10T23:37:24-08:00] (elixir) lib/gen_server.ex:1009: GenServer.call/3
[2019-11-10T23:37:24-08:00] (kiwi) lib/kiwi/distributed_listener.ex:32: Kiwi.DistributedListener.set_members/1
[2019-11-10T23:37:24-08:00] (kiwi) lib/kiwi/distributed_listener.ex:13: Kiwi.DistributedListener.handle_info/2
[2019-11-10T23:37:24-08:00] (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
[2019-11-10T23:37:24-08:00] (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
[2019-11-10T23:37:24-08:00] (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
[2019-11-10T23:37:24-08:00] Last message: {:nodeup, :"kiwi@10.0.1.252", [node_type: :visible]}
[2019-11-10T23:37:24-08:00] 07:37:24.005 [info] Starting Horde.DynamicSupervisorImpl with name Kiwi.DistributedSupervisor
Kiwi.DistributedSupervisor is just a named DynamicSupervisor and Kiwi.DistributedListener is based on the example in the docs and calls set_members in response to nodeup/nodedown messages.
It seems after this situation everything gets back to a normal state right away, and things recover well on the remaining node, but the errors in the logs are tripping up monitoring alarms, so just wanted to see if there's something I might be doing wrong or if there's a bug.
Horde 0.7.1, Elixir 1.9.2, libcluster 3.1.1, in a two-node cluster deployed to ECS.
I've been consistently seeing a weird issue that occurs anytime a node gets removed from my Horde cluster (the terminating node calls :init.stop(), the other nodes don't get advance warning otherwise). On the terminating node, there's this in the logs right before it shuts down:
and on the other (alive) node in the cluster:
Kiwi.DistributedSupervisor
is just a named DynamicSupervisor andKiwi.DistributedListener
is based on the example in the docs and callsset_members
in response to nodeup/nodedown messages.It seems after this situation everything gets back to a normal state right away, and things recover well on the remaining node, but the errors in the logs are tripping up monitoring alarms, so just wanted to see if there's something I might be doing wrong or if there's a bug.
Thank you!