Open x-ji opened 4 years ago
Ah hmm, I wonder if it works to start a Horde.Registry or Horde.DynamicSupervisor and tell it that it will not be part of the cluster.
Right, so this is what happened in this case, and apparently the new Registry/DynamicSupervisor will still try to join the cluster regardless of the static list, which doesn't actually include it.
I guess this is not the intended usage of the static cluster membership. We were trying to use dynamic cluster membership but it didn't work out. Scaling to 4 replicas was also more of a hypothetical test which shouldn't happen in a real k8s cluster with a fixed number of replicas.
Still, I wonder if it would be possible to do something in this case and prevent the new Registry/DynamicSupervisor from joining, or maybe just shut it down if it has a members option which doesn't include itself? Not sure how complicated it would be to implement. Or alternatively, whether it would make sense to mention this scenario in the documentation.
By the way, when I tried to scale down from 4 to 3 again, an (EXIT) no process error similar to https://github.com/derekkraan/horde/issues/202 happened on all 3 of the remaining nodes.
** (stop) exited in: GenServer.stop(Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor, :normal, :infinity)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
(elixir 1.10.2) lib/gen_server.ex:971: GenServer.stop/3
(horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:605: Horde.DynamicSupervisorImpl.shut_down_all_processes/1
(horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:374: Horde.DynamicSupervisorImpl.handle_info/2
(stdlib 3.11.2) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib 3.11.2) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:crdt_update, [{:add, {:member_node_info, {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"}}, %Horde.DynamicSupervisor.Member{name: {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"}, status: :shutting_down}}]}
11:46:18.377 [info] Starting Horde.DynamicSupervisorImpl with name Assistant.Inbox.Sync.Supervisor
11:46:18.371 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.Supervisor' had child 'Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor' started with 'Elixir.Horde.ProcessesSupervisor':start_link([{shutdown,infinity},{root_name,'Elixir.Assistant.Inbox.Sync.Supervisor'},{type,supervisor},{name,...},...]) at <0.9760.0> exit with reason normal in context child_terminated
11:46:18.374 [error] gen_server 'Elixir.Assistant.Inbox.Sync.Supervisor' terminated with reason: no such process or port in call to 'Elixir.GenServer':stop('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', normal, infinity) in 'Elixir.GenServer':stop/3 line 971
11:46:18.376 [error] CRASH REPORT Process 'Elixir.Assistant.Inbox.Sync.Supervisor' with 0 neighbours exited with reason: no such process or port in call to 'Elixir.GenServer':stop('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', normal, infinity) in 'Elixir.GenServer':stop/3 line 971
11:46:18.377 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.Supervisor' had child 'Elixir.Horde.DynamicSupervisorImpl' started with 'Elixir.Horde.DynamicSupervisorImpl':start_link([{name,'Elixir.Assistant.Inbox.Sync.Supervisor'},{root_name,'Elixir.Assistant.Inbox.Sync.Supervisor'},...]) at <0.9758.0> exit with reason no such process or port in call to 'Elixir.GenServer':stop('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', normal, infinity) in context shutdown_error
11:49:02.666 [warn] [libcluster:assistant_service] unable to connect to :"assistant@assistant-service-2.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"
11:49:02.673 [warn] [libcluster:assistant_service] unable to connect to :"assistant@assistant-service-2.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"
11:49:12.673 [warn] [libcluster:assistant_service] unable to connect to :"assistant@assistant-service-2.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"
11:49:12.679 [warn] [libcluster:assistant_service] unable to connect to :"assistant@assistant-service-2.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"
So I guess this scenario is probably something unexpected for Horde.
I think you're right that this should at the very least be included in the documentation.
I suppose it would be possible to check whether an instance of Horde.Registry was in its own list of members (I guess by ensuring that at least one of the members resolved to self() in the presence of Process.whereis/1 or equivalent). What would be the correct behaviour if the condition was not met? Raising an error?
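The check discussed above could look roughly like the following. This is only a sketch in plain Elixir, not Horde's actual implementation: the module name SelfMembershipSketch, the validate/3 function, and the :not_in_members error shape are all hypothetical, and it assumes members are given either as {name, node} tuples or as bare names that refer to the local node.

```elixir
defmodule SelfMembershipSketch do
  @moduledoc """
  Hypothetical helper (not part of Horde): checks whether a static
  members list contains the instance that is about to start.
  """

  # Function head carrying the default: check against the local node.
  def validate(members, name, current_node \\ Node.self())

  # With automatic membership there is no static list to check.
  def validate(:auto, _name, _current_node), do: :ok

  def validate(members, name, current_node) when is_list(members) do
    if Enum.any?(members, &member_matches?(&1, name, current_node)) do
      :ok
    else
      {:error, {:not_in_members, {name, current_node}}}
    end
  end

  # A {name, node} tuple must match both the name and the node.
  defp member_matches?({member_name, member_node}, name, current_node) do
    member_name == name and member_node == current_node
  end

  # Assumption: a bare name refers to a process on the local node.
  defp member_matches?(member, name, _current_node) when is_atom(member) do
    member == name
  end
end
```

Whether a failed check should raise, log a warning, or simply stop the process is exactly the open question here; returning an error tuple leaves that decision to the caller.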
Regarding the (EXIT) no process error on scale-down: I believe this is a different scenario from #202. At least, the stacktrace does not match. I have looked into this before, but couldn't find anything obvious. I hope this isn't happening to people on a regular basis; it should be possible to reduce the size of your Horde cluster without the whole thing falling apart.
We are now trying to use static cluster membership, since the dynamic cluster membership seems to be causing issues when one k8s pod becomes temporarily invisible, probably due to some automatic k8s maintenance operations (we're using libcluster's Kubernetes.DNSSRV strategy).
The members argument is specified as a list. The setup is similar for the DynamicSupervisor module.
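For readers unfamiliar with static membership, a setup of this kind might look roughly like the sketch below. It is an illustration rather than the configuration from this report: the StaticMembersSketch module, its function names, and App.Module.Supervisor are hypothetical, and the node name pattern is taken from the members entry quoted later in this issue.

```elixir
defmodule StaticMembersSketch do
  @moduledoc """
  Hypothetical helper: builds a static members list for a StatefulSet
  with a fixed number of replicas, and child specs that use it.
  """

  # One {name, node} tuple per expected replica (app-service-0, -1, ...).
  def members(name, namespace, replicas \\ 3) do
    for i <- 0..(replicas - 1) do
      {name, :"app@app-service-#{i}.app-service-headless.#{namespace}.svc.cluster.local"}
    end
  end

  # Child spec for the registry with a fixed members list.
  def registry_child_spec(namespace) do
    {Horde.Registry,
     [
       name: App.Module.Registry,
       keys: :unique,
       members: members(App.Module.Registry, namespace)
     ]}
  end

  # Child spec for the supervisor; App.Module.Supervisor is an assumed name.
  def supervisor_child_spec(namespace) do
    {Horde.DynamicSupervisor,
     [
       name: App.Module.Supervisor,
       strategy: :one_for_one,
       distribution_strategy: Horde.UniformQuorumDistribution,
       members: members(App.Module.Supervisor, namespace)
     ]}
  end
end
```

Both child specs would then go into the application's supervision tree. Note that a fourth pod running the same code would start with the same three-entry list, which does not include its own node.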
We have a stateful set deployment with 3 replicas. When I try to scale down the pods with kubectl scale statefulset service --replicas=2, things seem to work as expected. If I call Horde.Cluster.members(App.Module.Registry), I still see the original list.
However, if I try to scale up with kubectl scale statefulset service --replicas=4, the Registry spun up on the new node still seems to join the cluster for some reason. If I run Horde.Cluster.members(App.Module.Registry), I see the extra entry {App.Module.Registry, :"app@app-service-3.app-service-headless.#{namespace}.svc.cluster.local"}.
Interestingly, even if I scale back to 3 again, that extra Registry remains in the members list, while the extra DynamicSupervisor is gone.
Is this the expected behavior? From the documentation, I thought that Horde should only try to find the members listed in the static list, and not try to add new members to that list. I would expect the Registry and DynamicSupervisor on the fourth node to be ignored.
We're using the Horde.UniformQuorumDistribution strategy for the Supervisor, though I feel that should be irrelevant to the membership issue.