derekkraan / horde

Horde is a distributed Supervisor and Registry backed by DeltaCrdt
MIT License

Static cluster membership set, but when a new node outside of the list joins, its Registry and DynamicSupervisor still joins the cluster? #210

Open x-ji opened 3 years ago

x-ji commented 3 years ago

We are now trying to use static cluster membership, since dynamic cluster membership seems to cause issues when a k8s pod becomes temporarily invisible, probably due to automatic k8s maintenance operations (we're using libcluster's Kubernetes.DNSSRV strategy).
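For context, a libcluster topology using that strategy looks roughly like the following sketch (the topology key, service, and application names are placeholders matching the node names below, not our actual config):

```elixir
# config/runtime.exs (sketch; names are placeholders)
config :libcluster,
  topologies: [
    app_service: [
      strategy: Cluster.Strategy.Kubernetes.DNSSRV,
      config: [
        service: "app-service-headless",
        application_name: "app",
        namespace: System.fetch_env!("NAMESPACE"),
        polling_interval: 5_000
      ]
    ]
  ]
```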

The members argument is specified as a list:

      [
        {App.Module.Registry,
         :"app@app-service-0.app-service-headless.#{namespace}.svc.cluster.local"},
        {App.Module.Registry,
         :"app@app-service-1.app-service-headless.#{namespace}.svc.cluster.local"},
        {App.Module.Registry,
         :"app@app-service-2.app-service-headless.#{namespace}.svc.cluster.local"}
      ]

The setup is similar for the DynamicSupervisor module.

We have a stateful set deployment with 3 replicas. When I scale down the pods with kubectl scale statefulset service --replicas=2, things seem to work as expected. If I call Horde.Cluster.members(App.Module.Registry), I still see the original list.

However, if I scale up with kubectl scale statefulset service --replicas=4, the Registry spun up on the new node still joins the cluster for some reason. If I run Horde.Cluster.members(App.Module.Registry), I see the extra entry {App.Module.Registry, :"app@app-service-3.app-service-headless.#{namespace}.svc.cluster.local"}.

Interestingly, even if I scale back to 3 again, that extra Registry remains in the members list, while the extra DynamicSupervisor is gone.

Is this the expected behavior? From the documentation, I understood that Horde should only try to find the members in the static list, and not add new members to that list. I would expect the Registry and DynamicSupervisor on the fourth node to be ignored.

We're using the Horde.UniformQuorumDistribution strategy for the Supervisor, though I feel that should be irrelevant to the membership issue.
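For completeness, the supervisor side of such a setup would be started along these lines (a sketch; `supervisor_members` stands in for a static list built the same way as the Registry's above):

```elixir
# Sketch: Horde.DynamicSupervisor with quorum-based distribution and static members.
{Horde.DynamicSupervisor,
 name: App.Module.Supervisor,
 strategy: :one_for_one,
 distribution_strategy: Horde.UniformQuorumDistribution,
 members: supervisor_members}
```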

derekkraan commented 3 years ago

Ah hmm, I wonder if it works to start a Horde.Registry or Horde.DynamicSupervisor and tell it that it will not be part of the cluster.

x-ji commented 3 years ago

Right, so this is what happened in this case: apparently the new Registry/DynamicSupervisor still tries to join the cluster regardless of the static list, which doesn't actually include it.

I guess this is not the intended usage of the static cluster membership. We were trying to use dynamic cluster membership but it didn't work out. Scaling to 4 replicas was also more of a hypothetical test which shouldn't happen in a real k8s cluster with a fixed number of replicas.

Still, I wonder if it would be possible to do something in this case and prevent the new Registry/DynamicSupervisor from joining, or maybe just shut it down if its members option doesn't include itself? I'm not sure how complicated that would be to implement. Alternatively, it might make sense to mention this scenario in the documentation.
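One way to approximate this at the application level today (outside of Horde itself) would be a startup guard that only starts the Horde children when the local node appears in the static members list. A rough sketch, assuming the same module and list shape as above (`App.HordeGuard` is a hypothetical helper, not part of Horde's API):

```elixir
# Sketch: skip starting Horde children on nodes outside the static members list.
# `members` is the same {name, node} list passed to Horde's :members option.
defmodule App.HordeGuard do
  def horde_children(members) do
    member_nodes = for {_name, node} <- members, do: node

    if Node.self() in member_nodes do
      [
        {Horde.Registry, name: App.Module.Registry, keys: :unique, members: members}
      ]
    else
      # This node is not a configured member; start nothing so it cannot join.
      []
    end
  end
end
```

The returned list would then be spliced into the application's `children` before `Supervisor.start_link/2`.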

x-ji commented 3 years ago

By the way, when I tried to scale down from 4 to 3 again, an (EXIT) no process error similar to https://github.com/derekkraan/horde/issues/202 happened on all 3 of the remaining nodes.

** (stop) exited in: GenServer.stop(Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor, :normal, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.10.2) lib/gen_server.ex:971: GenServer.stop/3
    (horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:605: Horde.DynamicSupervisorImpl.shut_down_all_processes/1
    (horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:374: Horde.DynamicSupervisorImpl.handle_info/2
    (stdlib 3.11.2) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib 3.11.2) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:crdt_update, [{:add, {:member_node_info, {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"}}, %Horde.DynamicSupervisor.Member{name: {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"}, status: :shutting_down}}]}
11:46:18.377 [info] Starting Horde.DynamicSupervisorImpl with name Assistant.Inbox.Sync.Supervisor
11:46:18.371 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.Supervisor' had child 'Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor' started with 'Elixir.Horde.ProcessesSupervisor':start_link([{shutdown,infinity},{root_name,'Elixir.Assistant.Inbox.Sync.Supervisor'},{type,supervisor},{name,...},...]) at <0.9760.0> exit with reason normal in context child_terminated
11:46:18.374 [error] gen_server 'Elixir.Assistant.Inbox.Sync.Supervisor' terminated with reason: no such process or port in call to 'Elixir.GenServer':stop('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', normal, infinity) in 'Elixir.GenServer':stop/3 line 971
11:46:18.376 [error] CRASH REPORT Process 'Elixir.Assistant.Inbox.Sync.Supervisor' with 0 neighbours exited with reason: no such process or port in call to 'Elixir.GenServer':stop('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', normal, infinity) in 'Elixir.GenServer':stop/3 line 971
11:46:18.377 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.Supervisor' had child 'Elixir.Horde.DynamicSupervisorImpl' started with 'Elixir.Horde.DynamicSupervisorImpl':start_link([{name,'Elixir.Assistant.Inbox.Sync.Supervisor'},{root_name,'Elixir.Assistant.Inbox.Sync.Supervisor'},...]) at <0.9758.0> exit with reason no such process or port in call to 'Elixir.GenServer':stop('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', normal, infinity) in context shutdown_error
11:49:02.666 [warn] [libcluster:assistant_service] unable to connect to :"assistant@assistant-service-2.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"
11:49:02.673 [warn] [libcluster:assistant_service] unable to connect to :"assistant@assistant-service-2.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"
11:49:12.673 [warn] [libcluster:assistant_service] unable to connect to :"assistant@assistant-service-2.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"
11:49:12.679 [warn] [libcluster:assistant_service] unable to connect to :"assistant@assistant-service-2.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"

So I guess this scenario is probably something unexpected for Horde.

derekkraan commented 3 years ago

I think you're right that this should at the very least be included in the documentation.

I suppose it would be possible to check whether an instance of Horde.Registry is in its own list of members (I guess by checking that at least one of the members resolves to self(), via Process.whereis/1 or equivalent). What would be the correct behaviour if the condition was not met? Raising an error?
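The check being described could look roughly like this when run by the member process itself: resolve each local member name and see whether any of them is the current process (a hypothetical helper for illustration, not existing Horde code):

```elixir
# Sketch of the proposed self-membership check (hypothetical, not Horde's code).
# A member is {registered_name, node}; on the local node the registered name
# can be resolved with Process.whereis/1 and compared against self().
defmodule SelfMembershipCheck do
  def member_of?(members) do
    Enum.any?(members, fn {name, node} ->
      node == Node.self() and Process.whereis(name) == self()
    end)
  end
end
```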

derekkraan commented 3 years ago

> By the way, when I tried to scale down from 4 to 3 again, an (EXIT) no process error similar to #202 happened on all 3 of the remaining nodes. […]

I believe this is a different scenario from #202; at least, the stack trace does not match. I have looked into this before but couldn't find anything obvious. I hope this isn't happening to people on a regular basis; it should be possible to reduce the size of your Horde cluster without the whole thing falling apart.