derekkraan / horde

Horde is a distributed Supervisor and Registry backed by DeltaCrdt
MIT License

ProcessesSupervisor errors in latest changes #202

Open broodfusion opened 4 years ago

broodfusion commented 4 years ago

Hi @derekkraan. We're seeing this type of error in our production environments with the latest changes to master (0.8.0-rc1). Any ideas what might be going on? Thank you.

"GenServer MyApp.HordeSupervisor terminating
** (stop) exited in: GenServer.call(MyApp.HordeSupervisor.ProcessesSupervisor, {:start_child, {139043552284050170652760012527017417122, {MyApp.MyGenserverOne, :start_link, [:prod]}, :transient, 2000, :worker, [MyApp.MyGenserverOne]}}, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir) lib/gen_server.ex:979: GenServer.call/3
    (horde) lib/horde/dynamic_supervisor_impl.ex:723: anonymous fn/2 in Horde.DynamicSupervisorImpl.add_children/2
    (elixir) lib/enum.ex:1327: Enum."-map/2-lists^map/1-0-"/2
    (horde) lib/horde/dynamic_supervisor_impl.ex:722: Horde.DynamicSupervisorImpl.add_children/2
    (horde) lib/horde/dynamic_supervisor_impl.ex:717: Horde.DynamicSupervisorImpl.add_child/2
    (horde) lib/horde/dynamic_supervisor_impl.ex:458: Horde.DynamicSupervisorImpl.update_process/2
    (horde) lib/horde/dynamic_supervisor_impl.ex:447: Horde.DynamicSupervisorImpl.update_processes/2
    (horde) lib/horde/dynamic_supervisor_impl.ex:368: Horde.DynamicSupervisorImpl.handle_info/2
Last message: {:crdt_update, [{:add, {:process, 139043552284050170652760012527017417122}, {nil, %{id: 139043552284050170652760012527017417122, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverOne, :start_link, [:prod]}}}}]}
State: %Horde.DynamicSupervisorImpl{distribution_strategy: Horde.UniformDistribution, local_process_count: 2, members: %{{MyApp.HordeSupervisor, :"my_app@10.108.0.42"} => 1, {MyApp.HordeSupervisor, :"my_app@10.108.0.77"} => 1}, members_info: %{{MyApp.HordeSupervisor, :"my_app@10.108.0.42"} => %Horde.DynamicSupervisor.Member{name: {MyApp.HordeSupervisor, :"my_app@10.108.0.42"}, status: :alive}, {MyApp.HordeSupervisor, :"my_app@10.108.0.77"} => %Horde.DynamicSupervisor.Member{name: {MyApp.HordeSupervisor, :"my_app@10.108.0.77"}, status: :alive}}, name: MyApp.HordeSupervisor, name_to_supervisor_ref: %{{MyApp.HordeSupervisor, :"my_app@10.108.0.42"} => #Reference<0.1659528466.1457258497.180559>, {MyApp.HordeSupervisor, :"my_app@10.108.0.77"} => #Reference<0.1659528466.1449394178.158814>}, process_pid_to_id: %{#PID<47289.3656.0> => 60049729485557864020035464727639173863, #PID<47289.29588.0> => 308414391369408068424083375462093892042, #PID<0.22314.1> => 37220166352763660241589602860206131129, #PID<0.22494.1> => 103707630487050312880151851415886106703, #PID<47289.24146.1> => 335365555518694765551024386705011670744, #PID<47289.24150.1> => 295606065741581592997048906704954190324, #PID<47289.9123.3> => 46943250822226434472899601278135854969, #PID<47289.9126.3> => 223839736457935555235310436147616710077, #PID<0.15366.3> => 93524138938539938460752548846708010326, #PID<0.15367.3> => 228380596272365970106109926260735109927, #PID<0.12070.8> => 59359474745989843639594694291617481637, #PID<0.12079.8> => 168376634893116552353013223590516454139, #PID<47289.8406.9> => 159283425290634914259661147584541874493, #PID<47289.8412.9> => 160298175814340306462529209354042456685, #PID<0.14475.9> => 241399070627733389706845353250870576955, #PID<0.14477.9> => 75172868045228669922237070563962299494, #PID<47289.18682.10> => 40464013558738934717243880967431347573, #PID<47289.18690.10> => 28628461186772289152942804054474825187, #PID<47289.26207.11> => 145962168925592573346700231879869152411, #PID<47289.26208.11> => 72159375650083944846068501717958844224, #PID<0.8134.13> => 139043552284050170652760012527017417122, #PID<0.8144.13> => 16272980902144909298437147632394051789}, processes_by_id: %{16272980902144909298437147632394051789 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.42"}, %{id: 16272980902144909298437147632394051789, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverTwo, :start_link, []}}, #PID<0.8144.13>}, 28628461186772289152942804054474825187 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.77"}, %{id: 28628461186772289152942804054474825187, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverOne, :start_link, [:prod]}}, #PID<47289.18690.10>}, 37220166352763660241589602860206131129 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.42"}, %{id: 37220166352763660241589602860206131129, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverTwo, :start_link, []}}, #PID<0.22314.1>}, 40464013558738934717243880967431347573 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.77"}, %{id: 40464013558738934717243880967431347573, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverTwo, :start_link, []}}, #PID<47289.18682.10>}, 46943250822226434472899601278135854969 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.77"}, %{id: 46943250822226434472899601278135854969, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverTwo, :start_link, []}}, #PID<47289.9123.3>}, 59359474745989843639594694291617481637 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.42"}, %{id: 59359474745989843639594694291617481637, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverTwo, :start_link, []}}, #PID<0.12070.8>}, 60049729485557864020035464727639173863 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.77"}, %{id: 60049729485557864020035464727639173863, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverTwo, :start_link, []}}, #PID<47289.3656.0>}, 72159375650083944846068501717958844224 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.77"}, %{id: 72159375650083944846068501717958844224, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverTwo, :start_link, []}}, #PID<47289.26208.11>}, 75172868045228669922237070563962299494 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.42"}, %{id: 75172868045228669922237070563962299494, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverOne, :start_link, [:prod]}}, #PID<0.14477.9>}, 93524138938539938460752548846708010326 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.42"}, %{id: 93524138938539938460752548846708010326, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverTwo, :start_link, []}}, #PID<0.15366.3>}, 103707630487050312880151851415886106703 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.42"}, %{id: 103707630487050312880151851415886106703, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverOne, :start_link, [:prod]}}, #PID<0.22494.1>}, 139043552284050170652760012527017417122 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.42"}, %{id: 139043552284050170652760012527017417122, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverOne, :start_link, [:prod]}}, #PID<0.8134.13>}, 145962168925592573346700231879869152411 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.77"}, %{id: 145962168925592573346700231879869152411, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverOne, :start_link, [:prod]}}, #PID<47289.26207.11>}, 159283425290634914259661147584541874493 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.77"}, %{id: 159283425290634914259661147584541874493, restart: :transient, shutdown: 2000, start: {MyApp.MyGenserverOne, :start_link, [:prod]}}, #PID<47289.8406.9>}, 160298175814340306462529209354042456685 => {{MyApp.HordeSupervisor, :"my_app@10.108.0.77"}, %{id: 1602981758143403064625292093540 (truncated)
derekkraan commented 4 years ago

Hi @broodfusion, could you try commit 98f10279754308368dc0a65e8dc0af91f42d9d50 on master and let me know if this solves the issue?
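(For anyone following along: a Mix dependency can be pinned to an exact commit with the standard ref: option for git deps. A minimal mix.exs sketch:)

  defp deps do
    [
      # pin horde to the exact commit under test
      {:horde, github: "derekkraan/horde", ref: "98f10279754308368dc0a65e8dc0af91f42d9d50"}
    ]
  end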

broodfusion commented 4 years ago

Hi @derekkraan thanks for pushing a hotfix. Will try it out and let you know.

broodfusion commented 4 years ago

@derekkraan we were able to upgrade to commit 98f10279754308368dc0a65e8dc0af91f42d9d50. However, the error is still occurring, the exact same error as above.

stuartish commented 4 years ago

FWIW we've been getting an identical error on commit 50a3cadf60fc330c7a436e6e0d29a05b2444c4bd (#190)

x-ji commented 4 years ago

I also get this error. It happened when a pod became temporarily unavailable in a k8s cluster, probably due to some automatic k8s maintenance operations (we're using libcluster's Kubernetes.DNSSRV strategy).

I was sometimes able to replicate it by first scaling up the k8s statefulset replicas by 1 and then scaling back down again... Is it possible that the supervisor gets confused when it can't find some process that died or became unreachable on another node?

When this happens, all worker processes under the DynamicSupervisor are stopped, which is very annoying.

I tried using static cluster membership to see if it would help with the issue, but the error still seems to happen.
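(Static membership means passing an explicit members: list instead of :auto when starting the supervisor. A minimal sketch with hypothetical node names:)

    children = [
      {Horde.DynamicSupervisor,
       name: MyApp.HordeSupervisor,
       strategy: :one_for_one,
       # explicit member list instead of members: :auto
       members: [
         {MyApp.HordeSupervisor, :"app@node-0"},
         {MyApp.HordeSupervisor, :"app@node-1"}
       ]}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)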

derekkraan commented 4 years ago

@x-ji in the stack trace shown above, the process being called is on the local node. Is the error exactly the same? Could you paste in a stacktrace just to be sure? And are you running 0.8.1?

x-ji commented 4 years ago

Sure, these are the logs from one such incident where the pod disconnected for a moment (the logs are from the pod assistant-service-2). The membership is :auto and the strategy is UniformQuorumDistribution.

I am running 0.8.1

13:57:52.084 [info] [libcluster:assistant_service] disconnected from :"assistant@assistant-service-1.assistant-service-headless.production.svc.cluster.local"
13:57:52.089 [info] "kcd_inbox" already started at #PID<0.6083.0>
13:57:52.104 [info] => starting worker for kcd_inbox
13:57:52.104 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor' had child 17011410246530512781312377734618215836 started with 'Elixir.Assistant.Inbox.Sync.Worker':start_link(#{account => #{'__meta__' => #{'__struct__' => 'Elixir.Ecto.Schema.Metadata',context => nil,prefix => ...,...},...}}) at <0.6083.0> exit with reason normal in context child_terminated
    (horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:723: anonymous fn/2 in Horde.DynamicSupervisorImpl.add_children/2
    (elixir 1.10.2) lib/enum.ex:2111: Enum."-reduce/3-lists^foldl/2-0-"/3
    (horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:376: Horde.DynamicSupervisorImpl.handle_info/2
Last message: {:crdt_update, [{:remove, {:member_node_info, {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-2.assistant-service-headless.production.svc.cluster.local"}}}, {:add, {:process, 212028067600978040211832200498216929049}, {{Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.production.svc.cluster.local"}, %{id: 212028067600978040211832200498216929049, restart: :permanent, start: {Assistant.Inbox.Sync.Worker, :start_link, [...]}}, #PID<53487.2288.5>}}, {:add, {:process, 185160046781961648370148793819973379262}, {{Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.production.svc.cluster.local"}, %{id: 185160046781961648370148793819973379262, restart: :permanent, start: {Assistant.Inbox.Sync.Worker, :start_link, [...]}}, #PID<53487.2289.5>}}, {:add, {:process, 248419496788914632477591056162044717759}, {{Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-1.assistant-service-headless.production.svc.cluster.local"}, %{id: 248419496788914632477591056162044717759, restart: :permanent, start: {Assistant.Inbox.Sync.Worker, :start_link, [...]}}, #PID<53488.3281.5>}}, {:add, {:process, 161996265316131395332152267908071840609}, {{Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.production.svc.cluster.local"}, %{id: 161996265316131395332152267908071840609, restart: :permanent, start: {Assistant.Inbox.Sync.Worker, :start_link, [...]}}, #PID<53487.2290.5>}}, {:add, {:member_node_info, {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-1.assistant-service-headless.production.svc.cluster.local"}}, %Horde.DynamicSupervisor.Member{name: {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-1.assistant-service-headless.production.svc.cluster.local"}, status: :alive}}, {:add, {:process, 299696710888083151024182740588974423924}, {{Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.production.svc.cluster.local"}, %{id: 299696710888083151024182740588974423924, restart: :permanent, start: {Assistant.Inbox.Sync.Worker, :start_link, [...]}}, #PID<53487.2291.5>}}]}
** (stop) exited in: GenServer.call(Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor, {:start_child, {312117801269135157859241902376097921414, {Assistant.Inbox.Sync.Worker, :start_link, [...]}, :permanent, 5000, :worker, [Assistant.Inbox.Sync.Worker]}}, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.10.2) lib/gen_server.ex:1013: GenServer.call/3
    (horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:722: Horde.DynamicSupervisorImpl.add_children/2
    (horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:717: Horde.DynamicSupervisorImpl.add_child/2
13:57:52.448 [error] GenServer Assistant.Inbox.Sync.Supervisor terminating
    (elixir 1.10.2) lib/enum.ex:1396: Enum."-map/2-lists^map/1-0-"/2
    (horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:426: anonymous fn/3 in Horde.DynamicSupervisorImpl.handoff_processes/1
13:57:52.441 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.Supervisor' had child 'Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor' started with 'Elixir.Horde.ProcessesSupervisor':start_link([{shutdown,infinity},{root_name,'Elixir.Assistant.Inbox.Sync.Supervisor'},{type,supervisor},{name,...},...]) at <0.6069.0> exit with reason normal in context child_terminated

        ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.10.2) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
    (stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
** (stop) exited in: GenServer.call(Assistant.Inbox.Sync.Supervisor, :horde_shutting_down, 5000)
    (elixir 1.10.2) lib/gen_server.ex:1023: GenServer.call/3
    (horde 0.8.1) lib/horde/signal_shutdown.ex:21: anonymous fn/1 in Horde.SignalShutdown.terminate/2
    (elixir 1.10.2) lib/enum.ex:783: Enum.each/2
13:57:52.450 [info] Starting Horde.DynamicSupervisorImpl with name Assistant.Inbox.Sync.Supervisor
13:57:52.449 [error] GenServer #PID<0.6070.0> terminating
    ** (EXIT) exited in: GenServer.call(Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor, {:start_child, {312117801269135157859241902376097921414, {Assistant.Inbox.Sync.Worker, :start_link, []}, :permanent, 5000, :worker, [Assistant.Inbox.Sync.Worker]}}, :infinity)
    (stdlib 3.11.2) gen_server.erl:673: :gen_server.try_terminate/3
    (stdlib 3.11.2) gen_server.erl:858: :gen_server.terminate/10
Last message: {:EXIT, #PID<0.6065.0>, :shutdown}
13:57:52.444 [error] gen_server 'Elixir.Assistant.Inbox.Sync.Supervisor' terminated with reason: no such process or port in call to 'Elixir.GenServer':call('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', {start_child,{312117801269135157859241902376097921414,{'Elixir.Assistant.Inbox.Sync.Worker',start_link,...},...}}, infinity) in 'Elixir.GenServer':call/3 line 1013

13:57:52.448 [error] CRASH REPORT Process 'Elixir.Assistant.Inbox.Sync.Supervisor' with 0 neighbours exited with reason: no such process or port in call to 'Elixir.GenServer':call('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', {start_child,{312117801269135157859241902376097921414,{'Elixir.Assistant.Inbox.Sync.Worker',start_link,...},...}}, infinity) in 'Elixir.GenServer':call/3 line 1013

13:57:52.449 [error] gen_server <0.6070.0> terminated with reason: {{noproc,{'Elixir.GenServer',call,['Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor',{start_child,{312117801269135157859241902376097921414,{'Elixir.Assistant.Inbox.Sync.Worker',start_link,[#{account => #{'__meta__' => #{'__struct__' => 'Elixir.Ecto.Schema.Metadata',context => nil,prefix => nil,schema => 'Elixir.Assistant.Inbox.Account',source => <<"inbox_accounts">>,state => loaded},'__struct__' => 'Elixir.Assistant.Inbox.Account',adapter_api_endpoint => <<"http://assistant-kme-...">>,...}}]},...}},...]}},...} in 'Elixir.GenServer':call/3 line 1023

13:57:52.450 [error] CRASH REPORT Process <0.6070.0> with 0 neighbours exited with reason: {{noproc,{'Elixir.GenServer',call,['Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor',{start_child,{312117801269135157859241902376097921414,{'Elixir.Assistant.Inbox.Sync.Worker',start_link,[#{account => #{'__meta__' => #{'__struct__' => 'Elixir.Ecto.Schema.Metadata',context => nil,prefix => nil,schema => 'Elixir.Assistant.Inbox.Account',source => <<"inbox_accounts">>,state => loaded},'__struct__' => 'Elixir.Assistant.Inbox.Account',adapter_api_endpoint => <<"http://assistant-kme-...">>,...}}]},...}},...]}},...} in 'Elixir.GenServer':call/3 line 1023

13:57:52.450 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.Supervisor' had child 'Elixir.Horde.SignalShutdown' started with 'Elixir.GenServer':start_link('Elixir.Horde.SignalShutdown', ['Elixir.Assistant.Inbox.Sync.Supervisor']) at <0.6070.0> exit with reason {{noproc,{'Elixir.GenServer',call,['Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor',{start_child,{312117801269135157859241902376097921414,{'Elixir.Assistant.Inbox.Sync.Worker',start_link,[#{account => #{'__meta__' => #{'__struct__' => 'Elixir.Ecto.Schema.Metadata',context => nil,prefix => nil,schema => 'Elixir.Assistant.Inbox.Account',source => <<"inbox_accounts">>,state => loaded},'__struct__' => 'Elixir.Assistant.Inbox.Account',adapter_api_endpoint => <<"http://assistant-kme-...">>,...}}]},...}},...]}},...} in context shutdown_error

13:57:52.452 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.Supervisor' had child 'Elixir.Horde.DynamicSupervisorImpl' started with 'Elixir.Horde.DynamicSupervisorImpl':start_link([{name,'Elixir.Assistant.Inbox.Sync.Supervisor'},{root_name,'Elixir.Assistant.Inbox.Sync.Supervisor'},...]) at <0.6068.0> exit with reason no such process or port in call to 'Elixir.GenServer':call('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', {start_child,{312117801269135157859241902376097921414,{'Elixir.Assistant.Inbox.Sync.Worker',start_link,...},...}}, infinity) in context shutdown_error
derekkraan commented 4 years ago

Can you check out this page in the docs and let me know if it helps your problem? https://hexdocs.pm/horde/eventual_consistency.html#horde-registry-merge-conflict

x-ji commented 4 years ago

Sure. We took a look at that when the bug happened, but from what we could tell, it says

The CRDT resolves the conflict and Horde.Registry sends an exit signal to the process that lost the conflict.

In our case, the problem is that some process was apparently killed, but there was no actual duplicate process running at the same time either; or that duplicate process was also killed by something else, maybe another registry process or a DynamicSupervisor process (which I think shouldn't be possible in the newest Horde, right?). So we're left with no such process running at all.

Prior to this 13:57:52.084 [info] [libcluster:assistant_service] disconnected from :"assistant@assistant-service-1.assistant-service-headless.production.svc.cluster.local" event, every worker process was running with exactly 1 copy across the nodes.
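(A process that traps exits can observe losing such a conflict. A minimal sketch, assuming the exit-reason shape documented on that page and Process.flag(:trap_exit, true) in init/1:)

  def handle_info({:EXIT, _pid, {:name_conflict, {_key, _value}, _registry, _winning_pid}}, state) do
    # this copy lost the registry merge conflict; exit cleanly instead of crashing
    {:stop, :normal, state}
  end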

x-ji commented 4 years ago

So what you mean is that the messages in the logs are actually normal when the registry tries to shut down a duplicate process? Then perhaps it was actually some netsplit problem that resulted in all processes, not only the duplicates, ending up killed?

derekkraan commented 4 years ago

I think something funny is happening here with this process kcd_inbox. That ProcessesSupervisor is actually an Elixir.DynamicSupervisor, and that will shut itself down (and be restarted by its parent Supervisor) if its children require too many restarts in a short period of time.

Can you paste in the start_link for that process?
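(For context: a DynamicSupervisor's restart intensity defaults to max_restarts: 3 within max_seconds: 5. A hedged sketch of loosening it, assuming Horde.DynamicSupervisor forwards these options like Elixir's DynamicSupervisor; names are hypothetical:)

    {Horde.DynamicSupervisor,
     name: MyApp.HordeSupervisor,
     strategy: :one_for_one,
     # allow more child restarts per window before the supervisor gives up
     max_restarts: 10,
     max_seconds: 5}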

derekkraan commented 4 years ago

A "network partition" is not necessarily a full net split, any arbitrary delay in messages arriving over the network is enough to consider it "split" (aka all real networks).

x-ji commented 4 years ago

The worker module:

  def start_link(%{account: account}) do
    inbox_name = account.identifier

    case GenServer.start_link(__MODULE__, %{account: account}, name: via_tuple(inbox_name)) do
      {:ok, pid} ->
        {:ok, pid}

      {:error, {:already_started, pid}} ->
        Logger.info("#{inspect(inbox_name)} already started at #{inspect(pid)}")
        :ignore
    end
  end
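(via_tuple/1 isn't shown above; with Horde it is typically a :via tuple through Horde.Registry. A hypothetical sketch, with the registry name assumed:)

  defp via_tuple(name) do
    # register/look up the worker through the distributed registry
    {:via, Horde.Registry, {Assistant.Registry, name}}
  end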

In application.ex:

        Horde.DynamicSupervisor.start_child(Assistant.Inbox.Sync.Supervisor, child)

where child is produced via

    Supervisor.child_spec({Worker, %{account: account}},
      id: "kcd_inbox",
      restart: :permanent
    )

Maybe something was not done correctly in trying to start this worker under the DynamicSupervisor provided by Horde.

A "network partition" is not necessarily a full net split, any arbitrary delay in messages arriving over the network is enough to consider it "split" (aka all real networks).

Sure. I guess "netsplit" was not necessarily the right word.

x-ji commented 4 years ago

After switching to static membership instead of :auto, similar termination messages still occur (I assume it's still the registry trying to kill duplicate processes, as mentioned), but the cluster behaves correctly: there is indeed already a process with the same name running on another node, so all the unique workers keep running, as expected.

Maybe it would make sense for me to open another issue (about all processes registered under a certain name being killed when using :auto membership), if the error messages I posted are expected and differ from the ones originally reported in this issue.

x-ji commented 4 years ago

Well actually today we saw a new error even with static membership...

08:28:56.768 [info] SIGTERM received - shutting down
08:28:56.773 [error] GenServer Assistant.Inbox.Sync.Supervisor terminating
** (MatchError) no match of right hand side value: nil
    (horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:247: Horde.DynamicSupervisorImpl.handle_cast/2
    (stdlib 3.11.2) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib 3.11.2) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:relinquish_child_process, 116008031710150461911961794667856352376}}

after which point all the workers except for one disappeared from the horde registry.

This happened at the time of a new deployment (we're using a k8s statefulset).

djthread commented 3 years ago

I've unfortunately encountered the original error in this issue, exactly, in production. It meant that my registered process died and never came back up.

I'm very interested in any ideas or mitigation measures on this one.

Thank you very much for your hard work on this tool, Derek. Using it has been nothing but joy until this one.

derekkraan commented 3 years ago

@djthread which version of Horde are you using?

djthread commented 3 years ago

0.8.3

aloukissas commented 3 years ago

FYI, we're seeing something very similar on 0.8.3. Happy to share our config or anything else that can help.

aloukissas commented 3 years ago

One thing I saw in the logs that could be useful in this discussion: libcluster reported "node disconnected" errors at timestamps close to where we see these Horde crashes. We've since upgraded libcluster to 3.3.0, which seems to fix issues with the Kubernetes.DNS strategy and reconnects nodes to the cluster more proactively.
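(For reference, a Kubernetes.DNS topology is configured roughly like this; the topology, service, and application names here are hypothetical:)

    config :libcluster,
      topologies: [
        assistant_service: [
          strategy: Cluster.Strategy.Kubernetes.DNS,
          config: [
            # headless service whose DNS records list the pod IPs
            service: "assistant-service-headless",
            application_name: "assistant"
          ]
        ]
      ]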

bernardo-martinez commented 2 years ago

We have also been seeing this happen using Horde 0.8.3 and libcluster 3.3.0; the processes somehow get killed, ending up with the error:

** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

I'm also glad to provide more examples or info in case it helps...

tyler-eon commented 1 year ago

Just wanted to chime in because I was also experiencing this issue.

I had to add the whole Process.flag(:trap_exit, true) dance to my GenServers because I was incorrectly assuming that :transient restart types would stay shut down on their own, but the docs specifically mention that a :transient process only stays shut down if it exits with a valid reason. This means that if you don't explicitly trap and handle your processes' terminations, then name conflicts will generate invalid exit reasons and cause the process to be restarted even though it shouldn't be.

You can read the details about restart values if you want, but here's the TL;DR for what you need when using Horde:

  1. If you set to :permanent then you basically can't run in distributed mode because it'll cause all sorts of problems.
  2. If you set to :transient then you must Process.flag(:trap_exit, true) and add a callback to handle the {:EXIT, _from, _reason} info message (see the init sketch after this list).
  3. If you set to :temporary then you don't need to do anything.
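(Item 2 in practice: the trap flag is typically set when the process starts. A minimal sketch; the state shape is hypothetical:)

  def init(state) do
    # convert incoming exit signals into {:EXIT, from, reason} info messages
    Process.flag(:trap_exit, true)
    {:ok, state}
  end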

My own exit handler is probably as simple as you can get:

  # Stop normally after losing a Horde.Registry name conflict, so the
  # :transient restart type treats this as a valid shutdown.
  def handle_info({:EXIT, _from, {:name_conflict, {_key, _value}, _registry, _pid}}, state) do
    {:stop, :normal, state}
  end

And that fixed it. My processes started returning {:stop, :normal, _}, which is what Elixir looks for when a :transient process is terminated. I don't know if this is the case for everyone or most people, but it solved my own issue with this bug.

Note

I'm specifically matching on {:name_conflict, ...} because (a) I have other info messages I need to trap and (b) if any other kind of error occurs, I do want my process to restart. I would highly recommend doing the same and only stopping if you see an actual valid termination reason. Of course, if any kind of error were a valid termination reason, then you should just avoid this extra code entirely by using the :temporary restart value, as that's effectively the same thing.