broodfusion opened this issue 4 years ago
Hi @broodfusion, could you try commit 98f10279754308368dc0a65e8dc0af91f42d9d50
on master and let me know if this solves the issue?
Hi @derekkraan thanks for pushing a hotfix. Will try it out and let you know.
@derekkraan we were able to upgrade to commit 98f10279754308368dc0a65e8dc0af91f42d9d50
However, the error is still occurring, exactly the same error as above.
FWIW we've been getting an identical error on commit 50a3cadf60fc330c7a436e6e0d29a05b2444c4bd
(#190)
I also get this error. It happens when a pod becomes temporarily unavailable in a k8s cluster, probably due to some automatic k8s maintenance operations (we're using libcluster's Kubernetes.DNSSRV strategy).
I was able to replicate it sometimes by first scaling up the k8s stateful replicas by 1 then scaling it down again... Is it possible that the process is confused about being unable to find some process that died/became unreachable on another node?
When this happens, all worker processes under the DynamicSupervisor are stopped, which is very annoying.
I tried to use the static cluster membership to see if it would help with the issue but the error still seems to happen.
@x-ji in the stack trace shown above, the process being called is on the local node. Is the error exactly the same? Could you paste in a stacktrace just to be sure? And are you running 0.8.1?
Sure, here are the logs from one such incident where the pod disconnected for a moment (the logs are from the pod assistant-service-2). The membership is :auto and the strategy is UniformQuorumDistribution. I am running 0.8.1.
13:57:52.084 [info] [libcluster:assistant_service] disconnected from :"assistant@assistant-service-1.assistant-service-headless.production.svc.cluster.local"
13:57:52.089 [info] "kcd_inbox" already started at #PID<0.6083.0>
13:57:52.104 [info] => starting worker for kcd_inbox
13:57:52.104 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor' had child 17011410246530512781312377734618215836 started with 'Elixir.Assistant.Inbox.Sync.Worker':start_link(#{account => #{'__meta__' => #{'__struct__' => 'Elixir.Ecto.Schema.Metadata',context => nil,prefix => ...,...},...}}) at <0.6083.0> exit with reason normal in context child_terminated
(horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:723: anonymous fn/2 in Horde.DynamicSupervisorImpl.add_children/2
(elixir 1.10.2) lib/enum.ex:2111: Enum."-reduce/3-lists^foldl/2-0-"/3
(horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:376: Horde.DynamicSupervisorImpl.handle_info/2
Last message: {:crdt_update, [{:remove, {:member_node_info, {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-2.assistant-service-headless.production.svc.cluster.local"}}}, {:add, {:process, 212028067600978040211832200498216929049}, {{Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.production.svc.cluster.local"}, %{id: 212028067600978040211832200498216929049, restart: :permanent, start: {Assistant.Inbox.Sync.Worker, :start_link, [...]}}, #PID<53487.2288.5>}}, {:add, {:process, 185160046781961648370148793819973379262}, {{Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.production.svc.cluster.local"}, %{id: 185160046781961648370148793819973379262, restart: :permanent, start: {Assistant.Inbox.Sync.Worker, :start_link, [...]}}, #PID<53487.2289.5>}}, {:add, {:process, 248419496788914632477591056162044717759}, {{Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-1.assistant-service-headless.production.svc.cluster.local"}, %{id: 248419496788914632477591056162044717759, restart: :permanent, start: {Assistant.Inbox.Sync.Worker, :start_link, [...]}}, #PID<53488.3281.5>}}, {:add, {:process, 161996265316131395332152267908071840609}, {{Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.production.svc.cluster.local"}, %{id: 161996265316131395332152267908071840609, restart: :permanent, start: {Assistant.Inbox.Sync.Worker, :start_link, [...]}}, #PID<53487.2290.5>}}, {:add, {:member_node_info, {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-1.assistant-service-headless.production.svc.cluster.local"}}, %Horde.DynamicSupervisor.Member{name: {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-1.assistant-service-headless.production.svc.cluster.local"}, status: :alive}}, {:add, {:process, 299696710888083151024182740588974423924}, {{Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.production.svc.cluster.local"}, %{id: 299696710888083151024182740588974423924, restart: :permanent, start: {Assistant.Inbox.Sync.Worker, :start_link, [...]}}, #PID<53487.2291.5>}}]}
** (stop) exited in: GenServer.call(Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor, {:start_child, {312117801269135157859241902376097921414, {Assistant.Inbox.Sync.Worker, :start_link, [...]}, :permanent, 5000, :worker, [Assistant.Inbox.Sync.Worker]}}, :infinity)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
(elixir 1.10.2) lib/gen_server.ex:1013: GenServer.call/3
(horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:722: Horde.DynamicSupervisorImpl.add_children/2
(horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:717: Horde.DynamicSupervisorImpl.add_child/2
13:57:52.448 [error] GenServer Assistant.Inbox.Sync.Supervisor terminating
(elixir 1.10.2) lib/enum.ex:1396: Enum."-map/2-lists^map/1-0-"/2
(horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:426: anonymous fn/3 in Horde.DynamicSupervisorImpl.handoff_processes/1
13:57:52.441 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.Supervisor' had child 'Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor' started with 'Elixir.Horde.ProcessesSupervisor':start_link([{shutdown,infinity},{root_name,'Elixir.Assistant.Inbox.Sync.Supervisor'},{type,supervisor},{name,...},...]) at <0.6069.0> exit with reason normal in context child_terminated
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
(elixir 1.10.2) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
(stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
** (stop) exited in: GenServer.call(Assistant.Inbox.Sync.Supervisor, :horde_shutting_down, 5000)
(elixir 1.10.2) lib/gen_server.ex:1023: GenServer.call/3
(horde 0.8.1) lib/horde/signal_shutdown.ex:21: anonymous fn/1 in Horde.SignalShutdown.terminate/2
(elixir 1.10.2) lib/enum.ex:783: Enum.each/2
13:57:52.450 [info] Starting Horde.DynamicSupervisorImpl with name Assistant.Inbox.Sync.Supervisor
13:57:52.449 [error] GenServer #PID<0.6070.0> terminating
** (EXIT) exited in: GenServer.call(Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor, {:start_child, {312117801269135157859241902376097921414, {Assistant.Inbox.Sync.Worker, :start_link, []}, :permanent, 5000, :worker, [Assistant.Inbox.Sync.Worker]}}, :infinity)
(stdlib 3.11.2) gen_server.erl:673: :gen_server.try_terminate/3
(stdlib 3.11.2) gen_server.erl:858: :gen_server.terminate/10
Last message: {:EXIT, #PID<0.6065.0>, :shutdown}
13:57:52.444 [error] gen_server 'Elixir.Assistant.Inbox.Sync.Supervisor' terminated with reason: no such process or port in call to 'Elixir.GenServer':call('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', {start_child,{312117801269135157859241902376097921414,{'Elixir.Assistant.Inbox.Sync.Worker',start_link,...},...}}, infinity) in 'Elixir.GenServer':call/3 line 1013
13:57:52.448 [error] CRASH REPORT Process 'Elixir.Assistant.Inbox.Sync.Supervisor' with 0 neighbours exited with reason: no such process or port in call to 'Elixir.GenServer':call('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', {start_child,{312117801269135157859241902376097921414,{'Elixir.Assistant.Inbox.Sync.Worker',start_link,...},...}}, infinity) in 'Elixir.GenServer':call/3 line 1013
13:57:52.449 [error] gen_server <0.6070.0> terminated with reason: {{noproc,{'Elixir.GenServer',call,['Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor',{start_child,{312117801269135157859241902376097921414,{'Elixir.Assistant.Inbox.Sync.Worker',start_link,[#{account => #{'__meta__' => #{'__struct__' => 'Elixir.Ecto.Schema.Metadata',context => nil,prefix => nil,schema => 'Elixir.Assistant.Inbox.Account',source => <<"inbox_accounts">>,state => loaded},'__struct__' => 'Elixir.Assistant.Inbox.Account',adapter_api_endpoint => <<"http://assistant-kme-...">>,...}}]},...}},...]}},...} in 'Elixir.GenServer':call/3 line 1023
13:57:52.450 [error] CRASH REPORT Process <0.6070.0> with 0 neighbours exited with reason: {{noproc,{'Elixir.GenServer',call,['Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor',{start_child,{312117801269135157859241902376097921414,{'Elixir.Assistant.Inbox.Sync.Worker',start_link,[#{account => #{'__meta__' => #{'__struct__' => 'Elixir.Ecto.Schema.Metadata',context => nil,prefix => nil,schema => 'Elixir.Assistant.Inbox.Account',source => <<"inbox_accounts">>,state => loaded},'__struct__' => 'Elixir.Assistant.Inbox.Account',adapter_api_endpoint => <<"http://assistant-kme-...">>,...}}]},...}},...]}},...} in 'Elixir.GenServer':call/3 line 1023
13:57:52.450 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.Supervisor' had child 'Elixir.Horde.SignalShutdown' started with 'Elixir.GenServer':start_link('Elixir.Horde.SignalShutdown', ['Elixir.Assistant.Inbox.Sync.Supervisor']) at <0.6070.0> exit with reason {{noproc,{'Elixir.GenServer',call,['Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor',{start_child,{312117801269135157859241902376097921414,{'Elixir.Assistant.Inbox.Sync.Worker',start_link,[#{account => #{'__meta__' => #{'__struct__' => 'Elixir.Ecto.Schema.Metadata',context => nil,prefix => nil,schema => 'Elixir.Assistant.Inbox.Account',source => <<"inbox_accounts">>,state => loaded},'__struct__' => 'Elixir.Assistant.Inbox.Account',adapter_api_endpoint => <<"http://assistant-kme-...">>,...}}]},...}},...]}},...} in context shutdown_error
13:57:52.452 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.Supervisor' had child 'Elixir.Horde.DynamicSupervisorImpl' started with 'Elixir.Horde.DynamicSupervisorImpl':start_link([{name,'Elixir.Assistant.Inbox.Sync.Supervisor'},{root_name,'Elixir.Assistant.Inbox.Sync.Supervisor'},...]) at <0.6068.0> exit with reason no such process or port in call to 'Elixir.GenServer':call('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', {start_child,{312117801269135157859241902376097921414,{'Elixir.Assistant.Inbox.Sync.Worker',start_link,...},...}}, infinity) in context shutdown_error
Can you check out this page in the docs and let me know if it helps your problem? https://hexdocs.pm/horde/eventual_consistency.html#horde-registry-merge-conflict
Sure. We took a look at that when the bug happened, but from what we could tell, it says
The CRDT resolves the conflict and Horde.Registry sends an exit signal to the process that lost the conflict.
In our case, the problem is that some process was apparently killed, but there is no actual duplicate process running at the same time either; or that duplicate process is also killed by something else, maybe another registry process or a DynamicSupervisor process (which I think shouldn't be possible in the newest Horde, right?). So we're left with no such process running at all.
Prior to this event (13:57:52.084 [info] [libcluster:assistant_service] disconnected from :"assistant@assistant-service-1.assistant-service-headless.production.svc.cluster.local"), every worker process was running with exactly 1 copy across the nodes.
So what you mean is that the messages in the logs are actually normal when the registry tries to shut down a duplicate process. Then perhaps it was actually some netsplit problem that resulted in all processes, not only the duplicates, ending up killed?
I think something funny is happening here with this process kcd_inbox. That ProcessesSupervisor is actually an Elixir DynamicSupervisor, and it will shut itself down (and be restarted by its parent Supervisor) if its children require too many restarts in a short period of time.
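For reference, a plain DynamicSupervisor controls this restart intensity through its max_restarts and max_seconds options; a minimal sketch (the supervisor name here is made up, and the values shown are the defaults):

DynamicSupervisor.start_link(
  name: MyApp.ProcessesSupervisor,
  strategy: :one_for_one,
  # shut the supervisor itself down if children need more than 3 restarts within 5 seconds
  max_restarts: 3,
  max_seconds: 5
)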
Can you paste in the start_link for that process?
A "network partition" is not necessarily a full netsplit; any arbitrary delay in messages arriving over the network is enough to consider it "split" (which describes all real networks).
The worker module:
def start_link(%{account: account}) do
  inbox_name = account.identifier

  case GenServer.start_link(__MODULE__, %{account: account}, name: via_tuple(inbox_name)) do
    {:ok, pid} ->
      {:ok, pid}

    {:error, {:already_started, pid}} ->
      Logger.info("#{inspect(inbox_name)} already started at #{inspect(pid)}")
      :ignore
  end
end
In application.ex:
Horde.DynamicSupervisor.start_child(Assistant.Inbox.Sync.Supervisor, child)
where child is produced via
Supervisor.child_spec({Worker, %{account: account}},
  id: "kcd_inbox",
  restart: :permanent
)
Maybe something was not done correctly in trying to start this worker under the DynamicSupervisor provided by Horde.
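For context, the via_tuple/1 used in start_link above is essentially the standard Horde.Registry via tuple, roughly like this (the registry name MyApp.Registry is a placeholder, not our real module name):

defp via_tuple(inbox_name) do
  # register the worker under inbox_name in a cluster-wide Horde.Registry
  {:via, Horde.Registry, {MyApp.Registry, inbox_name}}
end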
A "network partition" is not necessarily a full netsplit; any arbitrary delay in messages arriving over the network is enough to consider it "split" (which describes all real networks).
Sure. I guess "netsplit" was not necessarily the right word.
After using static membership instead of :auto, similar termination messages still occur (I assume it's still the registry trying to kill duplicate processes, as mentioned), but the cluster behaves correctly and there is indeed already a process with the same name running on another node, so all the unique workers kept running, as expected.
Maybe it would make sense for me to open another issue (about the problem of all processes registered under a certain name being killed when using :auto membership), if the error messages I posted are to be expected and different from the ones originally in this issue.
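For reference, static membership here means passing an explicit members list when starting the Horde supervisor instead of members: :auto; a rough sketch using the node names from the logs above (the exact options in our app may differ):

Horde.DynamicSupervisor.start_link(
  name: Assistant.Inbox.Sync.Supervisor,
  strategy: :one_for_one,
  distribution_strategy: Horde.UniformQuorumDistribution,
  # static membership: list every {name, node} pair explicitly
  members: [
    {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.production.svc.cluster.local"},
    {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-1.assistant-service-headless.production.svc.cluster.local"},
    {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-2.assistant-service-headless.production.svc.cluster.local"}
  ]
)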
Well actually today we saw a new error even with static membership...
08:28:56.768 [info] SIGTERM received - shutting down
08:28:56.773 [error] GenServer Assistant.Inbox.Sync.Supervisor terminating
(horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:247: Horde.DynamicSupervisorImpl.handle_cast/2
(stdlib 3.11.2) gen_server.erl:637: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", {:relinquish_child_process, 116008031710150461911961794667856352376}}
** (MatchError) no match of right hand side value: nil
(stdlib 3.11.2) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
after which point all the workers except for one disappeared from the horde registry.
This happened at the time of a new deployment (we're using a k8s statefulset).
I've unfortunately encountered the original error in this issue, exactly, in production. It meant that my registered process died and never came back up.
I'm very interested in any ideas or mitigation measures on this one.
Thank you very much for your hard work on this tool, Derek. Using it has been nothing but joy until this one.
@djthread which version of Horde are you using?
0.8.3
FYI, we're seeing something very similar on 0.8.3. Happy to share our config or anything else that can help.
What I saw in the logs that could be useful in this discussion is that libcluster reported "node disconnected" errors at timestamps close to where we see these Horde crashes. We've since upgraded libcluster to 3.3.0, which seems to fix issues with the Kubernetes.DNS strategy and reconnects nodes to the cluster more proactively.
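For anyone comparing setups, a typical Kubernetes.DNS topology for libcluster looks roughly like this (the service and application names below are placeholders):

config :libcluster,
  topologies: [
    k8s: [
      # Kubernetes.DNS resolves the headless service's DNS records to find peer pods
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "myapp-headless",
        application_name: "myapp",
        polling_interval: 10_000
      ]
    ]
  ]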
We have also been seeing this happen with horde 0.8.3 and libcluster 3.3.0; the processes somehow get killed, ending up in the error:
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
I'm also glad to help by providing more examples or info in case it helps...
Just wanted to chime in because I was also experiencing this issue.
I had to add the whole Process.flag(:trap_exit, true) handling to my GenServer because I was incorrectly assuming that :transient restart types would stay shut down on their own, but the docs specifically mention that a process only stays shut down if it exits with a valid exit reason. This means that if you don't explicitly trap and handle your process's termination, then conflicts will generate invalid exit reasons and cause the process to try starting up again even though it shouldn't.
You can read the details about restart values if you want, but here's the TL;DR for what you need when using Horde:
- If you use :permanent, then you basically can't run in distributed mode because it'll cause all sorts of problems.
- If you use :transient, then you must Process.flag(:trap_exit, true) and add a callback to handle the {:EXIT, _from, _reason} info message.
- If you use :temporary, then you don't need to do anything.
My own exit handler is probably as simple as you can get:
def handle_info({:EXIT, _from, {:name_conflict, {_key, _value}, _registry, _pid}}, state) do
  {:stop, :normal, state}
end
And that fixed it. My processes started stopping with {:stop, :normal, _}, which is what Elixir looks for when deciding whether a :transient process should stay stopped. I don't know if this is the case for everyone/most people, but it solved my own issue with this bug.
I'm specifically matching on {:name_conflict, ...} because (a) I have other info messages I need to trap and (b) if any other kind of error occurs, I do want my process to restart. I would highly recommend doing the same and only stopping if you see an actual valid termination reason. Of course, if any kind of error were a valid termination reason, then you should just avoid this extra code entirely by using the :temporary restart value, as that's effectively the same thing.
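Putting those pieces together, a minimal sketch of a :transient worker that traps exits and stops cleanly when it loses a name conflict might look like this (the module and registry names are assumptions, not from any code in this thread):

defmodule MyApp.Worker do
  use GenServer, restart: :transient

  def start_link(name) do
    GenServer.start_link(__MODULE__, name, name: {:via, Horde.Registry, {MyApp.Registry, name}})
  end

  @impl true
  def init(name) do
    # trap exits so the registry's name-conflict exit arrives as an info message
    Process.flag(:trap_exit, true)
    {:ok, name}
  end

  @impl true
  def handle_info({:EXIT, _from, {:name_conflict, {_key, _value}, _registry, _pid}}, state) do
    # we lost the name conflict: stop with :normal so the supervisor does not restart us
    {:stop, :normal, state}
  end
end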
Hi @derekkraan, we're seeing this type of error in our production environments with the latest changes to master (0.80-rc1). Any ideas what might be going on? Thank you.