derekkraan / horde

Horde is a distributed Supervisor and Registry backed by DeltaCrdt

Supervisor deadlock situation #217

Closed arjan closed 4 years ago

arjan commented 4 years ago

One of my nodes appears to end up in a deadlock: the Horde.DynamicSupervisorImpl process is blocked in a call to the ProcessesSupervisor, while the ProcessesSupervisor is itself blocked in a call to Horde.DynamicSupervisorImpl.

Stack trace of the Horde.DynamicSupervisorImpl GenServer:

    current_stacktrace: [
      {:gen, :do_call, 4, [file: 'gen.erl', line: 208]},
      {GenServer, :call, 3, [file: 'lib/gen_server.ex', line: 1024]},
      {Horde.DynamicSupervisorImpl, :"-add_children/2-fun-1-", 2,
       [file: 'lib/horde/dynamic_supervisor_impl.ex', line: 723]},
      {Enum, :"-map/2-lists^map/1-0-", 2, [file: 'lib/enum.ex', line: 1399]},
      {Horde.DynamicSupervisorImpl, :add_children, 2,
       [file: 'lib/horde/dynamic_supervisor_impl.ex', line: 722]},
      {Horde.DynamicSupervisorImpl, :add_child, 2,
       [file: 'lib/horde/dynamic_supervisor_impl.ex', line: 717]},
      {Horde.DynamicSupervisorImpl, :handle_call, 3,
       [file: 'lib/horde/dynamic_supervisor_impl.ex', line: 156]},
      {Horde.DynamicSupervisorImpl, :handle_info, 2,
       [file: 'lib/horde/dynamic_supervisor_impl.ex', line: 333]}
    ]

And the stack trace of the Horde.ProcessesSupervisor process:

    current_stacktrace: [
      {:gen, :do_call, 4, [file: 'gen.erl', line: 208]},
      {GenServer, :call, 3, [file: 'lib/gen_server.ex', line: 1024]},
      {Horde.ProcessesSupervisor, :restart_child, 3,
       [file: 'lib/horde/processes_supervisor.ex', line: 1047]},
      {Horde.ProcessesSupervisor, :handle_info, 2,
       [file: 'lib/horde/processes_supervisor.ex', line: 799]},
      {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 680]},
      {:gen_server, :handle_msg, 6, [file: 'gen_server.erl', line: 756]},
      {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 226]}
    ]

It seems to have something to do with a child being restarted.
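
The cycle in the two stack traces can be reproduced in isolation. The following is only a minimal sketch, not Horde's actual code: `DeadlockSketch.Sup` and `DeadlockSketch.Impl` are hypothetical stand-ins for the two processes, and the message names (`:add_child`, `:start_child`, `:child_exited`, `:update_child_pid`) are assumptions chosen to mirror the traces. The sleeps are there only to force the interleaving.

```elixir
defmodule DeadlockSketch do
  # Hypothetical stand-ins for the two Horde processes; not Horde's real code.
  defmodule Sup do
    use GenServer

    @impl true
    def init(:ok), do: {:ok, nil}

    @impl true
    def handle_call({:set_impl, impl}, _from, _state), do: {:reply, :ok, impl}
    def handle_call(:start_child, _from, impl), do: {:reply, :ok, impl}

    # A child exit makes the supervisor call back into the impl process.
    @impl true
    def handle_info(:child_exited, impl) do
      GenServer.call(impl, :update_child_pid, :infinity)
      {:noreply, impl}
    end
  end

  defmodule Impl do
    use GenServer

    @impl true
    def init(sup), do: {:ok, sup}

    @impl true
    def handle_call(:add_child, _from, sup) do
      # Sleep so the supervisor reliably starts its own call first.
      Process.sleep(100)
      {:reply, GenServer.call(sup, :start_child, :infinity), sup}
    end

    def handle_call(:update_child_pid, _from, sup), do: {:reply, :ok, sup}
  end

  def run do
    {:ok, sup} = GenServer.start(Sup, :ok)
    {:ok, impl} = GenServer.start(Impl, sup)
    :ok = GenServer.call(sup, {:set_impl, impl})

    # Start add_child without waiting for its reply, then trigger the restart path.
    spawn(fn -> GenServer.call(impl, :add_child, :infinity) end)
    Process.sleep(20)
    send(sup, :child_exited)

    # Probe from outside with a finite timeout: a stuck process never answers.
    result =
      try do
        GenServer.call(impl, :update_child_pid, 500)
      catch
        :exit, {:timeout, _} -> :deadlocked
      end

    Process.exit(sup, :kill)
    Process.exit(impl, :kill)
    result
  end
end
```

Here `DeadlockSketch.run()` returns `:deadlocked`: each process sits in `GenServer.call/3` waiting for the other, and because both calls use `:infinity` neither ever gives up.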

Maybe one of these two call sites should use a normal (finite) timeout instead of :infinity?
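
To illustrate what a finite timeout changes, here is a rough sketch. `TimeoutSketch.safe_call/3` and the `Slow` server are hypothetical and not part of Horde; the only real API used is `GenServer.call/3`, which exits the caller with `{:timeout, _}` when no reply arrives in time.

```elixir
defmodule TimeoutSketch do
  # Hypothetical helper: turn a timed-out call into an error tuple
  # instead of blocking forever (or crashing the caller).
  def safe_call(server, request, timeout \\ 5_000) do
    {:ok, GenServer.call(server, request, timeout)}
  catch
    :exit, {:timeout, _} -> {:error, :timeout}
  end

  # Hypothetical server that can be told to stay busy for a while.
  defmodule Slow do
    use GenServer

    @impl true
    def init(:ok), do: {:ok, nil}

    @impl true
    def handle_call(:ping, _from, state), do: {:reply, :pong, state}

    def handle_call({:block, ms}, _from, state) do
      Process.sleep(ms)
      {:reply, :done, state}
    end
  end

  def run do
    {:ok, pid} = GenServer.start(Slow, :ok)
    fast = safe_call(pid, :ping, 100)
    slow = safe_call(pid, {:block, 300}, 50)
    Process.exit(pid, :kill)
    {fast, slow}
  end
end
```

`TimeoutSketch.run()` returns `{{:ok, :pong}, {:error, :timeout}}`. A timeout would break the cycle, but the side that times out still has to decide what to do with the failed operation (retry, crash and restart, or drop it), so this treats the symptom rather than removing the mutual call.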

derekkraan commented 4 years ago

Another option: update_child_pid_horde could become a cast?
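
As a sketch of the cast idea, assuming the same hypothetical stand-in processes as in a minimal two-GenServer model of the cycle (none of this is Horde's actual code, and `:update_child_pid` is an assumed message name standing in for update_child_pid_horde), the supervisor side no longer waits for a reply:

```elixir
defmodule CastSketch do
  # Hypothetical stand-ins for the two Horde processes; not Horde's real code.
  defmodule Sup do
    use GenServer

    @impl true
    def init(:ok), do: {:ok, nil}

    @impl true
    def handle_call({:set_impl, impl}, _from, _state), do: {:reply, :ok, impl}
    def handle_call(:start_child, _from, impl), do: {:reply, :ok, impl}

    @impl true
    def handle_info(:child_exited, impl) do
      # Fire-and-forget: the supervisor does not block on the impl process.
      GenServer.cast(impl, {:update_child_pid, self()})
      {:noreply, impl}
    end
  end

  defmodule Impl do
    use GenServer

    @impl true
    def init(sup), do: {:ok, %{sup: sup, updates: 0}}

    @impl true
    def handle_call(:add_child, _from, state) do
      # Same forced interleaving as in the deadlocking variant.
      Process.sleep(100)
      {:reply, GenServer.call(state.sup, :start_child, :infinity), state}
    end

    def handle_call(:updates, _from, state), do: {:reply, state.updates, state}

    @impl true
    def handle_cast({:update_child_pid, _pid}, state) do
      {:noreply, %{state | updates: state.updates + 1}}
    end
  end

  def run do
    {:ok, sup} = GenServer.start(Sup, :ok)
    {:ok, impl} = GenServer.start(Impl, sup)
    :ok = GenServer.call(sup, {:set_impl, impl})

    task = Task.async(fn -> GenServer.call(impl, :add_child, :infinity) end)
    Process.sleep(20)
    send(sup, :child_exited)

    added = Task.await(task, 1_000)
    updates = GenServer.call(impl, :updates, 1_000)

    GenServer.stop(impl)
    GenServer.stop(sup)
    {added, updates}
  end
end
```

With the same interleaving that deadlocks the call-based version, `CastSketch.run()` returns `{:ok, 1}`: the supervisor stays free to answer `:start_child`, and the queued update is applied once the impl process finishes `:add_child`. The trade-off is that a cast gives no confirmation, so the supervisor cannot tell whether or when the update was applied.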

derekkraan commented 4 years ago

I think this issue is related: #211

arjan commented 4 years ago

This test seems to reproduce it.