derekkraan / horde

Horde is a distributed Supervisor and Registry backed by DeltaCrdt
MIT License

Child PID missing from state on shutdown #195

Closed (brndnmtthws closed this issue 4 years ago)

brndnmtthws commented 4 years ago

While doing a deployment on Kubernetes, I occasionally noticed this error:

00:12:09.100 [error] GenServer Service.Scheduler.HordeSupervisor terminating
** (MatchError) no match of right hand side value: nil
    (horde 0.7.1) lib/horde/dynamic_supervisor_impl.ex:252: Horde.DynamicSupervisorImpl.handle_cast/2
    (stdlib 3.11.2) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib 3.11.2) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:relinquish_child_process, 120031833532712903379486492195407090876}}
00:12:09.102 [error] GenServer #PID<0.4364.0> terminating
** (stop) exited in: GenServer.call(Service.Scheduler.HordeSupervisor, :horde_shutting_down, 5000)
    ** (EXIT) an exception was raised:
        ** (MatchError) no match of right hand side value: nil
            (horde 0.7.1) lib/horde/dynamic_supervisor_impl.ex:252: Horde.DynamicSupervisorImpl.handle_cast/2
            (stdlib 3.11.2) gen_server.erl:637: :gen_server.try_dispatch/4
            (stdlib 3.11.2) gen_server.erl:711: :gen_server.handle_msg/6
            (stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
    (elixir 1.10.1) lib/gen_server.ex:1023: GenServer.call/3
    (horde 0.7.1) lib/horde/signal_shutdown.ex:21: anonymous fn/1 in Horde.SignalShutdown.terminate/2
    (elixir 1.10.1) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
    (elixir 1.10.1) lib/enum.ex:783: Enum.each/2
    (stdlib 3.11.2) gen_server.erl:673: :gen_server.try_terminate/3
    (stdlib 3.11.2) gen_server.erl:858: :gen_server.terminate/10
    (stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:EXIT, #PID<0.4360.0>, :shutdown}

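It looks like the root cause is that the child in question is not present in the node's local state when the `:relinquish_child_process` cast arrives, so the lookup returns `nil` and the pattern match fails. This could just be a matter of the CRDT not being fully synced yet.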
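The failure mode can be reproduced in isolation with a plain map. The sketch below is only an illustration, not the Horde source: the module `RelinquishSketch`, the function `strict_lookup/2`, and the element names in the three-element tuple are hypothetical, and the assumption is that the crashing line pattern-matches directly on the result of a map lookup.

```elixir
defmodule RelinquishSketch do
  # Hypothetical stand-in for the lookup on the crashing line; the real
  # code at dynamic_supervisor_impl.ex:252 is not reproduced here.
  def strict_lookup(processes_by_id, child_id) do
    # Map.get/2 returns nil for a missing key, and matching nil against
    # a three-element tuple raises MatchError.
    {_node, child, _pid} = Map.get(processes_by_id, child_id)
    child
  end
end

# Hypothetical local state holding a single known child, keyed by id.
processes_by_id = %{1 => {:node_a, %{id: 1}, self()}}

# Known id: the match succeeds and the child is returned.
%{id: 1} = RelinquishSketch.strict_lookup(processes_by_id, 1)

# Unknown id (for example, one that has not been synced yet): the match raises.
result =
  try do
    RelinquishSketch.strict_lookup(processes_by_id, 2)
  rescue
    e in MatchError -> {:crashed, e.term}
  end

IO.inspect(result)
```

Inside a GenServer callback there is no `rescue` around the match, so the exception terminates the process, which matches the first log entry above.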

Anyway, I think it would be sensible to handle this gracefully (and not crash the process). Something like this should work:

  def handle_cast({:relinquish_child_process, child_id}, state) do
    # signal to the rest of the nodes that this process has been relinquished
    # (to the Horde!) by its parent
    case Map.get(state.processes_by_id, child_id) do
      {_, child, _} ->
        :ok =
          DeltaCrdt.mutate(
            crdt_name(state.name),
            :add,
            [{:process, child.id}, {nil, child}]
          )

      nil ->
        # the process doesn't exist in the local state (state not in sync?),
        # so there is nothing to relinquish; skip the CRDT update
        nil
    end

    {:noreply, state}
  end

I'll just add it to my existing PR (#194).