Feature: Graceful shutdown with handoff

mbaeuerle commented 6 years ago

Intention:

Create a function which allows for manual handoff of a specific worker (called handoff here, but the naming is open to discussion). It takes the name of the worker to be moved and its current state.
The Tracker retrieves the registry entry of the current worker.
It then asks the strategy which other node the worker should be placed on (by removing the current node from the strategy first).
If there is another node available, the worker's state is moved to it in much the same way as on a cluster resize.
The handoff call can then be placed in the worker's terminate/2 function, provided the process traps exits (the :trap_exit flag is set). Receiving a SIGTERM should then trigger the handoff of the current state.
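
The node-selection step described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `HandoffSketch`, `key_to_node/2` and `handoff_target/3` are hypothetical names, and the hash-based placement is a stand-in for whatever distribution strategy Swarm is configured with.

```elixir
defmodule HandoffSketch do
  @moduledoc false

  # Hypothetical stand-in for a distribution strategy: maps a key onto one of
  # the given nodes by hashing. Swarm's real strategies work differently.
  def key_to_node([], _key), do: :undefined

  def key_to_node(nodes, key) do
    sorted = Enum.sort(nodes)
    Enum.at(sorted, :erlang.phash2(key, length(sorted)))
  end

  # Pick the node a worker should be handed off to: drop the current node
  # from the candidate list first, then ask the strategy as usual.
  def handoff_target(nodes, current_node, worker_name) do
    nodes
    |> List.delete(current_node)
    |> key_to_node(worker_name)
  end
end
```

With only the current node in the list, `HandoffSketch.handoff_target([:a@host], :a@host, "worker")` returns `:undefined`, which corresponds to the "no other node" case where no handoff can take place.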

Parts I am not sure about:

I don't know if I handled the clock correctly.
I am unsure whether temporarily removing the current node before calling Strategy.key_to_node/1 leaves any inconsistencies in the registry that could lead to a broken state.

As @erikreedstrom mentioned, every worker can be extended like so:

defmodule MySwarmWorker do
  use GenServer

  def init(state) do
    Process.flag(:trap_exit, true)
    #...
    {:ok, state}
  end

  #...

  # The exit reason is not used here, hence the underscore prefix
  def terminate(_reason, state) do
    Swarm.Tracker.handoff(__MODULE__, state)
  end
end

tschmittni commented 6 years ago

We have been using this feature for a couple of weeks now and it seems to work really well. There are a few race conditions which can occur during the handover, but that's somewhat expected.

In general, it would probably be better to change the design of Swarm to allow connecting and disconnecting nodes explicitly, rather than relying only on Erlang's connected nodes. But this should be a separate design discussion.

+1 to get this merged!

dazuma commented 6 years ago

I've also been experimenting with this branch, and it's been working fine for me. I may end up demoing it in my ElixirConf talk in September, so it would be great if this branch or something like it could be merged and released by then. :)

Edit: I ended up switching to Horde for my demo, so no hurry on this.

bitwalker commented 6 years ago

I think this mostly works. However, the one thing that I believe is going to be a problem is that when another node goes down, or a new one joins, the ring will be rebalanced, which means any "manually" moved registrations will be moved back to their origin (or potentially to a new node). If that seems acceptable, I think this can be merged.

Thoughts?

beardedeagle commented 6 years ago

If @mbaeuerle is fine with what @bitwalker mentioned and the conflict is resolved, I see no issue with merging either.

mbaeuerle commented 6 years ago

@bitwalker so the rebalancing is only an issue if the node doing the graceful shutdown is still online at that time, if I understand that correctly. But this should not happen if the intention is to shut down immediately after the handover.

arjan commented 6 years ago

@mbaeuerle indeed, that's also how I use it now. However with this API design, Swarm.Tracker.handoff/2 can be called at any time, not just at node shutdown.

To me this is fine, however, it might be good to mention this behaviour in the function's documentation.

mbaeuerle commented 6 years ago

I have now added documentation. Let me know what you think or if it can be written more clearly in any way.

alex88 commented 5 years ago

Is there any documentation of the full usage? I've tried this, and if I call the handoff manually before the whole node is shut down it works. If I gracefully shut down the node instead, I sometimes get:

** (stop) exited in: :gen_statem.call(Swarm.Tracker, {:handoff, "swarm_worker_name", %{matches: %{}}}, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (stdlib) gen.erl:228: :gen.do_for_proc/2
    (stdlib) gen_statem.erl:598: :gen_statem.call_dirty/4
    (stdlib) gen_server.erl:673: :gen_server.try_terminate/3
    (stdlib) gen_server.erl:858: :gen_server.terminate/10
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:EXIT, #PID<0.391.0>, :shutdown}

Other times the terminate function is called but nothing else happens.

Maybe the tracker can shut down before the worker?
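
If the tracker process is indeed already gone when terminate/2 runs, one way to at least avoid the crash shown above is to catch the `:noproc` exit around the call. This is a minimal sketch under that assumption: `SafeHandoff` and `attempt/1` are hypothetical names, not part of Swarm, and this does not fix the shutdown order; the worker's state is still not handed off in that case.

```elixir
defmodule SafeHandoff do
  @moduledoc false

  # Runs `fun` and turns a :noproc exit (the called process is not running)
  # into {:error, :noproc}. Any other exit is not caught and propagates.
  def attempt(fun) when is_function(fun, 0) do
    {:ok, fun.()}
  catch
    :exit, {:noproc, _} -> {:error, :noproc}
  end
end

# Inside a worker's terminate/2 this could wrap the handoff call, e.g.:
#   SafeHandoff.attempt(fn -> Swarm.Tracker.handoff(__MODULE__, state) end)
```

Calling a process that does not exist, for example `SafeHandoff.attempt(fn -> GenServer.call(:no_such_process, :ping) end)`, then yields `{:error, :noproc}` instead of exiting the caller.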

bitwalker / swarm

Feature: Graceful shutdown with handoff #83