bitwalker / swarm

Easy clustering, registration, and distribution of worker processes for Erlang/Elixir
MIT License

Feature: Graceful shutdown with handoff #83

Closed mbaeuerle closed 6 years ago

mbaeuerle commented 6 years ago

Intention:

Parts I am not sure about:

As @erikreedstrom mentioned, every worker can be extended like so:

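defmodule MySwarmWorker do
  use GenServer

  def init(state) do
    # Trap exits so that terminate/2 is invoked when the process is shut down.
    Process.flag(:trap_exit, true)
    #...
    {:ok, state}
  end

  #...

  def terminate(_reason, state) do
    # Hand the worker's current state over to the Swarm tracker before exiting.
    Swarm.Tracker.handoff(__MODULE__, state)
  end
end

tschmittni commented 6 years ago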

We have been using this feature for a couple of weeks now and it seems to work really well. There are a few race conditions which can occur during the handover, but that's somewhat expected.

In general, it would probably be better to change the design of Swarm to allow connecting and disconnecting nodes explicitly, rather than relying only on Erlang's connected nodes. But this should be a separate design discussion.

+1 to get this merged!

dazuma commented 6 years ago

I've also been experimenting with this branch, and it's been working fine for me. I may end up demoing it in my ElixirConf talk in September, so it would be great if this branch or something like it could be merged and released by then. :)

Edit: I ended up switching to Horde for my demo, so no hurry on this.

bitwalker commented 6 years ago

I think this mostly works. However, the one thing that I believe is going to be a problem is that when another node goes down, or a new one joins, the ring will be rebalanced, which means any "manually" moved registrations will be moved back to their origin (or potentially to a new node). If that seems acceptable, I think this can be merged.

Thoughts?
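
The rebalancing concern above can be illustrated with a minimal sketch. It does not use Swarm's actual ring implementation: `RingSketch.owner/2` is a hypothetical helper that picks an owner by rendezvous hashing, and the node names are made up. The only point is that when the owner is computed from the name and the current node list alone, a manual move is not taken into account the next time membership changes.

```elixir
defmodule RingSketch do
  # Hypothetical helper, not part of Swarm: the node with the highest hash of
  # {name, node} owns the name (rendezvous hashing).
  def owner(name, nodes) do
    Enum.max_by(nodes, fn node -> :erlang.phash2({name, node}) end)
  end
end

nodes = [:a@host, :b@host, :c@host]
computed = RingSketch.owner("swarm_worker_name", nodes)

# "Manually" move the registration to some node other than the computed owner.
manual = Enum.find(nodes, &(&1 != computed))

# A new node joins: the owner is recomputed from the node list alone, so the
# result is either the original computed owner or the new node, never `manual`.
after_join = RingSketch.owner("swarm_worker_name", nodes ++ [:d@host])

IO.inspect({computed, manual, after_join})
```

In this sketch the manual placement only survives until the next join or leave, which mirrors the behaviour described in the comment.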

beardedeagle commented 6 years ago

If @mbaeuerle is fine with what @bitwalker mentioned and the conflict is resolved, I see no issue with merging either.

mbaeuerle commented 6 years ago

@bitwalker If I understand correctly, the rebalancing is only an issue if the node doing the graceful shutdown is still online at that time. That should not happen if the intention is to shut down immediately after the handover.

arjan commented 6 years ago

@mbaeuerle Indeed, that's also how I use it now. However, with this API design, Swarm.Tracker.handoff/2 can be called at any time, not just at node shutdown.

To me this is fine; however, it might be good to mention this behaviour in the function's documentation.

mbaeuerle commented 6 years ago

I have now added documentation. Let me know what you think or if it can be written more clearly in any way.

alex88 commented 5 years ago

Is there any documentation of the full usage? I've tried this, and it seems that if I do it manually before the whole node is shut down, it works. If I gracefully shut down the node, I sometimes get:

** (stop) exited in: :gen_statem.call(Swarm.Tracker, {:handoff, "swarm_worker_name", %{matches: %{}}}, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (stdlib) gen.erl:228: :gen.do_for_proc/2
    (stdlib) gen_statem.erl:598: :gen_statem.call_dirty/4
    (stdlib) gen_server.erl:673: :gen_server.try_terminate/3
    (stdlib) gen_server.erl:858: :gen_server.terminate/10
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:EXIT, #PID<0.391.0>, :shutdown}

Other times the terminate function is called but nothing else happens.

Maybe the tracker can shut down before the worker?
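
If the tracker can indeed be gone by the time `terminate/2` runs, one way to at least avoid the crash is to catch the exit around the call. The following is a minimal sketch under that assumption: `SafeHandoff.attempt/1` is a hypothetical helper, not part of Swarm, and it does not make the handoff succeed. It only turns the `no process` exit into a return value.

```elixir
defmodule SafeHandoff do
  # Hypothetical helper, not part of Swarm. Runs `handoff_fun` and converts an
  # exit (for example :noproc when the called process is no longer alive) into
  # an error tuple instead of letting it crash the caller.
  def attempt(handoff_fun) when is_function(handoff_fun, 0) do
    try do
      {:ok, handoff_fun.()}
    catch
      :exit, reason -> {:error, reason}
    end
  end
end

# Demonstration with a name that is not registered, standing in for a tracker
# that has already stopped. The call exits with :noproc, which is caught.
result = SafeHandoff.attempt(fn -> GenServer.call(:no_such_process, :ping) end)
IO.inspect(result)
```

In a worker, the call from the earlier example would be wrapped as `SafeHandoff.attempt(fn -> Swarm.Tracker.handoff(__MODULE__, state) end)` inside `terminate/2`. The state is still lost in the error case, so this only masks the symptom; whether the shutdown order can be changed so that the tracker outlives the workers is a separate question.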