Closed mbaeuerle closed 6 years ago
We have been using this feature for a couple of weeks now and it seems to work really well. A few race conditions can occur during the handover, but that's somewhat expected.
In general, it would probably be better to change the design of Swarm to allow connecting and disconnecting nodes explicitly, rather than relying only on Erlang's connected nodes. But this should be a separate design discussion.
+1 to get this merged!
I've also been experimenting with this branch, and it's been working fine for me. I may end up demoing it in my ElixirConf talk in September, so it would be great if this branch or something like it could be merged and released by then. :)
Edit: I ended up switching to Horde for my demo, so no hurry on this.
I think this mostly works. However, the one thing I believe is going to be a problem is that when another node goes down, or a new one joins, the ring will be rebalanced, which means any "manually" moved registrations will be moved back to their origin (or potentially to a new node). If that seems acceptable, I think this can be merged.
Thoughts?
If @mbaeuerle is fine with what @bitwalker mentioned and the conflict is resolved, I see no issue with merging either.
@bitwalker If I understand correctly, the rebalancing is only an issue if the node doing the graceful shutdown is still online at that time. That should not happen if the intention is to shut down immediately after the handover.
@mbaeuerle indeed, that's also how I use it now. However, with this API design, Swarm.Tracker.handoff/2 can be called at any time, not just at node shutdown.
To me this is fine, however, it might be good to mention this behaviour in the function's documentation.
I have now added documentation. Let me know what you think or if it can be written more clearly in any way.
Is there any documentation of the full usage? I've tried this, and it seems to work if I do it manually before the whole node is shut down. If I gracefully shut down the node, I sometimes get:
** (stop) exited in: :gen_statem.call(Swarm.Tracker, {:handoff, "swarm_worker_name", %{matches: %{}}}, :infinity)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
(stdlib) gen.erl:228: :gen.do_for_proc/2
(stdlib) gen_statem.erl:598: :gen_statem.call_dirty/4
(stdlib) gen_server.erl:673: :gen_server.try_terminate/3
(stdlib) gen_server.erl:858: :gen_server.terminate/10
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:EXIT, #PID<0.391.0>, :shutdown}
Other times the terminate function is called but nothing else happens. Maybe the tracker can shut down before the worker?
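For reference, here is a minimal sketch of the kind of worker being discussed. It assumes the handoff is triggered from terminate/2 with :trap_exit set. HandoffWorker and handoff_fun are hypothetical names: handoff_fun stands in for the handoff call proposed in this PR (Swarm.Tracker.handoff/2 on this branch) and is injected so the sketch runs without Swarm. The sketch also catches the exit raised when the called process is already gone. That would be one way to survive the "no process" exit in the trace above if the tracker really does stop before the worker, which is not confirmed here.

```elixir
defmodule HandoffWorker do
  # Hypothetical worker for illustration only; not part of Swarm.
  use GenServer

  # `handoff_fun` is an assumption of this sketch: a 2-arity function
  # (name, state) standing in for the handoff call proposed in this PR.
  def start_link(name, handoff_fun) do
    GenServer.start_link(__MODULE__, {name, handoff_fun})
  end

  @impl true
  def init({name, handoff_fun}) do
    # Without trapping exits, terminate/2 is not invoked when the parent
    # (e.g. a supervisor) sends a :shutdown exit signal.
    Process.flag(:trap_exit, true)
    {:ok, %{name: name, handoff_fun: handoff_fun, data: %{}}}
  end

  @impl true
  def terminate(_reason, %{name: name, handoff_fun: handoff_fun, data: data}) do
    try do
      handoff_fun.(name, data)
    catch
      # Calling a process that is no longer alive exits with a :noproc
      # reason. The state cannot be handed off in that case, so the
      # worker just finishes shutting down instead of crashing.
      :exit, _reason -> :ok
    end
  end
end
```

With this shape, the worker still terminates cleanly when the handoff target is unavailable, but its state is lost in that case, so it does not replace making sure the tracker outlives the workers during shutdown.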
Intention:
[…] handoff but the naming is to be discussed). It takes the name of the worker to be moved and the current state.
[…] terminate/2 function when :trap_exit is flagged. When receiving a SIGTERM this should trigger the handoff of the current state.

Parts I am not sure about:
[…] Strategy.key_to_node/1 […]
[…] leaves any inconsistencies in the registry which leads to a broken state.

As @erikreedstrom mentioned, every worker can be extended like so: