Open markmeeus opened 6 years ago
@markmeeus did you ever resolve this? and if so, what was the solution?
I've noticed that I have the same problem when running a cluster of 2 nodes; when I run 3 nodes, the :begin_handoff and :resolve_conflict callbacks are called as expected.
Same here. The processes are untracked but are never restarted on the other node for some reason. None of the events are called. It seems from some other issues that you may need to handle graceful shutdowns manually: https://github.com/bitwalker/swarm/pull/83. But they're only talking about "handoffs", even though I just want to do a restart... So it's still confusing.
Hi,
Thanks for the great lib, awesome work!
We are building a system where we have to process a stream of messages from a remote system. To be more precise, the stream consists of update messages for specific resources.
So we are using Swarm to manage worker processes for these resources. We create the process when the first message arrives, and let it die from a GenServer timeout after a set interval.
We are now in the process of testing the app for a network split situation, and it looks like something is a bit off, or at least, we are a bit confused :-)
First of all, we are using the Distribution.Ring since we will be deploying on an OpenShift instance (Kubernetes). We are testing this by simply connecting 2 iex nodes running our app with Node.connect/1 and Node.disconnect/1.
What happens is the following:
-> When the 2 nodes are connected, a new message for a new resource creates a process on one of the 2 nodes. Perfect.
-> When the 2 nodes are disconnected, all processes that existed before the disconnect are started (if they were not already running) on both machines. Still makes sense.
-> When the 2 are connected again, the processes keep running on one of the nodes and the processes on the other node receive a {:swarm, :die}. Strange, we were actually expecting them to be relocated; the Ring uses a consistent hash?
While the nodes were disconnected, both processes could have processed messages, and they may be in a conflicting state. We have a way to actually resolve these conflicts and we were counting on the :resolve_conflict message to inform us about this situation.
However, none of the :begin_handoff, :end_handoff or :resolve_conflict messages are received on either node. Our begin_handoff returns a :resume tuple, but since it is not called ...
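For context, here is a minimal sketch of the kind of worker we mean, with the handoff and conflict clauses written in the shape shown in the Swarm README. The module name ResourceWorker, the state map, the 60 second idle timeout and the merge rule in :resolve_conflict are assumptions made up for illustration, not our actual code. The sketch does not call Swarm itself; it only shows the GenServer clauses that Swarm is expected to invoke.

```elixir
defmodule ResourceWorker do
  # Hypothetical worker for one resource; names and state shape are illustrative.
  use GenServer

  # Assumed idle interval after which the worker stops itself.
  @idle_timeout 60_000

  # With Swarm this would typically be started through Swarm.register_name/4,
  # which expects the given function to return {:ok, pid}.
  def start_link(resource_id), do: GenServer.start_link(__MODULE__, resource_id)

  @impl true
  def init(resource_id), do: {:ok, %{id: resource_id, updates: []}, @idle_timeout}

  # Regular update message for this resource (newest first).
  @impl true
  def handle_cast({:update, msg}, state) do
    {:noreply, %{state | updates: [msg | state.updates]}, @idle_timeout}
  end

  # Sent to the new process after a handoff, carrying the old process's state.
  def handle_cast({:swarm, :end_handoff, handed_off}, _state) do
    {:noreply, handed_off, @idle_timeout}
  end

  # Sent to the surviving process after a netsplit heals, carrying the state
  # of the duplicate. The merge rule here is an assumption for illustration.
  def handle_cast({:swarm, :resolve_conflict, other}, state) do
    merged = Enum.uniq(state.updates ++ other.updates)
    {:noreply, %{state | updates: merged}, @idle_timeout}
  end

  # Asked before the process is moved; :resume hands the state to the new node.
  @impl true
  def handle_call({:swarm, :begin_handoff}, _from, state) do
    {:reply, {:resume, state}, state}
  end

  # Helper for inspecting the state (not part of Swarm).
  def handle_call(:state, _from, state), do: {:reply, state, state, @idle_timeout}

  # Sent to the duplicate process that has to go away.
  @impl true
  def handle_info({:swarm, :die}, state), do: {:stop, :shutdown, state}

  # GenServer idle timeout: stop normally when no message arrived in time.
  def handle_info(:timeout, state), do: {:stop, :normal, state}
end
```

In our test only the {:swarm, :die} clause is ever hit; the three clauses for :begin_handoff, :end_handoff and :resolve_conflict are the ones that never fire.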
Could you help us out?
Here are some debug logs of the disconnect/connect:
Node A "attila" => disconnect
Node B "mark" => received disconnect
... Some processing going on while nodes are disconnected ...
Node A "attila" => connect with Node B
Node B "mark" => receive 2nd connect