I am seeing an issue that crops up when we deploy our code: conflicting registries between two nodes are never resolved. I'm using the StaticQuorumRing strategy, and our deployments drain a node, bring a new node back up, and repeat. We're using Swarm to make sure exactly one instance of each process is running in the cluster, so our registries look like this:
So the issue here is that the remote process listed for profile_kv_repo didn't exist. But the code that handles these sorts of conflicts here does nothing if the conflicting node is remote, which I believe sets it up for a loop of trying to start that particular process again. Running Swarm.unregister_name({Commanded.Event.Handler, "profile_kv_repo"}) on the node with the bad entry resolved the issue.
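A rough way to spot entries like this from a remote shell is to walk the registry and check each pid against the node that owns it. This is only a sketch: it assumes Swarm.registered/0 returns {name, pid} pairs, and the StaleRegistry module and its functions are hypothetical helpers, not part of Swarm.

```elixir
defmodule StaleRegistry do
  # Hypothetical helper module; not part of Swarm.

  # Ask the node that owns the pid whether it is still alive.
  # Process.alive?/1 only accepts local pids, hence the rpc for remote ones.
  def alive?(pid) when is_pid(pid) do
    if node(pid) == node() do
      Process.alive?(pid)
    else
      # Any failure ({:badrpc, reason}) is treated as "not alive". During a
      # netsplit this would misreport healthy processes, so use with care.
      :rpc.call(node(pid), Process, :alive?, [pid]) == true
    end
  end

  # Names whose registered pid fails the liveness check.
  def stale_entries(entries, alive_fun \\ &alive?/1) do
    for {name, pid} <- entries, not alive_fun.(pid), do: name
  end

  # Unregister every stale name, as seen from the node this runs on.
  def cleanup do
    Swarm.registered()
    |> stale_entries()
    |> Enum.each(&Swarm.unregister_name/1)
  end
end
```

Running StaleRegistry.stale_entries(Swarm.registered()) first, without the cleanup step, shows which names would be dropped before anything is changed.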
So a couple of questions:
How did it get in this state? Is there a deployment process we should follow to fix this?
Is there something that Swarm can do to resolve this issue? Maybe when Swarm detects a process is down, it can remove its registration for that pid and then attempt to restart?
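To make the second question concrete, here is roughly the behaviour I have in mind, written as code that sits outside Swarm. It is a sketch, not a proposal for how Swarm should implement it internally: RegistrationRepair, decide/2 and repair/3 are hypothetical names, alive_fun is any one-argument liveness check supplied by the caller, and the {mod, fun, args} tuple stands in for however the process is normally started.

```elixir
defmodule RegistrationRepair do
  # Hypothetical module; only Swarm.whereis_name/1, Swarm.unregister_name/1
  # and Swarm.register_name/4 are Swarm calls.

  # Pure decision step, given the result of Swarm.whereis_name/1.
  def decide(:undefined, _alive_fun), do: :register

  def decide(pid, alive_fun) when is_pid(pid) do
    if alive_fun.(pid), do: :keep, else: :replace
  end

  # Drop a registration that points at a dead pid, then start the process
  # again through Swarm so it is placed according to the ring.
  def repair(name, {mod, fun, args}, alive_fun) do
    pid = Swarm.whereis_name(name)

    case decide(pid, alive_fun) do
      :keep ->
        {:ok, pid}

      :register ->
        Swarm.register_name(name, mod, fun, args)

      :replace ->
        Swarm.unregister_name(name)
        Swarm.register_name(name, mod, fun, args)
    end
  end
end
```

The liveness check is passed in because Process.alive?/1 only works for local pids; for a remote pid it would have to ask the owning node, for example via :rpc.call/4.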
Sometimes, two nodes seem to get into an infinite loop of trying to stop and restart one of those processes.
And the above goes on and on...
Remote shelling in, I was able to see that both nodes are communicating, but one node has a registry like this: