PR #839 is related to this.
Since there is no API to 'demote' a swarm leader, the easiest way to accomplish this is to have the leader node destroy itself as the very last node of the rolling update. This will force another node to assume leadership, and that new leader will then provision a new manager node, thereby completing the rolling update.
Can we use this issue to track the testing of this behavior?
When a resource (e.g. vm instance) is destroyed, the following steps are true:

1. `Drain` on the resource is called.
2. `Destroy` is called to delete the resource.

It turns out `Drain` in the swarm manager flavor isn't implemented. So that's a place where we could do a `docker swarm leave` on the node... Then the `Destroy` on the resource (the "self" node) is a no-op. This way another node could take over as leader and continue with a real `Destroy`.
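As a concrete starting point, here is a minimal sketch of what that `Drain` could execute on the node being drained (assuming the flavor can shell out to the docker CLI on that node; this is not existing plugin code):

```sh
# Minimal sketch only, assuming the docker CLI is available on the node being
# drained; this is not the actual swarm manager flavor implementation.

# Swarm node id of this node ("self").
SELF_ID=$(docker info --format '{{.Swarm.NodeID}}')

# If this node is a manager, demote it first so the remaining managers can
# elect a new leader.
if [ "$(docker info --format '{{.Swarm.ControlAvailable}}')" = "true" ]; then
  docker node demote "$SELF_ID"
fi

# Leave the swarm; the subsequent Destroy of the instance is then a no-op as
# far as swarm membership is concerned.
docker swarm leave
```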
I need to verify the behavior of the swarm managers via this sequence to make sure a new manager can join even if it has all the /var/lib/docker state from a node that technically has 'left'. If the new node can join without problems then this may be an easy implementation.
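A manual test for this could look roughly like the following (a sketch, assuming shell access to the node that left and to a surviving manager; `<manager_token>` is the join token of this swarm):

```sh
# On the old leader: demote self and leave, keeping /var/lib/docker intact.
docker node demote "$(docker info --format '{{.Swarm.NodeID}}')"
docker swarm leave

# Still on that node, without wiping /var/lib/docker, try to rejoin as a
# manager through one of the surviving managers.
docker swarm join --token <manager_token> 172.31.20.102:2377

# On any manager: confirm the node comes back as Ready / Reachable.
docker node ls
```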
The comments for #840 on using tombstones (with symlinks) are still relevant in helping the terraform plugin recover from a crash during resource deletion.
This is the protocol for handling the last node (the current leader):
Let's say we have:

```
ID                            HOSTNAME           STATUS   AVAILABILITY   MANAGER STATUS
y3mzqmkhjumbm80c9k628wcjt *   ip-172-31-20-100   Ready    Active         Leader
0anmdnze0p3fdky9egdsmnut5     ip-172-31-20-101   Ready    Active         Reachable
hxv6q6cej095lw66cdp2r48oh     ip-172-31-20-102   Ready    Active         Reachable
```

The current leader is `ip-172-31-20-100`; the other managers are `ip-172-31-20-101` and `ip-172-31-20-102`. On the leader, we demote it and leave the swarm:

```
root@ip-172-31-20-100:~# docker node demote y3mzqmkhjumbm80c9k628wcjt
Manager y3mzqmkhjumbm80c9k628wcjt demoted in the swarm.
root@ip-172-31-20-100:~# docker swarm leave
Node left the swarm.
```
At this point, on another node:
```
root@ip-172-31-20-102:~# docker node ls
ID                            HOSTNAME           STATUS   AVAILABILITY   MANAGER STATUS
y3mzqmkhjumbm80c9k628wcjt     ip-172-31-20-100   Down     Active
0anmdnze0p3fdky9egdsmnut5     ip-172-31-20-101   Ready    Active         Leader
hxv6q6cej095lw66cdp2r48oh *   ip-172-31-20-102   Ready    Active         Reachable
```

Now `ip-172-31-20-102` is the leader. At this point, the cluster is still running fine, but we have used up our fault tolerance of 1 node, so we'd need to bring up a new manager to join the quorum soon. Suppose a new node is provisioned (`ip-172-31-20-103`). On this new node:

```
root@ip-172-31-20-103:~# docker swarm join --token SWMTKN-1-4btcqpypxxd24t194hihgaf7sy8py76gktzfx1dpasq9umjo08-1g1im49494f8nc3mr8mf6yybh 172.31.20.102:2377
This node joined a swarm as a manager.
```
Back on the leader `ip-172-31-20-102`:

```
root@ip-172-31-20-102:~# docker node ls
ID                            HOSTNAME           STATUS   AVAILABILITY   MANAGER STATUS
y3mzqmkhjumbm80c9k628wcjt     ip-172-31-20-100   Down     Active
0anmdnze0p3fdky9egdsmnut5     ip-172-31-20-101   Ready    Active         Leader
hxv6q6cej095lw66cdp2r48oh *   ip-172-31-20-102   Ready    Active         Reachable
n88dc4rlelwz5o53utz619ip8     ip-172-31-20-103   Ready    Active         Reachable
```
At this point, we have restored the 3-node quorum. However, we need to do some clean up for the node `y3mzqmkhjumbm80c9k628wcjt` (which was the previous leader and where we originally did the `demote` and `swarm leave`). So from the leader node:

```
root@ip-172-31-20-102:~# docker node rm y3mzqmkhjumbm80c9k628wcjt
y3mzqmkhjumbm80c9k628wcjt
root@ip-172-31-20-102:~# docker node ls
ID                            HOSTNAME           STATUS   AVAILABILITY   MANAGER STATUS
0anmdnze0p3fdky9egdsmnut5     ip-172-31-20-101   Ready    Active         Leader
hxv6q6cej095lw66cdp2r48oh *   ip-172-31-20-102   Ready    Active         Reachable
n88dc4rlelwz5o53utz619ip8     ip-172-31-20-103   Ready    Active         Reachable
```
Now we have a clean cluster with all managers successfully updated.
Note that you cannot `docker node rm` from a non-manager node, but from any node (worker or manager) you can do a `docker swarm leave` (assuming a `docker node demote` is first applied if the node is a manager).
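For illustration (a sketch using a `<node_id>` placeholder, in the same spirit as the summary below):

```sh
# On a worker (non-manager) node: node removal is a manager-only command,
# so this is expected to be rejected.
docker node rm <node_id>

# On any node, worker or manager: the node can remove itself from the swarm.
# If it is a manager, demote it first (as in the walkthrough above).
docker node demote <node_id>
docker swarm leave
```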
So to summarize, this is the behavior we want:

1. Initial provisioning of the managers:
   a. `docker swarm init` for the first manager node, which becomes the leader.
   b. `docker swarm join --token <manager_token> <ip>` for the followers.
2. Rolling update of the final manager node (the current leader, i.e. self):
   a. `docker node demote <id>`, where `<id>` is the Swarm node id of the current leader node (self).
   b. `docker swarm leave`
3. Workers: `docker swarm join --token <worker_token> <ip>` for all workers; on update, just `docker swarm leave` (no demotion necessary).
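In shell terms, steps 1 and 3 above look roughly like this (the `--advertise-addr` value and the use of `docker swarm join-token` to obtain the tokens are assumptions for this sketch, not something the plugin currently does):

```sh
# On the first manager (1a): bootstrap the swarm; this node becomes the leader.
docker swarm init --advertise-addr <ip>

# On the leader: fetch the join tokens for managers and workers.
MANAGER_TOKEN=$(docker swarm join-token -q manager)
WORKER_TOKEN=$(docker swarm join-token -q worker)

# On each follower manager (1b):
docker swarm join --token "$MANAGER_TOKEN" <ip>:2377

# On each worker (3): join with the worker token; when a worker is updated it
# simply does `docker swarm leave`, no demotion needed.
docker swarm join --token "$WORKER_TOKEN" <ip>:2377
```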
Now this doesn't account for the clean up with `docker node rm <old_leader_id>` to remove the old leader node, which is now considered to be in the `Down` state. I think this clean up mechanism could be implemented separately as another continuous process that just does a `docker node rm` on anything that is in the `Down` status and has no actual VM instance with the `link` tag. We could even generalize this into a reaper that can garbage collect running vm instances that somehow have no corresponding `docker node ls` entries (failed to join the cluster), as well as those `Down` entries in the Swarm with no corresponding vm instances (the case with the old leader). To keep the scope of this issue manageable, I'd leave garbage collection as a separate issue or PR.
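As a very rough sketch, the swarm-side half of such a reaper could be the loop below (the cross-check against VM instances with the `link` tag is only noted in a comment, since it depends on the instance plugin; none of this is existing plugin code):

```sh
# Sketch of a reaper loop run on the current leader: remove swarm entries
# that are Down. Cross-checking against actual VM instances (the `link` tag)
# is omitted here and would need the instance plugin.
while true; do
  docker node ls --format '{{.ID}} {{.Status}}' |
    awk '$2 == "Down" {print $1}' |
    while read -r node_id; do
      # TODO: skip removal if a VM instance with a matching `link` tag exists.
      docker node rm "$node_id"
    done
  sleep 60
done
```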
This issue is to track the remaining work for manager rolling updates.
See https://github.com/docker/infrakit/issues/782#issuecomment-359080561
The idea is that we can update all non-leader manager nodes and then, prior to updating `self`, we need to force a leadership change. When a new leader is elected, it will detect an in-progress update and complete the update on the remaining node (the "old" leader).