PR #839 is related to this.
Since there is no API to 'demote' a swarm leader, the easiest way to accomplish this is to have the leader node destroy itself as the very last node of the rolling update. This will force another node to assume leadership, and that new leader will then provision a new manager node, thereby completing the rolling update.
Can we use this issue to track the testing of this behavior?
When a resource (e.g. vm instance) is destroyed, the following steps are true:

1. `Drain` on the resource is called.
2. `Destroy` is called to delete the resource.

It turns out `Drain` in the swarm manager flavor isn't implemented. So that's a place where we could do a `docker swarm leave` on the node... Then the `Destroy` on the resource (the "self" node) is a no-op. This way another node could take over as leader and continue with a real `Destroy`.
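As a concrete starting point, here is a minimal sketch of what that `Drain` could execute on the node being drained (assuming the flavor can shell out to the docker CLI on that node; this is not existing plugin code):

```sh
# Minimal sketch only, assuming the docker CLI is available on the node being
# drained; this is not the actual swarm manager flavor implementation.

# Swarm node id of this node ("self").
SELF_ID=$(docker info --format '{{.Swarm.NodeID}}')

# If this node is a manager, demote it first so the remaining managers can
# elect a new leader.
if [ "$(docker info --format '{{.Swarm.ControlAvailable}}')" = "true" ]; then
  docker node demote "$SELF_ID"
fi

# Leave the swarm; the subsequent Destroy of the instance is then a no-op as
# far as swarm membership is concerned.
docker swarm leave
```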
I need to verify the behavior of the swarm managers via this sequence to make sure a new manager can join even if it has all the /var/lib/docker state from a node that technically has 'left'. If the new node can join without problems then this may be an easy implementation.
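A manual test for this could look roughly like the following (a sketch, assuming shell access to the node that left and to a surviving manager; `<manager_token>` is the join token of this swarm):

```sh
# On the old leader: demote self and leave, keeping /var/lib/docker intact.
docker node demote "$(docker info --format '{{.Swarm.NodeID}}')"
docker swarm leave

# Still on that node, without wiping /var/lib/docker, try to rejoin as a
# manager through one of the surviving managers.
docker swarm join --token <manager_token> 172.31.20.102:2377

# On any manager: confirm the node comes back as Ready / Reachable.
docker node ls
```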
The comments for #840 on using tombstones (with symlinks) are still relevant in helping the terraform plugin recover from a crash during resource deletion.
This is the protocol for handling the last node (the current leader):
Let's say we have:

```
ID                            HOSTNAME           STATUS   AVAILABILITY   MANAGER STATUS
y3mzqmkhjumbm80c9k628wcjt *   ip-172-31-20-100   Ready    Active         Leader
0anmdnze0p3fdky9egdsmnut5     ip-172-31-20-101   Ready    Active         Reachable
hxv6q6cej095lw66cdp2r48oh     ip-172-31-20-102   Ready    Active         Reachable
```

The current leader is `ip-172-31-20-100`; the other managers are `ip-172-31-20-101` and `ip-172-31-20-102`. On the leader, we demote it and leave the swarm:

```
root@ip-172-31-20-100:~# docker node demote y3mzqmkhjumbm80c9k628wcjt
Manager y3mzqmkhjumbm80c9k628wcjt demoted in the swarm.
root@ip-172-31-20-100:~# docker swarm leave
Node left the swarm.
```
At this point, on another node:
```
root@ip-172-31-20-102:~# docker node ls
ID                            HOSTNAME           STATUS   AVAILABILITY   MANAGER STATUS
y3mzqmkhjumbm80c9k628wcjt     ip-172-31-20-100   Down     Active
0anmdnze0p3fdky9egdsmnut5     ip-172-31-20-101   Ready    Active         Leader
hxv6q6cej095lw66cdp2r48oh *   ip-172-31-20-102   Ready    Active         Reachable
```

Now `ip-172-31-20-102` is the leader. At this point, the cluster is still running fine, but we have used up our fault tolerance of 1 node, so we'd need to bring up a new manager to join the quorum soon. Suppose a new node is provisioned (`ip-172-31-20-103`). On this new node:

```
root@ip-172-31-20-103:~# docker swarm join --token SWMTKN-1-4btcqpypxxd24t194hihgaf7sy8py76gktzfx1dpasq9umjo08-1g1im49494f8nc3mr8mf6yybh 172.31.20.102:2377
This node joined a swarm as a manager.
```
Back on the leader `ip-172-31-20-102`:

```
root@ip-172-31-20-102:~# docker node ls
ID                            HOSTNAME           STATUS   AVAILABILITY   MANAGER STATUS
y3mzqmkhjumbm80c9k628wcjt     ip-172-31-20-100   Down     Active
0anmdnze0p3fdky9egdsmnut5     ip-172-31-20-101   Ready    Active         Leader
hxv6q6cej095lw66cdp2r48oh *   ip-172-31-20-102   Ready    Active         Reachable
n88dc4rlelwz5o53utz619ip8     ip-172-31-20-103   Ready    Active         Reachable
```
At this point, we have restored the 3-node quorum. However, we need to do some clean up for the node `y3mzqmkhjumbm80c9k628wcjt` (which was the previous leader and where we originally did the `demote` and `swarm leave`). So from the leader node:

```
root@ip-172-31-20-102:~# docker node rm y3mzqmkhjumbm80c9k628wcjt
y3mzqmkhjumbm80c9k628wcjt
root@ip-172-31-20-102:~# docker node ls
ID                            HOSTNAME           STATUS   AVAILABILITY   MANAGER STATUS
0anmdnze0p3fdky9egdsmnut5     ip-172-31-20-101   Ready    Active         Leader
hxv6q6cej095lw66cdp2r48oh *   ip-172-31-20-102   Ready    Active         Reachable
n88dc4rlelwz5o53utz619ip8     ip-172-31-20-103   Ready    Active         Reachable
```
Now we have a clean cluster with all managers successfully updated.
Note that you cannot `docker node rm` from a non-manager node, but from any node (worker or manager) you can do a `docker swarm leave` (assuming a `docker node demote` is first applied if the node is a manager).
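For illustration (a sketch using a `<node_id>` placeholder, in the same spirit as the summary below):

```sh
# On a worker (non-manager) node: node removal is a manager-only command,
# so this is expected to be rejected.
docker node rm <node_id>

# On any node, worker or manager: the node can remove itself from the swarm.
# If it is a manager, demote it first (as in the walkthrough above).
docker node demote <node_id>
docker swarm leave
```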
So to summarize, this is the behavior we want:

1. Initial provisioning of the managers:
   a. `docker swarm init` for the first manager node, which becomes the leader.
   b. `docker swarm join --token <manager_token> <ip>` for the followers.
2. Rolling update of the final manager node (the current leader, i.e. self):
   a. `docker node demote <id>`, where `<id>` is the Swarm node id of the current leader node (self).
   b. `docker swarm leave`
3. Workers: `docker swarm join --token <worker_token> <ip>` for all workers; on update, just `docker swarm leave` (no demotion necessary).
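In shell terms, steps 1 and 3 above look roughly like this (the `--advertise-addr` value and the use of `docker swarm join-token` to obtain the tokens are assumptions for this sketch, not something the plugin currently does):

```sh
# On the first manager (1a): bootstrap the swarm; this node becomes the leader.
docker swarm init --advertise-addr <ip>

# On the leader: fetch the join tokens for managers and workers.
MANAGER_TOKEN=$(docker swarm join-token -q manager)
WORKER_TOKEN=$(docker swarm join-token -q worker)

# On each follower manager (1b):
docker swarm join --token "$MANAGER_TOKEN" <ip>:2377

# On each worker (3): join with the worker token; when a worker is updated it
# simply does `docker swarm leave`, no demotion needed.
docker swarm join --token "$WORKER_TOKEN" <ip>:2377
```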
Now this doesn't account for the clean up with `docker node rm <old_leader_id>` to remove the old leader node, which is now considered to be in the `Down` state. I think this clean up mechanism could be implemented separately as another continuous process that just does a `docker node rm` on anything that is in the `Down` status and has no actual VM instance with the `link` tag. We could even generalize this into a reaper that can garbage collect running vm instances that somehow have no corresponding `docker node ls` entries (failed to join the cluster), as well as those `Down` entries in the Swarm with no corresponding vm instances (the case with the old leader). To keep the scope of this issue manageable, I'd leave garbage collection as a separate issue or PR.
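As a very rough sketch, the swarm-side half of such a reaper could be the loop below (the cross-check against VM instances with the `link` tag is only noted in a comment, since it depends on the instance plugin; none of this is existing plugin code):

```sh
# Sketch of a reaper loop run on the current leader: remove swarm entries
# that are Down. Cross-checking against actual VM instances (the `link` tag)
# is omitted here and would need the instance plugin.
while true; do
  docker node ls --format '{{.ID}} {{.Status}}' |
    awk '$2 == "Down" {print $1}' |
    while read -r node_id; do
      # TODO: skip removal if a VM instance with a matching `link` tag exists.
      docker node rm "$node_id"
    done
  sleep 60
done
```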
This issue is to track the remaining work for manager rolling updates.
See https://github.com/docker/infrakit/issues/782#issuecomment-359080561
The idea is that we can update all non-leader manager nodes and then, prior to updating `self`, we need to force a leadership change. When a new leader is elected, it will detect an in-progress update and complete the update on the remaining node (the "old" leader).