Open · dgageot opened 7 years ago
I can confirm the same behaviour with the latest Swarm flavour plugin.
Yes, I am able to reproduce this too.
I think this is a case that was not covered (a bug). Previously we had considered only the node instances where InfraKit is not running -- i.e. the worker nodes. To prevent runaway or unstoppable updates (an update may have gone wrong and the operator wants to stop everything), we had made the decision that if the group controller is stopped at any point during an update, that group controller will not resume updating when it is started back up. This seemed a reasonable solution because once the operator had a chance to fix the problem and wants the update to continue, she would just commit again.
Now the same logic is applied to managers, and things aren't so great. Because the leader's group controller is the one that matters, if it gets interrupted before the update of all the managers has completed, the update will not continue without another manual commit... So when the leader node itself is killed in order to provision a new instance and another node picks up leadership (on failover), the new leader does nothing, because it thinks it was interrupted and restarted for some reason, possibly because the update had gone wrong. This would explain what you are observing -- the leader node is terminated, its replacement comes back up, and nothing happens (while another node has taken up leadership and paused).
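The policy described above boils down to a check at startup. Here is a minimal sketch for illustration only -- the type and function names are hypothetical, not InfraKit's actual controller code:

```go
package main

import "fmt"

// groupController is a hypothetical stand-in for a group controller's
// state; only the field needed to illustrate the policy is shown.
type groupController struct {
	unfinishedUpdate bool // true if an update was in flight when we stopped
}

// onStart sketches the current behavior: a controller that starts (or takes
// over leadership) and finds an unfinished update assumes the interruption
// may have been deliberate, so it pauses instead of resuming.
func (g *groupController) onStart() {
	if g.unfinishedUpdate {
		// Possibly an update gone wrong that the operator stopped on
		// purpose: do nothing until the operator commits again.
		fmt.Println("unfinished update detected: pausing until next commit")
		return
	}
	fmt.Println("no update in flight: watching group as usual")
}

func main() {
	// A freshly elected leader after the old leader was destroyed
	// mid-update: it pauses, which is exactly the reported symptom.
	newLeader := &groupController{unfinishedUpdate: true}
	newLeader.onStart()
}
```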
This is a problem as long as we assume an update is a blocking process that needs to run to completion, rather than something that can be picked up intermittently unless the user explicitly cancels the update operation.
The solution here, IMHO, is to reverse the policy -- make the update an inherently nonblocking, cancelable process. The default behavior would then be to continue the update when the group controller is restarted, unless there is a signal from the user to cancel it. This also means we'd have to make sure the signal to cancel an update (if issued by the user) is persisted, so that a leader coming back online honors the user's intent and halts the update.
Ideas on implementing this:
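As a sketch of what a persisted cancel signal could look like (all names here are hypothetical, not InfraKit's actual API, and a real implementation would persist the marker in the same replicated store as the group spec rather than on local disk):

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// cancelMarker names the file that records a user's explicit intent to
// cancel an update for a group. Because it is persisted, a newly elected
// leader can see it even after the old leader was destroyed.
func cancelMarker(dir, groupID string) string {
	return filepath.Join(dir, groupID+".cancel")
}

// CancelUpdate persists the user's cancel intent for a group.
func CancelUpdate(dir, groupID string) error {
	return os.WriteFile(cancelMarker(dir, groupID), []byte("canceled"), 0o644)
}

// ResumeOrHalt is what a group controller would call when it restarts or
// becomes leader: continue the in-flight update unless a cancel was persisted.
func ResumeOrHalt(dir, groupID string, resume func() error) error {
	if _, err := os.Stat(cancelMarker(dir, groupID)); err == nil {
		return errors.New("update canceled by user; not resuming")
	} else if !os.IsNotExist(err) {
		return err // storage error: fail safe rather than guess
	}
	return resume() // default: pick the rolling update back up
}

func main() {
	dir := os.TempDir()
	// A new leader comes online mid-update: with no persisted cancel,
	// it resumes instead of pausing.
	if err := ResumeOrHalt(dir, "managers", func() error {
		fmt.Println("resuming rolling update of group 'managers'")
		return nil
	}); err != nil {
		fmt.Println(err)
	}
}
```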
I have a Swarm with 3 managers and 3 workers. All managers run the Swarm flavor plugin, with two committed groups: workers and managers. All nodes but the leader were created by InfraKit to scale both groups.
I'm then trying to update the groups with a change that requires all the nodes to be recreated (FTR, the managers will keep their disks, though). This triggers a rolling update on both workers and managers. If the first manager to be recreated is the leader, then the rolling update stops and is not carried on by the next elected leader. My Swarm ends up with only two nodes updated.
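For illustration, here is a toy simulation of that failure mode (not InfraKit code; the node names and policy are made up to match the description above). The rolling update halts as soon as the node driving it is itself destroyed, leaving the rest of the group on the old configuration:

```go
package main

import "fmt"

func main() {
	// Hypothetical manager group: the update is driven from the leader,
	// and the leader is among the nodes scheduled for recreation.
	nodes := []string{"mgr-1 (leader)", "mgr-2", "mgr-3"}
	updated := 0

	for _, n := range nodes {
		fmt.Println("recreating", n)
		updated++
		if n == "mgr-1 (leader)" {
			// Destroying the leader kills the controller driving the
			// update; under the current policy the next leader pauses
			// instead of resuming, so the rollout effectively stops here.
			fmt.Println("leader destroyed: new leader pauses, update halts")
			break
		}
	}
	fmt.Printf("updated %d of %d nodes\n", updated, len(nodes))
}
```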
This is with the Swarm flavour plugin < 0.3.0. I couldn't test this scenario with the latest Swarm plugin because I hit another bug with it.
cc @chungers