Open · dgageot opened 7 years ago
I can confirm the same behaviour with the latest Swarm flavour plugin.
Yes, I am able to reproduce this too.
I think this is a case that was not covered (a bug). Previously we had considered only the node instances where InfraKit is not running -- i.e. the worker nodes. To prevent runaway or unstoppable updates (an update may have gone wrong and the operator wants to stop everything), we had made the decision that if the group controller is stopped at any point during an update, that group controller will not resume updating when it is started back up. This seemed a reasonable solution because once the operator had a chance to fix the problem and wants the update to continue, she would just commit again.
Now the same logic is applied to managers, and things aren't so great. Because the leader's group controller is the one that matters, if it gets interrupted before the update of all the managers has completed, the update will not continue without another manual commit... So when the leader node itself is killed in order to provision a new instance and another node picks up leadership (on failover), the new leader does nothing, because it thinks it was interrupted and restarted for some reason, possibly because the update had gone wrong. This would explain what you are observing -- the leader node is terminated, its replacement comes back up, and nothing happens (while another node has taken up leadership and paused).
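The policy described above boils down to a check at startup. Here is a minimal sketch for illustration only -- the type and function names are hypothetical, not InfraKit's actual controller code:

```go
package main

import "fmt"

// groupController is a hypothetical stand-in for a group controller's
// state; only the field needed to illustrate the policy is shown.
type groupController struct {
	unfinishedUpdate bool // true if an update was in flight when we stopped
}

// onStart sketches the current behavior: a controller that starts (or takes
// over leadership) and finds an unfinished update assumes the interruption
// may have been deliberate, so it pauses instead of resuming.
func (g *groupController) onStart() {
	if g.unfinishedUpdate {
		// Possibly an update gone wrong that the operator stopped on
		// purpose: do nothing until the operator commits again.
		fmt.Println("unfinished update detected: pausing until next commit")
		return
	}
	fmt.Println("no update in flight: watching group as usual")
}

func main() {
	// A freshly elected leader after the old leader was destroyed
	// mid-update: it pauses, which is exactly the reported symptom.
	newLeader := &groupController{unfinishedUpdate: true}
	newLeader.onStart()
}
```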
This is a problem as long as we assume an update is a blocking process that needs to run to completion, rather than something that can be picked up intermittently unless the user explicitly cancels the update operation.
The solution here, IMHO, is to reverse the policy -- make the update an inherently nonblocking, cancelable process. The default behavior would then be to continue the update when the group controller is restarted, unless there is a signal from the user to cancel it. This also means we'd have to make sure the signal to cancel an update (if issued by the user) is persisted, so that a leader coming back online honors the user's intent and halts the update.
Ideas on implementing this:
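As a sketch of what a persisted cancel signal could look like (all names here are hypothetical, not InfraKit's actual API, and a real implementation would persist the marker in the same replicated store as the group spec rather than on local disk):

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// cancelMarker names the file that records a user's explicit intent to
// cancel an update for a group. Because it is persisted, a newly elected
// leader can see it even after the old leader was destroyed.
func cancelMarker(dir, groupID string) string {
	return filepath.Join(dir, groupID+".cancel")
}

// CancelUpdate persists the user's cancel intent for a group.
func CancelUpdate(dir, groupID string) error {
	return os.WriteFile(cancelMarker(dir, groupID), []byte("canceled"), 0o644)
}

// ResumeOrHalt is what a group controller would call when it restarts or
// becomes leader: continue the in-flight update unless a cancel was persisted.
func ResumeOrHalt(dir, groupID string, resume func() error) error {
	if _, err := os.Stat(cancelMarker(dir, groupID)); err == nil {
		return errors.New("update canceled by user; not resuming")
	} else if !os.IsNotExist(err) {
		return err // storage error: fail safe rather than guess
	}
	return resume() // default: pick the rolling update back up
}

func main() {
	dir := os.TempDir()
	// A new leader comes online mid-update: with no persisted cancel,
	// it resumes instead of pausing.
	if err := ResumeOrHalt(dir, "managers", func() error {
		fmt.Println("resuming rolling update of group 'managers'")
		return nil
	}); err != nil {
		fmt.Println(err)
	}
}
```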
I have a Swarm with 3 managers and 3 workers. All managers run the Swarm flavor plugin, with two committed groups: workers and managers. All nodes but the leader were created by InfraKit to scale both groups.
I'm then trying to update the groups with a change that requires all the nodes to be recreated (FTR, the managers will keep their disks, though). This triggers a rolling update on both workers and managers. If the first manager to be recreated is the leader, then the rolling update stops and is not carried on by the next elected leader. My Swarm ends up with only two nodes updated.
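For illustration, here is a toy simulation of that failure mode (not InfraKit code; the node names and policy are made up to match the description above). The rolling update halts as soon as the node driving it is itself destroyed, leaving the rest of the group on the old configuration:

```go
package main

import "fmt"

func main() {
	// Hypothetical manager group: the update is driven from the leader,
	// and the leader is among the nodes scheduled for recreation.
	nodes := []string{"mgr-1 (leader)", "mgr-2", "mgr-3"}
	updated := 0

	for _, n := range nodes {
		fmt.Println("recreating", n)
		updated++
		if n == "mgr-1 (leader)" {
			// Destroying the leader kills the controller driving the
			// update; under the current policy the next leader pauses
			// instead of resuming, so the rollout effectively stops here.
			fmt.Println("leader destroyed: new leader pauses, update halts")
			break
		}
	}
	fmt.Printf("updated %d of %d nodes\n", updated, len(nodes))
}
```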
This is with the Swarm flavour plugin < 0.3.0. I couldn't test this scenario with the latest Swarm plugin because I hit another bug with it.
cc @chungers