docker / docs

Source repo for Docker's Documentation
https://docs.docker.com
Apache License 2.0

Clearly explain that draining a swarm node does not wait for replicas to be started on an active node before stopping tasks on the node being drained #9917

Open airmnichols opened 4 years ago

airmnichols commented 4 years ago

File: engine/swarm/swarm-tutorial/drain-node.md

States:

"Sometimes, such as planned maintenance times, you need to set a node to DRAIN availability. DRAIN availability prevents a node from receiving new tasks from the swarm manager. It also means the manager stops tasks running on the node and launches replica tasks on a node with ACTIVE availability."

This is misleading: a drain operation has no logic to maintain the configured number of replicas while tasks are being rescheduled.

This should be clearly explained.

If you have a two-worker-node swarm and have just performed maintenance on worker node 1, all replicas end up running on worker node 2.

If you then drain worker node 2 for patching, it causes downtime, because swarm does not, for example, stop replica 1 on node 2 and start replica 1 on node 1 before moving on to do the same for replica 2.
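The scenario above can be sketched with the docker CLI (the service name `web` and node names `worker1`/`worker2` are hypothetical):

```shell
# A service with two replicas, both currently running on worker2
# after worker1 was drained earlier for maintenance:
docker service create --name web --replicas 2 nginx

# Bring worker1 back, then drain worker2 for patching:
docker node update --availability active worker1
docker node update --availability drain worker2

# Swarm now stops BOTH tasks on worker2 and only afterwards schedules
# replacements on worker1 -- there is no stop-one/start-one rolling
# behavior, so the service is briefly down. Observe with:
docker service ps web
```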

The current design causes downtime for applications. Support advised that this is expected behavior, and that a workaround is to reconfigure all running services with more replicas, forcing tasks to start on another worker node, before issuing a drain command for a node.
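A minimal sketch of that workaround, assuming the same hypothetical `web` service and `worker1`/`worker2` nodes:

```shell
# 1. Scale up so additional tasks land on worker1 while worker2
#    is still serving traffic:
docker service scale web=4

# 2. Verify the new tasks are running before proceeding:
docker service ps --filter "desired-state=running" web

# 3. Now drain worker2; the service retains capacity on worker1:
docker node update --availability drain worker2

# 4. After patching, reactivate the node and scale back down:
docker node update --availability active worker2
docker service scale web=2
```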

daliborfilus commented 2 years ago

Yes! Bitten by this just now.

airmnichols commented 2 years ago

> Yes! Bitten by this just now.

Kubernetes with pod disruption budgets is the way honestly. After moving from swarm to k8s things have been so much more reliable.
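For comparison, a Kubernetes PodDisruptionBudget tells the eviction API how much capacity must survive a drain; `kubectl drain` then evicts pods gradually and refuses to violate the budget (the `web` labels and names here are hypothetical):

```shell
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  # Never allow evictions to take the app below one running pod.
  minAvailable: 1
  selector:
    matchLabels:
      app: web
EOF
```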

docker-robott commented 1 year ago

There hasn't been any activity on this issue for a long time. If the problem is still relevant, mark the issue as fresh with a /remove-lifecycle stale comment. If not, this issue will be closed in 14 days. This helps our maintainers focus on the active issues.

Prevent issues from auto-closing with a /lifecycle frozen comment.

/lifecycle stale

daliborfilus commented 1 year ago

@docker-robot It's not our fault that the maintainers are busy. That doesn't make the issue invalid. I'd like every damn bot (and their masters) to know this. I understand that these bots help triage important issues, like a garbage collector, but a human should decide whether something is garbage. Not a "timeout".

everyx commented 1 year ago

This is really confusing and reduces flexibility and reliability. Now I need to manually configure a label instead of relying on this built-in availability feature; I hope this can be improved.

everyx commented 1 year ago

related https://github.com/moby/moby/issues/34139