docker-archive / classicswarm

Swarm Classic: a container clustering system. Not to be confused with Docker Swarm, which is at https://github.com/docker/swarmkit
Apache License 2.0

max_replicas_per_node in paused/drained nodes drives to -> no suitable node when updating service #2979

Closed sebastianfelipe closed 4 years ago

sebastianfelipe commented 4 years ago

Hi everyone!

Well, I found a really huge bug here. I work on a project with 4 VMs: 1 manager and 3 workers. 2 of the 3 workers are active and the other one is paused (it was also drained and the issue stayed the same). Some services have max_replicas_per_node set.

Let's say the service is called "service-A" and has "max_replicas_per_node = 2" and "replicas = 6", so in theory service-A should be running on the 3 workers with 2 replicas on each one. Everything works fine. Now I turn off a worker, so service-A has 4/6 replicas running. Everything OK. Now comes the issue. I have a new release for service-A, so I re-run the compose file, or I just run docker service update --force service-A, but I get this: "1/6: no suitable node (max replicas per node limit exceed; scheduling constrain… "
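For reference, a minimal compose sketch of the setup described above (only the deploy values come from this report; the file layout and image name are illustrative):

```yaml
version: "3.8"   # max_replicas_per_node requires compose file format 3.8+

services:
  service-A:
    image: myorg/service-a:latest   # hypothetical image
    deploy:
      replicas: 6
      placement:
        max_replicas_per_node: 2    # at most 2 tasks of this service per node
      update_config:
        order: stop-first           # stop the old task before starting the new one
```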

[Screenshot 2020-04-06 15 25 45]

I think this issue is very important, because it doesn't let us ship new releases: a machine that is supposed to be unschedulable is being asked to schedule something it can't, or maybe a new container counts toward max_replicas_per_node while it is still being created, so it can't be created. Well, I'm not sure what exactly is going on here, but this issue doesn't let me deploy releases.
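Roughly, the sequence that triggers it looks like this (node and service names are placeholders):

```sh
# take one worker out of scheduling (drained here; powering it off behaves the same)
docker node update --availability drain worker-3

# force a rolling update of the service
docker service update --force service-A

# the update stalls; the pending task shows the "no suitable node" error
docker service ps service-A --format '{{.Name}} {{.Node}} {{.CurrentState}} {{.Error}}'
```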

I hope we can find a solution to this.

Thanks, Sebastián.

thaJeztah commented 4 years ago

is the service configured "stop-first" or "start-first"?

sebastianfelipe commented 4 years ago

is the service configured "stop-first" or "start-first"?

Everything is stop-first

[Screenshot 2020-04-06 16 21 28]
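For reference, the update order can be checked and changed from the CLI (service name is a placeholder; the inspect call assumes an update config has been set on the service):

```sh
# show the configured update order (stop-first is the default)
docker service inspect --format '{{.Spec.UpdateConfig.Order}}' service-A

# set it explicitly if needed
docker service update --update-order stop-first service-A
```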

olljanat commented 4 years ago

@sebastianfelipe I would say that it is by design.

You should either (rough commands sketched below):

  • reduce the number of replicas to 4,
  • add one more worker so there is always one spare worker, or
  • schedule maintenance work for a time when no updates are on-going.
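For example (sketch; node and service names are examples):

```sh
# option 1: keep the replica count within what the two active workers can hold
docker service scale service-A=4

# option 2: bring the paused/drained worker back before updating
docker node update --availability active worker-3
```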

sebastianfelipe commented 4 years ago

@sebastianfelipe I would say that it is by design. You should either: …

It cannot be like that, because:

sebastianfelipe commented 4 years ago

Adding to this, it is a service with stop-first, so in theory max_replicas_per_node should work.

olljanat commented 4 years ago

Well, this kind of use case didn't come to my mind when I implemented the max replicas feature. We use it just to make sure that all replicas do not get scheduled onto one worker when another one goes down, e.g. in case of a virtualization platform crash, network failure, or node reboot.

So this needs to be handled on the swarmkit side, in the sense that someone implements a test case which covers this situation and then modifies the logic to handle it correctly.

PS. IMO patching existing workers is a very old-school approach. If you create a flow which first adds a new, already-patched worker to the swarm and then drops the old one, you will not see this issue.
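E.g. a rough flow (token, address, and node names are placeholders):

```sh
# on a manager: print the join command for the new, already-patched worker
docker swarm join-token worker

# on the new worker: join the swarm using that command
docker swarm join --token <token> <manager-ip>:2377

# on a manager: drain the old worker so its tasks move away
docker node update --availability drain old-worker

# once its tasks have moved: leave on the old worker, then remove it on a manager
docker swarm leave          # run on old-worker
docker node rm old-worker   # run on a manager
```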

sebastianfelipe commented 4 years ago

@olljanat thanks for answering. I didn't quite understand: you're saying you designed it for the case where a VM goes down, so what happened here? Because I turned a VM off for good and then that error appeared, so the question is, why is it trying to deploy services on hosts that are down? Maybe I didn't understand your point very well :/

olljanat commented 4 years ago

@sebastianfelipe I mean that if you deploy a service with two replicas to a swarm that has two workers, each of them will run one replica. If you now reboot one of those workers, swarm will notice that the service has only one running replica and will schedule a second replica on the only available worker, so the end result is that one worker has two replicas of the service running and the other worker has zero replicas.

Without max replicas, the only way to fix that is to scale all services to 1 and back to 2 after the worker reboot. Otherwise you end up in a situation where your application is no longer fault tolerant to a worker crash (if the worker which has all the replicas goes down, the whole application is down until swarm starts new replicas on another worker).
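I.e. something like (service name is a placeholder):

```sh
# squeeze the service onto fewer slots, then let swarm spread it out again
docker service scale service-A=1
docker service scale service-A=2
```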

Btw, you can see how this was implemented and which kind of tests were included in https://github.com/docker/swarmkit/pull/2758; that is what needs to be enhanced to handle your use case correctly.

justincormack commented 4 years ago

This repo is not for Docker Swarm; you are looking for https://github.com/docker/swarmkit