is the service configured "stop-first" or "start-first"?
Everything is stop-first
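A minimal sketch of how to check and pin the update order with the standard CLI (service-A is the example service name used later in this thread; the compose-file equivalent is deploy.update_config.order):

```
# Print the service's update configuration; look for the Update order field
docker service inspect --pretty service-A

# Explicitly set stop-first (the default) so old tasks are stopped
# before replacements are scheduled
docker service update --update-order stop-first service-A
```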
@sebastianfelipe I would say that it is by design.
You should either:
- reduce the number of replicas to 4 (see the CLI sketch after this list)
- add one more worker so there is always one spare worker
- schedule maintenance work for a time when there are no updates on-going
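A rough CLI sketch of those workarounds (service and node names are placeholders):

```
# Option 1: keep replicas <= active workers * max replicas per node
docker service update --replicas 4 service-A

# Option 3: during a maintenance window with no update running, take the
# worker out of scheduling and bring it back afterwards
docker node update --availability drain worker-3
docker node update --availability active worker-3
```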
It cannot be like that, because:
Adding to this, it is a service with stop-first, so in theory max_replicas_per_node should work.
Well, this kind of use case didn't come to my mind when I implemented the max replicas feature. We use it just to make sure that all replicas do not get scheduled to one worker when another one goes down, e.g. in case of a virtualization platform crash, network failure or node reboot.
So this needs to be handled on the swarmkit side, in a way that someone implements a test case which covers this situation and then modifies the logic to handle it correctly.
PS. IMO patching existing workers is a very old-school approach. If you create a flow which first adds a new, already patched worker to the swarm and then drops the old one, you will not see this issue.
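A rough sketch of that replace-instead-of-patch flow, assuming a retiring node named old-worker (the token and manager address are placeholders):

```
# On a manager: print the join token for the new, already patched worker
docker swarm join-token worker

# On the new worker: join using the printed token
# docker swarm join --token <token> <manager-ip>:2377

# Back on a manager: drain the old worker so its tasks move away,
# then remove it after it has left the swarm
docker node update --availability drain old-worker
# (on old-worker) docker swarm leave
docker node rm old-worker
```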
@olljanat thanks for answering. I didn't quite understand: you're saying you had the case of a VM going down in mind, so what happened here? Because I definitely turned off a VM and then that error appeared, so the question is, why is it trying to deploy services on hosts that are down? Maybe I didn't understand your point very well :/
@sebastianfelipe I mean that if you deploy a service with two replicas to a swarm which has two workers, each of them will run one replica. If you now reboot one of those workers, swarm will notice that only one replica of the service is running and schedule the second replica on the only available worker, so the end result is that one worker has two replicas of the service running and the second worker has zero replicas.
The only way to fix that without max replicas is to scale all services to 1 and back to 2 after the worker reboot. Otherwise you end up in a situation where your application is no longer fault tolerant against a worker crash (if the worker which has all the replicas goes down, then the whole application is down until swarm restarts new replicas on another worker).
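For example, that rebalancing workaround looks roughly like this (the service name is just an example):

```
# After the rebooted worker is back, force the replicas to spread again
docker service scale service-A=1
docker service scale service-A=2
```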
Btw, you can see how this was implemented and which kind of tests were included in https://github.com/docker/swarmkit/pull/2758; that is what needs to be enhanced to handle your use case correctly.
This repo is not for Docker swarm; you are looking for https://github.com/docker/swarmkit
Hi everyone!
Well, I found what is a really huge bug here. I work on a project with 4 VMs: 1 manager and 3 workers. 2 of the 3 workers are active and the other one is paused (it was also drained and the issue stays the same). Some services have max_replicas_per_node set.
Let's say the service is called "service-A" and has "max_replicas_per_node = 2" and "replicas = 6", so in theory service-A should be running on the 3 workers with 2 replicas on each one. Everything works fine. Now I turn off a worker, so that service-A has 4/6 replicas running. Everything OK. Now comes the issue. I have a new release for service-A, so I re-run the compose file or I just run docker service update --force service-A, but I get this: "1/6: no suitable node (max replicas per node limit exceed; scheduling constrain…"
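The reported setup can be approximated with plain CLI commands (the image name is a placeholder; the original deployment uses a compose file with deploy.placement.max_replicas_per_node):

```
# 6 replicas, at most 2 per node, spread over 3 workers
docker service create --name service-A \
  --replicas 6 \
  --replicas-max-per-node 2 \
  nginx:latest

# With one worker down (4/6 replicas running), a forced update then fails with
# "no suitable node (max replicas per node limit exceed; ...)"
docker service update --force service-A
```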
I think this issue is very important, because it doesn't let us roll out new releases: a machine that is supposed to be excluded from scheduling is being considered for scheduling something it can't run, or maybe a container that is still being created already counts towards max_replicas_per_node, so the replacement can't be created. Well, I'm not sure what exactly is going on here, but this issue doesn't let me deploy releases.
I hope we can find some solution to this.
Thanks, Sebastián.