leojonathanoh opened this issue 5 years ago
Yep, I can reproduce this as well.
@cjdcordeiro how did you reproduce this? What is your environment?
Just an update: since I opened this issue, I am still experiencing the very same problem. It seems to be directly related to using `docker stack rm`, more so than `docker stack update`. If you only ever use `docker stack deploy` and never `docker stack rm`, then the problem might never happen.
From my recent experiences on the command line, running `docker stack rm` (similar to the reproduce steps above) removes most resources, then when it reaches the point of removing networks it actually removes a few, then abruptly hangs for about 10-20 seconds before finally showing the message `Failed to remove network: network xxxxx not found`. Then, for the next 20-30 seconds, the Docker daemon itself hangs: any autocompletion of `docker service` command lines, such as `docker service update <TAB>`, that requires listing Docker objects hangs for that duration, until the daemon finally clears up and responds, emitting a bell. However, I did not test whether other Docker command lines unrelated to Swarm, such as `docker ps`, were also hung during that period, because every time I experienced the hang, my command line was hung.
Notably, when using the Portainer web UI to remove a stack, Portainer shows a blank screen with a red error message at the top right: 'Unable to communicate with endpoint'. Portainer then stops working for about the same duration (20-30 seconds), consistent with the duration of the 'hang' on the command line.
@leojonathanoh exactly as you've described. Simply doing a `docker stack deploy` and then a `docker stack rm`. The error message will appear, and then if I try to re-deploy a stack with the same name, Docker will try to pick up that same network, because it is still listed, but somehow broken.
@cjdcordeiro Just curious, what were the specs of your stack, e.g. number of services and networks? The issue seems to only happen for a stack with at least 2 services and at least 2 networks.
Yes, I had about 6 or 7 services and 3 networks.
@cjdcordeiro thanks for the info. Hopefully others who experience the same issue can share their stack specs, so we can narrow down the scope of the issue and hopefully get a bug fix.
@leojonathanoh and @cjdcordeiro, do you folks see this issue with the latest Docker CE version, 19.03.0-rc2? There have been some fixes in this area related to stale lb-endpoints (load balancer endpoints) in the last few months.
Yes, this is still occurring on Docker version 19.03.1, build 74b1e89e8a.
I see this with:

```
Client:
 Version:           18.09.2
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        6247962
 Built:             Tue Feb 26 23:52:23 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.09.2
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       6247962
  Built:            Wed Feb 13 00:24:14 2019
  OS/Arch:          linux/amd64
  Experimental:     false
```
On `docker stack rm`:

```
Removing network update_platform
Failed to remove network o3pomemc42ivv4dwg5gzp2zgv: Error response from daemon: network o3pomemc42ivv4dwg5gzp2zgv not found
Failed to remove some resources from stack: update
```
I have 8 networks defined, per `docker network ls`.
If I run the `docker stack rm` command a few times with sleeps between them, the state is properly restored, e.g. `docker stack rm update ; sleep 5 ; docker stack rm update ; sleep 5 ; docker stack rm update ; sleep 5 ; docker stack rm update ; sleep 5 ; docker stack deploy update --compose-file docker-compose.yml`.
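The same retry-and-sleep workaround can be written as a small loop; this is only a sketch based on the command above (the stack name `update`, the 5-second sleep, and the retry count of 4 are taken from or assumed to match that example):

```sh
#!/bin/sh
# Retry `docker stack rm` a few times, pausing between attempts so the swarm
# has time to finish tearing down services and networks, then redeploy.
STACK=update

for i in 1 2 3 4; do
  docker stack rm "$STACK"
  sleep 5
done

docker stack deploy "$STACK" --compose-file docker-compose.yml
```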
It's been some time; I'll share a simple workaround I've been using that works 100% of the time when `rm`-ing a stack.
Create a tmp `docker-stack.yml`:

```
$ cat docker-stack.yml
version: '3.7'
services:
  tmp:
    image: alpine
    entrypoint: /bin/sh
    command:
      - -c
      - 'sleep 1000000000'
```
Then deploy it over the current stack `my-stack`:

```
docker stack deploy -c docker-stack.yml my-stack --prune
```
That 100% works in removing everything in the original stack `my-stack`. After which you may safely remove the stack with the `tmp` service:

```
$ docker stack rm my-stack
```
Hope it works for anyone out there.
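For reference, here is the same workaround as one sequence; it is only a restatement of the steps above (the stack name `my-stack` and the tmp `docker-stack.yml` from the previous comment are assumed):

```sh
# Deploy the single-service tmp stack over the broken stack. --prune removes
# every service no longer referenced in the new compose file, which should
# leave the stack's networks with no attached containers.
docker stack deploy -c docker-stack.yml my-stack --prune

# With only the tmp service left, removing the stack now works cleanly.
docker stack rm my-stack
```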
I have reproduced this behaviour on 2 different swarm clusters. For info, see my comment on a similar issue that has been open on the moby repo since 2016.
I got this issue with containers that were previously bound to a stack and still running, but no longer listed by `docker service ls`. When inspecting the network, they are listed under the "Containers" key. Killing them was enough to delete the network (`docker ps | awk '/<stack name>_/{print $1}' | xargs docker kill`).
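A quick way to see those leftover containers is to inspect the stale network directly; this is a sketch, where `<network>` is a placeholder for the stack network's name or ID:

```sh
# List the containers the daemon still considers attached to the network
# (they show up under the "Containers" key of `docker network inspect`).
docker network inspect <network> --format '{{range .Containers}}{{.Name}}{{"\n"}}{{end}}'
```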
The same error occurred for me, even with one network. It is exactly as @thosil said: after removing the stack, one service didn't shut down, and that kept the network from being deleted. It took more than 20 seconds to stop that service manually with `docker stop`, and after that the network automatically got removed.
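One way to find and stop whatever is left of a removed stack is to filter containers by the stack label that `docker stack deploy` sets; this is only a sketch, with `my-stack` as a placeholder stack name:

```sh
# Stack-deployed containers carry the com.docker.stack.namespace label,
# so this lists anything still running from the removed stack and stops it.
docker ps --filter "label=com.docker.stack.namespace=my-stack" -q | xargs -r docker stop
```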
Expected behavior

A stack should `rm` and `deploy` cleanly.

Actual behavior

`docker stack rm` and a subsequent `docker stack deploy` of the same stack name fails with an error that `network xxxxx not found`, resulting in a failed deployment or update of a stack.

Command line:
Explanation

As seen above, remove a stack `my-stack` and you get an error `Failed to remove network j1u3lx3xr81hdxbz4twbxggdp`, with this network still present in `docker network ls` under the name `my-stack_db-maintenance-network`; yet any attempt to remove it with `docker network rm j1u3lx3xr81hdxbz4twbxggdp` hopelessly shows that the network named `my-stack_db-maintenance-network` cannot be removed, which is confirmed by the error logged by the Docker daemon in `/var/log/syslog`. Any subsequent deployment of the stack `my-stack` fails with the same error, and the stack fails to be created or updated cleanly.

The only workaround is to restart the `docker` daemon, after which the 'missing' network goes away. This, however, is not viable for production systems.

Another workaround is to use a different stack name; in this case, the stack name `my-stack-2` and the command `docker stack deploy -c docker-stack.yml my-stack-2`. This approach is a redeployment of the entire stack under a different stack name, which recreates the networks from scratch, namespaced by the stack name. However, such an approach equates to deploying a new stack instead of updating an existing one. It also implies that the deployment process (CD) must be able to detect when the deployment fails, which should be the job of the orchestration system.
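As a quick way to confirm the state described above on a manager node, one might run something like the following; the network ID is the one from this report, and the log check assumes the daemon logs to syslog as mentioned above (on systemd hosts, `journalctl -u docker.service` shows the same messages):

```sh
# The network is still listed...
docker network ls | grep db-maintenance-network

# ...but removing it by ID fails with "not found".
docker network rm j1u3lx3xr81hdxbz4twbxggdp

# The daemon logs the corresponding error.
grep -i 'network' /var/log/syslog | tail
```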
Consequences

Because of this behaviour, a stack cannot be updated completely: sometimes, some of the other services that don't use that 'missing' network are updated, but sometimes none of them are. The phantom or residue network does not disappear, even after many weeks.
Steps to reproduce the behavior

EDIT: One way to reproduce this is to remove an existing stack and redeploy it:

```
docker stack rm my-stack; docker stack deploy -c docker-stack.yml my-stack
```

The `docker-stack.yml` is as such:
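(The original compose file is not reproduced in this thread. Purely as an illustration, a minimal `docker-stack.yml` matching the setup described here, with at least two services and two networks and a hypothetical `pma` service attached to `db-maintenance-network`, could look something like this; the image choices are placeholders.)

```yaml
version: '3.7'

services:
  db:
    image: mariadb:10.4              # placeholder database service
    networks:
      - db-network
  pma:
    image: phpmyadmin/phpmyadmin     # placeholder for the pma service mentioned below
    networks:
      - db-network
      - db-maintenance-network

networks:
  db-network:
  db-maintenance-network:
```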
In this case, the `pma` service might be removed, but the 'residue' of its `my-stack_db-maintenance-network` is left.

Output of `docker version`:

Output of `docker info`:

Additional environment details (AWS, VirtualBox, physical, etc.)
Additional Investigation
When does the behavior not occur?
When does the behavior occur?