@vieux @abronan any thoughts?
@ndeloof I'll take a look asap.
Same here on our staging swarm with 1.1.3 - custom rescheduling. We were starting stacks with docker-compose; two stacks which have containers connected in a network together (separate compose files, with container affinity constraints) show errors.
So I tried to manually remove the network
root@sepp-roj:/repo/stacks/auto/myapp/app.myapp.com# docker network rm myapp_default
Error response from daemon: 500 Internal Server Error: network myapp_default has active endpoints
Inspected it...
root@sepp-roj:/repo/stacks/auto/myapp/app.myapp.com# docker network inspect myapp_default
[
{
"Name": "myapp_default",
"Id": "f67879d64b0229990bcd9c43e1e57630fd548eba0afabdf85b33022afee73d80",
"Scope": "global",
"Driver": "overlay",
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "10.0.18.0/24",
"Gateway": "10.0.18.1/24"
}
]
},
"Containers": {
"ep-61307b606ed6ee0176cbd6acb7b3d031a182851bc343c9ac4c4466cb7700136d": {
"Name": "myapp_worker_1",
"EndpointID": "61307b606ed6ee0176cbd6acb7b3d031a182851bc343c9ac4c4466cb7700136d",
"MacAddress": "02:42:0a:00:12:05",
"IPv4Address": "10.0.18.5/24",
"IPv6Address": ""
},
"ep-8252faf56a8732db71ddb35ce90a803ad9fc44fd92cfa684aa6c5faef8b23ead": {
"Name": "myapp_redis_1",
"EndpointID": "8252faf56a8732db71ddb35ce90a803ad9fc44fd92cfa684aa6c5faef8b23ead",
"MacAddress": "02:42:0a:00:12:02",
"IPv4Address": "10.0.18.2/24",
"IPv6Address": ""
},
"ep-8265326f50459e8f6314b9f9496fcf84aa628eb840a6885ac3eb6266d66a1de6": {
"Name": "myapp_nginx_1",
"EndpointID": "8265326f50459e8f6314b9f9496fcf84aa628eb840a6885ac3eb6266d66a1de6",
"MacAddress": "02:42:0a:00:12:04",
"IPv4Address": "10.0.18.4/24",
"IPv6Address": ""
},
"ep-da381f7ad537a548457878900e588538ce46d277b0ae03c6db4c578be7c65ceb": {
"Name": "myapp_php_1",
"EndpointID": "da381f7ad537a548457878900e588538ce46d277b0ae03c6db4c578be7c65ceb",
"MacAddress": "02:42:0a:00:12:03",
"IPv4Address": "10.0.18.3/24",
"IPv6Address": ""
}
},
"Options": {}
}
]
Then I tried to remove the containers, which fails 💥
root@sepp-roj:/repo/stacks/auto/myapp/app.myapp.com# docker inspect myapp_php_1
[]
Error: No such image or container: myapp_php_1
Trying to figure out a way to start the stacks without renaming them.
PS: I think our overlay networking is not working 100% properly, but this does not affect our deployments at the moment, since they all end up on the same node.
After trying "everything" from docker rm through docker network rm, docker-compose down, etc., the only workaround I found was to manually remove the keys from our consul discovery service.
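In case it helps others, here is a rough sketch of that Consul cleanup, assuming the default libnetwork key prefix (docker/network/v1.0/) and a Consul agent reachable on localhost:8500; list the keys first and double-check before deleting anything, since the exact paths may differ in your setup:
# list the libnetwork keys to locate the stale endpoint entries
curl -s 'http://localhost:8500/v1/kv/docker/network/v1.0/?keys'
# delete the key of the stale endpoint; <network-id> and <endpoint-id> are placeholders taken from the listing above
curl -s -X DELETE 'http://localhost:8500/v1/kv/docker/network/v1.0/endpoint/<network-id>/<endpoint-id>'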
If a container is registered with a network, but the container no longer exists (for some reason, e.g. node constraints failed in our case), it is not possible to remove the network (retried with swarm 1.2.1-rc1) or to remove the container from the network (since docker complains about a non-existing container 😄).
There should either be a --force option for network rm or for network disconnect.
Same problem here, rescheduling a container in an overlay network generated by docker-compose. I think it's the same issue: the new container fails to start because the previous instance hasn't been cleaned up.
The same problem on swarm 1.2.1
network disconnect has a force option since Docker 1.10. Can you let us know if that can manually resolve your problem? If yes, Swarm can take this logic to clean up an endpoint.
$ docker -H swarm-master-0:2375 network disconnect --help
Usage: docker network disconnect [OPTIONS] NETWORK CONTAINER
Disconnects container from a network
-f, --force Force the container to disconnect from a network
--help Print usage
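For example, with the stale network from earlier in this thread, the manual test would look something like this (network and container names taken from the inspect output above, untested sketch):
# force-disconnect each stale endpoint the network still lists
docker -H swarm-master-0:2375 network disconnect -f myapp_default myapp_php_1
# once all stale endpoints are gone, the network itself should be removable
docker -H swarm-master-0:2375 network rm myapp_default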
more in #2149
@dongluochen after removing a container from the network, it runs into another issue:
System error: nosandbox: error locating sandbox id e671d6cc9648672e5776020a09354310d764c4edc13dd67e60efc0d50e23860f: no sandbox found
@ndeloof @svscorp @schmunk42 #2436 fixes the issue of rescheduling containers with overlay network. The fix is included in Swarm 1.2.5. Can you test 1.2.5 to see if your problem is resolved? Your feedback is appreciated.
On swarm 1.2.5 this fix creates the container but does not start it, even if you set restart always. In the logs I see:
time="2016-09-03T18:33:09Z" level=error msg="Flagging engine as unhealthy. Connect failed 3 times" id="TCTR:BV5C:25BT:GRGL:L5DN:JBAF:VONJ:Z5NH:JARW:BVH4:CITP:JQAS" name=ip-10-0-2-212
time="2016-09-03T18:33:09Z" level=warning msg="Failed to remove network endpoint from old container hopeful_bassi: Error response from daemon: endpoint hopeful_bassi not found"
time="2016-09-03T18:33:09Z" level=info msg="Rescheduled container 9dc721640eb7497ece709cf5572cc352379c77b482c62f1fbfe6aacd99bc4161 from ip-10-0-2-212 to ip-10-0-2-249 as 27f6ba6aecefad13b17cebbffb221b146071851e9b562e11a8b4b60745aeca15"
time="2016-09-03T18:34:05Z" level=error msg="Update engine specs failed: Cannot connect to the Docker daemon. Is the docker daemon running on this host?" id="TCTR:BV5C:25BT:GRGL:L5DN:JBAF:VONJ:Z5NH:JARW:BVH4:CITP:JQAS" name=ip-10-0-2-212
time="2016-09-03T18:35:15Z" level=info msg="Removed Engine ip-10-0-2-212"
ubuntu@ip-10-0-1-8:~$ docker -H :4000 ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
27f6ba6aecef redis "docker-entrypoint.sh" 5 minutes ago Created ip-10-0-2-249/hopeful_bassi
56f2595af82e gliderlabs/registrator "/bin/registrator --i" 29 minutes ago Up 12 minutes ip-10-0-2-249/registrator
ebd37173dda6 swarm:1.2.5 "/swarm --experimenta" 29 minutes ago Up 12 minutes 2375/tcp ip-10-0-2-249/swarm
ubuntu@ip-10-0-1-8:~$ docker -H :4000 inspect 27f6ba6aecef
[
{
"Id": "27f6ba6aecefad13b17cebbffb221b146071851e9b562e11a8b4b60745aeca15",
"Created": "2016-09-03T18:33:09.764678137Z",
"Path": "docker-entrypoint.sh",
"Args": [
"redis-server"
],
"State": {
"Status": "created",
"Running": false,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 0,
"ExitCode": 0,
"Error": "",
"StartedAt": "0001-01-01T00:00:00Z",
"FinishedAt": "0001-01-01T00:00:00Z"
},
"Image": "sha256:50e38ce0458ffbd0edb6b340287a38e44263c80abe20739492c8faa0e3281465",
"ResolvConfPath": "",
"HostnamePath": "",
"HostsPath": "",
"LogPath": "",
"Node": {
"ID": "B3K4:SH4I:3WHO:3CLE:SQP3:MGRF:7PGS:YWXZ:UGPG:7SL4:PWXR:C5NT",
"IP": "10.0.2.249",
"Addr": "10.0.2.249:2375",
"Name": "ip-10-0-2-249",
"Cpus": 1,
"Memory": 1038843904,
"Labels": {
"kernelversion": "4.4.0-36-generic",
"operatingsystem": "Ubuntu 16.04.1 LTS",
"storagedriver": "aufs"
},
"Version": "1.12.1",
"DeltaDuration": 0
},
"Name": "/hopeful_bassi",
"RestartCount": 0,
"Driver": "aufs",
"MountLabel": "",
"ProcessLabel": "",
"AppArmorProfile": "",
"ExecIDs": null,
"HostConfig": {
"Binds": null,
"ContainerIDFile": "",
"LogConfig": {
"Type": "json-file",
"Config": {}
},
"NetworkMode": "ops_default",
"PortBindings": {},
"RestartPolicy": {
"Name": "always",
"MaximumRetryCount": 0
},
"AutoRemove": false,
"VolumeDriver": "",
"VolumesFrom": null,
"CapAdd": null,
"CapDrop": null,
"Dns": [],
"DnsOptions": [],
"DnsSearch": [],
"ExtraHosts": null,
"GroupAdd": null,
"IpcMode": "",
"Cgroup": "",
"Links": null,
"OomScoreAdj": 0,
"PidMode": "",
"Privileged": false,
"PublishAllPorts": false,
"ReadonlyRootfs": false,
"SecurityOpt": null,
"UTSMode": "",
"UsernsMode": "",
"ShmSize": 67108864,
"Runtime": "runc",
"ConsoleSize": [
0,
0
],
"Isolation": "",
"CpuShares": 0,
"Memory": 0,
"CgroupParent": "",
"BlkioWeight": 0,
"BlkioWeightDevice": null,
"BlkioDeviceReadBps": null,
"BlkioDeviceWriteBps": null,
"BlkioDeviceReadIOps": null,
"BlkioDeviceWriteIOps": null,
"CpuPeriod": 0,
"CpuQuota": 0,
"CpusetCpus": "",
"CpusetMems": "",
"Devices": [],
"DiskQuota": 0,
"KernelMemory": 0,
"MemoryReservation": 0,
"MemorySwap": 0,
"MemorySwappiness": -1,
"OomKillDisable": false,
"PidsLimit": 0,
"Ulimits": null,
"CpuCount": 0,
"CpuPercent": 0,
"IOMaximumIOps": 0,
"IOMaximumBandwidth": 0
},
"GraphDriver": {
"Name": "aufs",
"Data": null
},
"Mounts": [
{
"Name": "f1db18ac23d2a6078ebbc872e3521432ec5b2f35b171f0327cb9b286924cf711",
"Source": "/var/lib/docker/volumes/f1db18ac23d2a6078ebbc872e3521432ec5b2f35b171f0327cb9b286924cf711/_data",
"Destination": "/data",
"Driver": "local",
"Mode": "",
"RW": true,
"Propagation": ""
}
],
"Config": {
"Hostname": "9dc721640eb7",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"ExposedPorts": {
"6379/tcp": {}
},
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"GOSU_VERSION=1.7",
"REDIS_VERSION=3.2.3",
"REDIS_DOWNLOAD_URL=http://download.redis.io/releases/redis-3.2.3.tar.gz",
"REDIS_DOWNLOAD_SHA1=92d6d93ef2efc91e595c8bf578bf72baff397507"
],
"Cmd": [
"redis-server"
],
"Image": "redis",
"Volumes": {
"/data": {}
},
"WorkingDir": "/data",
"Entrypoint": [
"docker-entrypoint.sh"
],
"OnBuild": null,
"Labels": {
"com.docker.swarm.id": "14933ddb8a49a03072eeea60e53a1fa962417799108d2500ffd66eec18d2b490",
"com.docker.swarm.reschedule-policies": "[\"on-node-failure\"]"
}
},
"NetworkSettings": {
"Bridge": "",
"SandboxID": "",
"HairpinMode": false,
"LinkLocalIPv6Address": "",
"LinkLocalIPv6PrefixLen": 0,
"Ports": null,
"SandboxKey": "",
"SecondaryIPAddresses": null,
"SecondaryIPv6Addresses": null,
"EndpointID": "",
"Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"IPAddress": "",
"IPPrefixLen": 0,
"IPv6Gateway": "",
"MacAddress": "",
"Networks": {
"ops_default": {
"IPAMConfig": null,
"Links": null,
"Aliases": [
"27f6ba6aecef"
],
"NetworkID": "d1b3d44b7f2845eba36f5a8d0fb3adb51b444b89d96224d9c7df422fc42c6594",
"EndpointID": "",
"Gateway": "",
"IPAddress": "",
"IPPrefixLen": 0,
"IPv6Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"MacAddress": ""
}
}
}
}
]
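Since the inspect above shows the rescheduled container stuck in the created state despite RestartPolicy always, one thing worth trying as an interim workaround (just a guess, it may hit the same stale-endpoint error) is starting it by hand:
# the rescheduled container was only created, not started
docker -H :4000 start 27f6ba6aecef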
Any news on this case?
In our project we ran into the same problem.
One of the cluster hosts was restarted and then we could not run a bunch of containers because they still existed in the network.
We used the same workaround @schmunk42 mentioned: the endpoints were manually removed from the key/value storage in consul.
Running https://github.com/ndeloof/rpi-voting-app/tree/master/vote-apps on 4 Raspberry Pis. This is the Docker voting app, adapted by @jmMeessen to run on ARM. Running Swarm 1.2.0 (ndeloof/rpi-swarm, based on hypriot's script).
docker-compose up to deploy the app on the cluster. The voting app lands on pi #4. Kill pi4 - really kill it by power unplug, not a kind system shutdown.
Expected:
container is rescheduled on another pi, restarted, and service restored.
Actual:
failure is detected, the container is re-created on another node, but the attempt to start it fails:
Looks to me like the overlay network forbids the new container to start, as the previous instance hasn't been cleaned up, so the endpoint still exists.
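If that is the case, it should be visible in the network inspect: the dead node's container should still be listed as an endpoint even though the container itself is gone. A quick check along those lines (the network name is a placeholder, use whatever docker network ls shows for the compose project):
# look for a stale endpoint left behind by the dead node
docker network inspect <project>_default
# if one is listed, try force-disconnecting it before the rescheduled container starts
docker network disconnect -f <project>_default <stale-container-name>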