docker-archive / classicswarm

Swarm Classic: a container clustering system. Not to be confused with Docker Swarm which is at https://github.com/docker/swarmkit

docker swarm fails to start rescheduled container #2133

Closed. ndeloof closed this issue 4 years ago.

ndeloof commented 8 years ago

Running https://github.com/ndeloof/rpi-voting-app/tree/master/vote-apps on 4 Raspberry Pis. This is the Docker voting app, adapted by @jmMeessen to run on ARM. Running Swarm 1.2.0 (ndeloof/rpi-swarm, based on hypriot's script).
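
For context, rescheduling on node failure in Swarm 1.2 is opt-in per container; it is enabled through a reschedule environment entry or the com.docker.swarm.reschedule-policies label (the label also appears in an inspect output later in this thread). A minimal sketch of the two forms, using a placeholder image rather than the actual voting-app services:

# Sketch only: either form marks a container for rescheduling on node failure.
docker run -d --name voting-app -e reschedule:on-node-failure <image>
docker run -d --name voting-app --label 'com.docker.swarm.reschedule-policies=["on-node-failure"]' <image>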

docker-compose up to deploy the app on the cluster. The voting app lands on pi4. Kill pi4: really kill it by unplugging the power, not a clean system shutdown.

Expected:

The container is rescheduled on another pi, restarted, and the service is restored.

Actual:

The failure is detected and the container is re-created on another node, but the attempt to start it fails:

time="2016-04-15T18:21:25Z" level=info msg="Removed Engine pi4" 
time="2016-04-15T18:21:25Z" level=info msg="Rescheduled container 7d233bb937df415a62b77a5e88f40e241c518b130cb85232d5eaa38aa05e6966 from pi4 to pi2 as 320470a2a63f975ed5ab8f28bec2ac6fb4e438f3c25d583caf33afbe46af2811" 
time="2016-04-15T18:21:25Z" level=info msg="Container 7d233bb937df415a62b77a5e88f40e241c518b130cb85232d5eaa38aa05e6966 was running, starting container 320470a2a63f975ed5ab8f28bec2ac6fb4e438f3c25d583caf33afbe46af2811"
time="2016-04-15T18:21:26Z" level=error msg="Failed to start rescheduled container 320470a2a63f975ed5ab8f28bec2ac6fb4e438f3c25d583caf33afbe46af2811: 500 Internal Server Error: service endpoint with name voteapps_voting-app_1 already exists\n" 
time="2016-04-15T18:21:26Z" level=info msg="Rescheduled container e23d927e42625e63d0b6c6ed284fa05b208972caecf1b7640b51998e96679c46 from pi4 to pi3 as 4ab4282dc488c553ea10392550f9ba0bdab9d6026b6c3c793c3858be89b7ac19" 
time="2016-04-15T18:21:26Z" level=info msg="Container e23d927e42625e63d0b6c6ed284fa05b208972caecf1b7640b51998e96679c46 was running, starting container 4ab4282dc488c553ea10392550f9ba0bdab9d6026b6c3c793c3858be89b7ac19" 
time="2016-04-15T18:21:27Z" level=error msg="Failed to start rescheduled container 4ab4282dc488c553ea10392550f9ba0bdab9d6026b6c3c793c3858be89b7ac19: 500 Internal Server Error: service endpoint with name voteapps_worker_1 already exists\n"

It looks to me like the overlay network forbids the new container from starting because the previous instance hasn't been cleaned up, so the endpoint still exists.
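
A quick way to check that theory from the Swarm manager (a sketch; voteapps_default is the name docker-compose would typically give this project's network, which is an assumption):

# If the old endpoint is still registered, it should appear under "Containers"
# even though the node that hosted it is gone.
docker network inspect voteapps_default

# Removing the network would then presumably fail with "has active endpoints".
docker network rm voteapps_default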

ndeloof commented 8 years ago

@vieux @abronan any thoughts?

vieux commented 8 years ago

@ndeloof I'll take a look asap.

schmunk42 commented 8 years ago

Same here on our staging swarm with 1.1.3 - custom rescheduling.

We were starting stacks with docker-compose; two stacks whose containers are connected to a shared network (separate compose files, with container affinity constraints) show errors.

So I tried to manually remove the network:

root@sepp-roj:/repo/stacks/auto/myapp/app.myapp.com# docker network rm myapp_default
Error response from daemon: 500 Internal Server Error: network myapp_default has active endpoints

Inspected it...

root@sepp-roj:/repo/stacks/auto/myapp/app.myapp.com# docker network inspect myapp_default
[
    {
        "Name": "myapp_default",
        "Id": "f67879d64b0229990bcd9c43e1e57630fd548eba0afabdf85b33022afee73d80",
        "Scope": "global",
        "Driver": "overlay",
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.18.0/24",
                    "Gateway": "10.0.18.1/24"
                }
            ]
        },
        "Containers": {
            "ep-61307b606ed6ee0176cbd6acb7b3d031a182851bc343c9ac4c4466cb7700136d": {
                "Name": "myapp_worker_1",
                "EndpointID": "61307b606ed6ee0176cbd6acb7b3d031a182851bc343c9ac4c4466cb7700136d",
                "MacAddress": "02:42:0a:00:12:05",
                "IPv4Address": "10.0.18.5/24",
                "IPv6Address": ""
            },
            "ep-8252faf56a8732db71ddb35ce90a803ad9fc44fd92cfa684aa6c5faef8b23ead": {
                "Name": "myapp_redis_1",
                "EndpointID": "8252faf56a8732db71ddb35ce90a803ad9fc44fd92cfa684aa6c5faef8b23ead",
                "MacAddress": "02:42:0a:00:12:02",
                "IPv4Address": "10.0.18.2/24",
                "IPv6Address": ""
            },
            "ep-8265326f50459e8f6314b9f9496fcf84aa628eb840a6885ac3eb6266d66a1de6": {
                "Name": "myapp_nginx_1",
                "EndpointID": "8265326f50459e8f6314b9f9496fcf84aa628eb840a6885ac3eb6266d66a1de6",
                "MacAddress": "02:42:0a:00:12:04",
                "IPv4Address": "10.0.18.4/24",
                "IPv6Address": ""
            },
            "ep-da381f7ad537a548457878900e588538ce46d277b0ae03c6db4c578be7c65ceb": {
                "Name": "myapp_php_1",
                "EndpointID": "da381f7ad537a548457878900e588538ce46d277b0ae03c6db4c578be7c65ceb",
                "MacAddress": "02:42:0a:00:12:03",
                "IPv4Address": "10.0.18.3/24",
                "IPv6Address": ""
            }
        },
        "Options": {}
    }
]

And tried to remove the containers, which failed 💥

root@sepp-roj:/repo/stacks/auto/myapp/app.myapp.com# docker inspect myapp_php_1
[]
Error: No such image or container: myapp_php_1

Trying to figure out a way to start the stacks without renaming them.

PS: I think our overlay is not working 100% properly, but this does not affect our deployments at the moment, since they all end up on the same node.

schmunk42 commented 8 years ago

Possibly related:

schmunk42 commented 8 years ago

After trying "everything" from docker rm over docker network rm, docker-compose down, etc.. The only workaround I found is to manually remove the keys from our consul discovery service.

If a container is registered with a network but the container no longer exists (for some reason, e.g. node constraints failed in our case), it is not possible to remove the network (retried with swarm 1.2.1-rc1) or to disconnect the container from the network (since docker complains about a non-existent container 😄).

There should be a --force option for either network rm or network disconnect.
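
A sketch of what such a Consul-level cleanup could look like (the host is a placeholder, and the docker/network/v1.0 key layout is an assumption that varies between libnetwork versions, so list the keys before deleting anything):

# List the keys libnetwork has stored under its prefix (assumed layout).
curl "http://<consul-host>:8500/v1/kv/docker/network/v1.0/?keys"

# Delete the stale endpoint key for the orphaned container (IDs are placeholders).
curl -X DELETE "http://<consul-host>:8500/v1/kv/docker/network/v1.0/endpoint/<network-id>/<endpoint-id>"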

pballester commented 8 years ago

Same problem here rescheduling a container in an overlay network generated by docker-compose. I think it's the same cause: the new container fails to start because the previous instance hasn't been cleaned up.

OlgaIvantsova commented 8 years ago

The same problem on swarm 1.2.1

dongluochen commented 8 years ago

network disconnect has had a --force option since Docker 1.10. Can you let us know if that manually resolves your problem? If yes, Swarm can adopt this logic to clean up the endpoint.

$ docker -H swarm-master-0:2375 network disconnect --help

Usage:  docker network disconnect [OPTIONS] NETWORK CONTAINER

Disconnects container from a network

  -f, --force        Force the container to disconnect from a network
  --help             Print usage
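
Applied to the orphaned endpoints from the earlier inspect output, that would be something like this (a sketch; the manager address and the network/container names are taken from the comments above):

# Force-disconnect the stale endpoint even though the container no longer exists,
# then remove the now-empty network.
docker -H swarm-master-0:2375 network disconnect -f myapp_default myapp_php_1
docker -H swarm-master-0:2375 network rm myapp_default
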
doronp commented 8 years ago

more in #2149

svscorp commented 8 years ago

@dongluochen after removing a container from the network, it runs into another issue:

System error: nosandbox: error locating sandbox id e671d6cc9648672e5776020a09354310d764c4edc13dd67e60efc0d50e23860f: no sandbox found

dongluochen commented 8 years ago

@ndeloof @svscorp @schmunk42 #2436 fixes the issue of rescheduling containers with overlay network. The fix is included in Swarm 1.2.5. Can you test 1.2.5 to see if your problem is resolved? Your feedback is appreciated.

goruha commented 8 years ago

On swarm 1.2.5 this fix creates the container but does not start it, even if you set restart: always. In the logs I see:

time="2016-09-03T18:33:09Z" level=error msg="Flagging engine as unhealthy. Connect failed 3 times" id="TCTR:BV5C:25BT:GRGL:L5DN:JBAF:VONJ:Z5NH:JARW:BVH4:CITP:JQAS" name=ip-10-0-2-212 
time="2016-09-03T18:33:09Z" level=warning msg="Failed to remove network endpoint from old container hopeful_bassi: Error response from daemon: endpoint hopeful_bassi not found" 
time="2016-09-03T18:33:09Z" level=info msg="Rescheduled container 9dc721640eb7497ece709cf5572cc352379c77b482c62f1fbfe6aacd99bc4161 from ip-10-0-2-212 to ip-10-0-2-249 as 27f6ba6aecefad13b17cebbffb221b146071851e9b562e11a8b4b60745aeca15" 
time="2016-09-03T18:34:05Z" level=error msg="Update engine specs failed: Cannot connect to the Docker daemon. Is the docker daemon running on this host?" id="TCTR:BV5C:25BT:GRGL:L5DN:JBAF:VONJ:Z5NH:JARW:BVH4:CITP:JQAS" name=ip-10-0-2-212 
time="2016-09-03T18:35:15Z" level=info msg="Removed Engine ip-10-0-2-212" 
ubuntu@ip-10-0-1-8:~$ docker -H :4000 ps -a
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS               NAMES
27f6ba6aecef        redis                    "docker-entrypoint.sh"   5 minutes ago       Created                                 ip-10-0-2-249/hopeful_bassi
56f2595af82e        gliderlabs/registrator   "/bin/registrator --i"   29 minutes ago      Up 12 minutes                           ip-10-0-2-249/registrator
ebd37173dda6        swarm:1.2.5              "/swarm --experimenta"   29 minutes ago      Up 12 minutes       2375/tcp            ip-10-0-2-249/swarm
ubuntu@ip-10-0-1-8:~$ docker -H :4000 inspect 27f6ba6aecef
[
    {
        "Id": "27f6ba6aecefad13b17cebbffb221b146071851e9b562e11a8b4b60745aeca15",
        "Created": "2016-09-03T18:33:09.764678137Z",
        "Path": "docker-entrypoint.sh",
        "Args": [
            "redis-server"
        ],
        "State": {
            "Status": "created",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "0001-01-01T00:00:00Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },
        "Image": "sha256:50e38ce0458ffbd0edb6b340287a38e44263c80abe20739492c8faa0e3281465",
        "ResolvConfPath": "",
        "HostnamePath": "",
        "HostsPath": "",
        "LogPath": "",
        "Node": {
            "ID": "B3K4:SH4I:3WHO:3CLE:SQP3:MGRF:7PGS:YWXZ:UGPG:7SL4:PWXR:C5NT",
            "IP": "10.0.2.249",
            "Addr": "10.0.2.249:2375",
            "Name": "ip-10-0-2-249",
            "Cpus": 1,
            "Memory": 1038843904,
            "Labels": {
                "kernelversion": "4.4.0-36-generic",
                "operatingsystem": "Ubuntu 16.04.1 LTS",
                "storagedriver": "aufs"
            },
            "Version": "1.12.1",
            "DeltaDuration": 0
        },
        "Name": "/hopeful_bassi",
        "RestartCount": 0,
        "Driver": "aufs",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": null,
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "ops_default",
            "PortBindings": {},
            "RestartPolicy": {
                "Name": "always",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "CapAdd": null,
            "CapDrop": null,
            "Dns": [],
            "DnsOptions": [],
            "DnsSearch": [],
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": null,
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "ConsoleSize": [
                0,
                0
            ],
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": null,
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": [],
            "DiskQuota": 0,
            "KernelMemory": 0,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": -1,
            "OomKillDisable": false,
            "PidsLimit": 0,
            "Ulimits": null,
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0
        },
        "GraphDriver": {
            "Name": "aufs",
            "Data": null
        },
        "Mounts": [
            {
                "Name": "f1db18ac23d2a6078ebbc872e3521432ec5b2f35b171f0327cb9b286924cf711",
                "Source": "/var/lib/docker/volumes/f1db18ac23d2a6078ebbc872e3521432ec5b2f35b171f0327cb9b286924cf711/_data",
                "Destination": "/data",
                "Driver": "local",
                "Mode": "",
                "RW": true,
                "Propagation": ""
            }
        ],
        "Config": {
            "Hostname": "9dc721640eb7",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "ExposedPorts": {
                "6379/tcp": {}
            },
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "GOSU_VERSION=1.7",
                "REDIS_VERSION=3.2.3",
                "REDIS_DOWNLOAD_URL=http://download.redis.io/releases/redis-3.2.3.tar.gz",
                "REDIS_DOWNLOAD_SHA1=92d6d93ef2efc91e595c8bf578bf72baff397507"
            ],
            "Cmd": [
                "redis-server"
            ],
            "Image": "redis",
            "Volumes": {
                "/data": {}
            },
            "WorkingDir": "/data",
            "Entrypoint": [
                "docker-entrypoint.sh"
            ],
            "OnBuild": null,
            "Labels": {
                "com.docker.swarm.id": "14933ddb8a49a03072eeea60e53a1fa962417799108d2500ffd66eec18d2b490",
                "com.docker.swarm.reschedule-policies": "[\"on-node-failure\"]"
            }
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": null,
            "SandboxKey": "",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {
                "ops_default": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": [
                        "27f6ba6aecef"
                    ],
                    "NetworkID": "d1b3d44b7f2845eba36f5a8d0fb3adb51b444b89d96224d9c7df422fc42c6594",
                    "EndpointID": "",
                    "Gateway": "",
                    "IPAddress": "",
                    "IPPrefixLen": 0,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": ""
                }
            }
        }
    }
]
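
As a stopgap, the re-created container can presumably be started by hand once it shows up as Created (the container ID and manager port are taken from the output above):

ubuntu@ip-10-0-1-8:~$ docker -H :4000 start 27f6ba6aecef
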
loockass commented 7 years ago

Any news on this?

In our project we ran into the same problem. One of the cluster hosts was restarted, and afterwards we could not run a bunch of containers because they still existed in the network.
We used the same workaround as @schmunk42 mentioned: the endpoints were manually removed from the key/value storage in Consul.