meermanr opened 4 years ago
Correction: draining the node does delete the network itself, but reactivating the node brings the erroneous behaviour right back!
Draining the node, restarting dockerd, and then reactivating it didn't seem to change anything for the swarm-launched services, but when I attempted to run my test case again I got a new error (the drain / restart / reactivate sequence I used is sketched after the output below):
docker run -it --rm --network mpdti_default alpine sh -c 'apk --quiet add bind-tools && dig +noall +answer tasks.network_manager'
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
c9b1b535fdd9: Pull complete
Digest: sha256:ab00606a42621fb68f2ed6ad3c88be54397f981a7b70a79db3d1172b11c4367d
Status: Downloaded newer image for alpine:latest
docker: Error response from daemon: failed to get network during CreateEndpoint: network dhw54uswu0kb28hc6crasj0xu not found.
But a second attempt worked:
docker run -it --rm --network mpdti_default alpine sh -c 'apk --quiet add bind-tools && dig +noall +answer tasks.network_manager'
tasks.network_manager. 600 IN A 172.31.8.159
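For reference, the drain / restart / reactivate cycle I'm describing is essentially the following (a minimal sketch; the node hostname is a placeholder for the affected worker):

```sh
# On a manager: stop scheduling tasks onto the affected node and evict existing ones
docker node update --availability drain <node-hostname>

# On the affected node itself: restart the daemon
sudo systemctl restart docker

# On a manager: allow the node to receive tasks again
docker node update --availability active <node-hostname>
```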
Network description, in case it helps:
# docker network inspect dhw54uswu0kb28hc6crasj0xu
[
{
"Name": "mpdti_default",
"Id": "dhw54uswu0kb28hc6crasj0xu",
"Created": "2020-02-26T22:57:33.354757034Z",
"Scope": "swarm",
"Driver": "overlay",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "172.31.0.0/16",
"Gateway": "172.31.0.1"
}
]
},
"Internal": false,
"Attachable": true,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"71a921da61e5d318a3646d3069cdf4bdc7f5a4a1d22536b83882acb485872cef": {
"Name": "angry_austin",
"EndpointID": "b4b6701ed28f6cc25cbe8925d12d06c15de25079c67b9b78faea68305975855e",
"MacAddress": "02:42:ac:1f:28:e9",
"IPv4Address": "172.31.40.233/16",
"IPv6Address": ""
},
"9e34cb5543869c0f843ce21f6714c5af9fafc969b898b1cadfce985b2e2c3c0b": {
"Name": "mpdti_worker_build_pool_android.66l12d1hqd2ol1i5zi7nbil09.3lu9csnhd5rnu0xvp3f6ctjye",
"EndpointID": "38bb9fb13d3f9f9ed1ae511b1879c759faaa0a463d313b9b222245f5950d2c44",
"MacAddress": "02:42:ac:1f:28:e5",
"IPv4Address": "172.31.40.229/16",
"IPv6Address": ""
},
"lb-mpdti_default": {
"Name": "mpdti_default-endpoint",
"EndpointID": "ea1e94211ff1063138e16b1877e5d5f346134b153f45fdea07565f8cf4bb1895",
"MacAddress": "02:42:ac:1f:28:d5",
"IPv4Address": "172.31.40.213/16",
"IPv6Address": ""
}
},
"Options": {
"com.docker.network.driver.overlay.vxlanid_list": "4114"
},
"Labels": {
"com.docker.stack.namespace": "mpdti"
},
"Peers": [
{
"Name": "f151939a89e3",
"IP": "10.58.203.92"
},
{
"Name": "c62197e7f8a9",
"IP": "10.58.203.47"
},
{
"Name": "218ff8ebeea6",
"IP": "10.58.203.55"
},
{
"Name": "9aaf335debf1",
"IP": "10.58.203.61"
},
{
"Name": "6025bd2955aa",
"IP": "10.58.203.73"
},
{
"Name": "4d5c23a8859c",
"IP": "10.58.203.74"
},
{
"Name": "db2091586806",
"IP": "10.58.203.52"
},
{
"Name": "caf06a766c03",
"IP": "10.58.203.58"
},
{
"Name": "554ce305eec0",
"IP": "10.58.203.64"
},
{
"Name": "f289986309b0",
"IP": "10.58.203.54"
},
{
"Name": "55f67db8b55f",
"IP": "10.58.203.67"
},
{
"Name": "e6604b988202",
"IP": "10.58.203.62"
},
{
"Name": "f7a93b694822",
"IP": "10.58.203.63"
},
{
"Name": "282e8ce82a81",
"IP": "10.58.203.77"
},
{
"Name": "1e4cb1e5edbc",
"IP": "10.58.203.91"
},
{
"Name": "6073b6866f73",
"IP": "10.58.203.39"
},
{
"Name": "d87e894edd21",
"IP": "10.58.203.69"
},
{
"Name": "0882ec36e09d",
"IP": "10.58.203.75"
},
{
"Name": "4d5a1ca24529",
"IP": "10.58.203.53"
},
{
"Name": "ac6107fc6447",
"IP": "10.58.203.66"
},
{
"Name": "cec4c94db5ce",
"IP": "10.58.203.56"
},
{
"Name": "78c5f90440a1",
"IP": "10.58.203.83"
},
{
"Name": "89d3a7eded8a",
"IP": "10.58.203.60"
},
{
"Name": "d433c8e3bb92",
"IP": "10.58.203.57"
},
{
"Name": "6004340b87c8",
"IP": "10.58.203.40"
},
{
"Name": "8146fd246f70",
"IP": "10.58.203.71"
},
{
"Name": "0fa5e6eae40a",
"IP": "10.58.203.94"
},
{
"Name": "3e13891b8af5",
"IP": "10.58.203.86"
},
{
"Name": "df19351d2cc2",
"IP": "10.58.203.41"
},
{
"Name": "b21a92b35a37",
"IP": "10.58.203.81"
},
{
"Name": "2e75b279011b",
"IP": "10.58.203.78"
},
{
"Name": "fbd0e868175d",
"IP": "10.58.203.50"
},
{
"Name": "c01ac1cef12b",
"IP": "10.58.203.87"
},
{
"Name": "d021d6194886",
"IP": "10.58.203.89"
},
{
"Name": "b0ce820e7342",
"IP": "10.58.203.72"
},
{
"Name": "4b63c3ecb279",
"IP": "10.58.203.76"
},
{
"Name": "e29eaf6d3bb6",
"IP": "10.58.203.79"
},
{
"Name": "aa0f360100c9",
"IP": "10.58.203.90"
}
]
}
]
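Since this smells like stale state, one comparison I intend to make between nodes is the peer list each daemon holds for this network; something along these lines (a sketch, and I have not captured that output here):

```sh
# On each node, dump just the overlay peer list for this network,
# then diff the results between hosts to spot nodes with a stale view.
docker network inspect --format '{{json .Peers}}' mpdti_default | python3 -m json.tool
```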
Expected behavior
Services created by docker service create should be able to resolve the names of all other services within the same overlay network.
Actual behavior
Some nodes in my swarm cluster are unable to resolve some service names in a given overlay network.
For example, I used pssh to run a quick experiment on all the nodes in my swarm (along the lines of the sketch below). Two of the nodes did not return an IP address, although dig still exited cleanly.
To get things working again, I've resorted to draining the node, deleting the network, and then making the node available again. So this feels like a missed update / state synchronisation issue to me.
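The experiment was along these lines (a sketch rather than the exact command I ran; the hosts file is a placeholder, the service name matches the test case above):

```sh
# Run the same DNS lookup from an ad-hoc container on every swarm node in parallel.
# hosts.txt is a placeholder file listing one node hostname per line.
pssh -i -h hosts.txt \
  "docker run --rm --network mpdti_default alpine \
     sh -c 'apk --quiet add bind-tools && dig +noall +answer tasks.network_manager'"
```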
Steps to reproduce the behavior
Not sure. It seems to happen after doing repeated docker stack deploy against the same stack over and over, but using different YAML files (so only touching a subset of the stack at a time). It only seems to happen when the cluster is under high load and generally unresponsive (high CPU usage on multiple hosts, as we've not tuned resource limits yet).
I suspect (but cannot yet prove) that docker service rm followed by docker stack deploy recreating the service before the containers have exited may be causing this (see the sketch below). We've set the stop_grace_period to multiple days (simulation workloads which are expensive to restart), and I've anecdotally noticed that swarm loses track of the tasks governing containers that have been signalled but not yet exited.
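The sequence I suspect is roughly the following (a sketch; the compose file name is a placeholder, the service name is taken from the network inspect output above):

```sh
# Remove a service whose containers have a very long stop_grace_period;
# the containers are signalled but may keep running for days.
docker service rm mpdti_worker_build_pool_android

# Redeploying the stack immediately recreates the service while the old
# containers are still shutting down, which is when I suspect the overlay
# network state gets out of sync.
docker stack deploy --compose-file docker-compose.partial.yml mpdti
```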
Output of docker version (same on all hosts):
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.)
Bare-metal installation of Ubuntu 18.04 with only Docker CE and some utilities (tmux, vim, etc.). The hosts have 2x 20-core CPUs (no hyperthreading) and 768 GiB RAM, so when the cluster is under load there are a lot of processes on a given node competing for attention. I suspect I may need to tune buffer sizes somewhere.
Every node in the swarm has a bonded network interface made up of 4x NICs as below, so I'm not sure exactly how overlay traffic passes between hosts; there may be more than one path.
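One way I'm thinking of checking which physical path the overlay traffic actually takes over the bond is to sniff the VXLAN data-path port (UDP 4789 by default) on the bond and on its slaves; a sketch, with interface names as placeholders since I haven't included the bond configuration here:

```sh
# Overlay traffic between nodes is VXLAN-encapsulated on UDP port 4789 by default.
sudo tcpdump -nn -i bond0 udp port 4789        # confirm overlay traffic is flowing at all
sudo tcpdump -nn -i enp94s0f0 udp port 4789    # check whether it uses this particular slave NIC
```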
Reporter's thoughts
As much as anything, I'm looking to learn how to debug this. I've crawled the documentation and researched this as best I can. I'm fairly confident this is going to repeat for me.
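My rough plan for the next occurrence is to query the embedded DNS server directly from inside an affected container and compare that with what a manager reports for the network; a sketch, assuming docker network inspect --verbose is available on the managers:

```sh
# From a container attached to the overlay network on an affected node:
# the embedded DNS server listens on 127.0.0.11 inside the container.
docker run -it --rm --network mpdti_default alpine \
  sh -c 'apk --quiet add bind-tools && dig @127.0.0.11 +noall +answer tasks.network_manager'

# On a manager node: verbose inspect should list the service records
# (virtual IPs and task IPs) the swarm believes exist on this network.
docker network inspect --verbose mpdti_default
```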