Closed: rwky closed this issue 8 years ago.
@yifan-gu you were showing me this behavior the other day, but I forget what was causing it. Can you refresh my memory?
@crawford @rwky I reported in this docker issue. https://github.com/docker/docker/issues/13885#issuecomment-149720279
Running into this as well, see kubernetes/kubernetes#17525
@rwky have you experienced this after your initial report?
@mischief yep at least once a week.
@rwky any tips on reproducing it? is it intermittent, or are you doing something to work around it?
I don't know how to predictably replicate it, however the server in question spawns and destroys (using the --rm flag) several docker containers every minute. I suspect that just running a CoreOS server that does that for a few days will trigger the hang.
The only way to resolve it is to reboot the server (you could probably kill and restart docker, but rebooting does the job).
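If rebooting is the only recovery, the hang can at least be detected automatically: probe `docker ps` with a timeout and restart the daemon when the probe stalls. A minimal sketch, assuming a systemd host; the function names and the 30-second threshold are illustrative, not from this thread:

```python
import subprocess

def command_hangs(cmd, timeout_s):
    """Return True if `cmd` fails to finish within `timeout_s` seconds."""
    try:
        subprocess.run(cmd, capture_output=True, timeout=timeout_s)
        return False
    except subprocess.TimeoutExpired:
        return True

def watchdog_once():
    # If `docker ps` hangs, restart the daemon instead of rebooting the host.
    if command_hangs(["docker", "ps"], timeout_s=30):
        subprocess.run(["systemctl", "restart", "docker"])
```

Run from cron or a systemd timer, this turns the weekly manual reboot into an automatic daemon restart.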
I'm seeing the same problem now (I'm on IRC, and will post the content of IRC here), I'll have to reboot the server in 2 hours, if in the mean time, you want me to run commands, let me know! (I run CoreOS stable)
Ok, I'm rebooting now, here is the IRC discussion: https://framabin.org/?3b8e73977c597b06#7wIlBFnWRaGO0+o34QkfauXwhm2kSvbk+NInpxe6kes=
Same for me on docker 1.10.1, after intensive use of the dockerui web app (https://hub.docker.com/r/crosbymichael/dockerui/).
I experienced similar hang with CoreOS stable (835.12.0). I could not kill -9 the docker daemon or any of its processes. Had to reboot the machine.
Okay, I have no idea if this is the same issue, but it is reproducible with this command:
/usr/bin/docker run --pid=host --privileged -v /etc/kubernetes/manifests:/etc/kubernetes/manifests -v /var/run/docker.sock:/var/run/docker.sock --rm --name kubelet gcr.io/google_containers/hyperkube:v1.1.8 /hyperkube kubelet --v=2 --config=/etc/kubernetes/manifests
+1 for a rapid resolution of this issue on CoreOS 970.1.0 with Docker 1.10.1 with kubernetes 1.1.7. Is there an active investigation of the problem? There was a suggestion that it might be a kernel issue (https://github.com/docker/docker/issues/13885#issuecomment-190147123) but I am not sure how it is addressed or should be addressed in CoreOS.
@gopinatht can you give the Beta channel a shot? That has the upstream fix for device mapper.
@crawford Thanks for the info. I will give it a try.
Just to confirm, the alpha channel does not have the fix (I am using the absolute latest version of Alpha)? Also, `docker info` output says the storage driver is overlay:
Containers: 26
Running: 10
Paused: 0
Stopped: 16
Images: 12
Server Version: 1.10.1
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
Volume: local
Network: host bridge null
Kernel Version: 4.4.1-coreos
Operating System: CoreOS 970.1.0 (Coeur Rouge)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.862 GiB
Name: ip-1x-xx-xx-xx.ec2.internal
ID: UVON:7XXW:5AZD:QJTG:2DAW:6Y52:CKN3:EVZZ:WESO:COSQ:4UGC:VKNO
Sorry, I missed the CoreOS version before. That kernel also has the patch. I doubt it will behave any differently.
@crawford No worries. Any other insights on where I can investigate? I am kind of stuck on what I should do next. This issue has become a blocker for my team using a CoreOS/kubernetes/docker solution.
Guys, I logged an issue with Docker team and they are indicating that the docker build on CoreOS is different enough (different Go version and custom build on CoreOS) (https://github.com/docker/docker/issues/20871#issuecomment-191397335) for this to be more of a CoreOS issue.
Help or pointers from CoreOS team would be appreciated.
I really don't know where to start on this one. We are going to bump to Docker 1.10.2 in the next Alpha and then bump to Go 1.5.3 in the following one. So, by this time next week, we will have a mostly vanilla Docker shipping in CoreOS. Given how long this bug has been open, I don't expect the version bump to change anything, but at least we'll be able to debug with the Docker folks a bit easier. Let's reevaluate the situation next week.
To help with debugging: if I add the `-v /sys:/sys:ro` option to the command I mentioned above, the kubelet container and docker daemon do not hang.
So the full command that does not seem to hang:
/usr/bin/docker run --pid=host --privileged -v /sys:/sys:ro -v /etc/kubernetes/manifests:/etc/kubernetes/manifests -v /var/run/docker.sock:/var/run/docker.sock --rm --name kubelet gcr.io/google_containers/hyperkube:v1.1.8 /hyperkube kubelet --v=2 --config=/etc/kubernetes/manifests
@vpal Thanks for this info. My system is a little different in that I am not using hyperkube on my AWS cluster. But I wonder why mounting /sys as a read-only mount would solve the docker hang issue. I am not able to make the connection.
@crawford I investigated the issue further and posted the results of my investigation there: https://github.com/docker/docker/issues/20871#issuecomment-192081335
FYI.
I asked this question on the docker bug page but thought it would be relevant here as well: Could this issue be related in any way to the hairpin NAT setting? Is it possible there is some sort of thread deadlock on a device? When I ran `systemctl` on my CoreOS distribution, it did not show any device with the name `vethbfe8f8f`. Could that be the source of the issue?
@gopinatht as of 983.0.0, we are shipping Docker built with Go 1.5.3, so that should eliminate a few variables.
@crawford Thanks for the heads-up. I will check it out.
I came across the same issue with docker 1.8.3. I noticed docker was built with Go version 1.4.2. Is this a known bug in docker 1.8.3? @gopinatht any luck using docker built with a higher Go version? Thanks
[root@abhi ]# docker version
Client:
Version: 1.8.3
API version: 1.20
Go version: go1.4.2
Git commit: f4bf5c7
Built: Mon Oct 12 06:06:01 UTC 2015
OS/Arch: linux/amd64
Server:
Version: 1.8.3
API version: 1.20
Go version: go1.4.2
Git commit: f4bf5c7
Built: Mon Oct 12 06:06:01 UTC 2015
OS/Arch: linux/amd64

[root@localhost abhi]# docker info
Containers: 20
Images: 324
Storage Driver: devicemapper
Pool Name: docker-8:7-268591717-pool
Pool Blocksize: 65.54 kB
Backing Filesystem: xfs
Data file: /dev/vg-abhi/data
Metadata file: /dev/vg-abhi/metadata
Data Space Used: 16.12 GB
Data Space Total: 107.4 GB
Data Space Available: 91.25 GB
Metadata Space Used: 26.04 MB
Metadata Space Total: 5.369 GB
Metadata Space Available: 5.343 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-229.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 48
Total Memory: 125.6 GiB
Name: localhost.localdomain
ID: K76J:2RP6:HKQX:NMJS:AA5D:EXY2:VGGD:5YB4:KRFI:3HW6:WWG4:6QDY
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Is this still an issue with 1.10.3? I haven't seen this one in a while (in fact, I forgot about this bug).
@crawford I've not seen this on 1.10.3 so far as I can remember, the server I normally see it on runs the beta channel so has been on 1.10.3 for a while.
Just checked my logs, not seen it since May 3rd.
@rwky Thank you for checking. Let's go ahead and close this one. We can reopen it if it turns out to still be a problem.
seeing this still:
core@ip-172-20-13-165 ~ $ docker info
Containers: 50
Running: 42
Paused: 0
Stopped: 8
Images: 241
Server Version: 1.10.3
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
Volume: local
Network: bridge null host
Kernel Version: 4.5.7-coreos
Operating System: CoreOS 1010.6.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 29.44 GiB
Name: ip-172-20-13-165.ec2.internal
ID: SS63:BOHR:326B:NZUQ:BOMB:43U2:XOLL:UCT7:BDOS:LBD3:DQJK:O52K
We see this quite often on 1068.6.0 and Kubernetes 1.3.3. Please consider reopening this issue.
core@ip-10-72-16-158 ~ $ docker info
Containers: 135
Running: 77
Paused: 0
Stopped: 58
Images: 59
Server Version: 1.10.3
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
Volume: local
Network: bridge null host
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.6.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 40
Total Memory: 157.4 GiB
Name: ip-10-72-16-158.us-west-2.compute.internal
ID: F6BA:V7OB:IYIQ:QGDZ:GQCP:2ZUW:A7S7:GE73:AG6Z:SLND:CFKU:A7R3
Also seeing these a lot in `journalctl -u docker.service`:
Jul 30 00:40:29 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:40:29.198751686Z" level=info msg="Container bb8b577470e1ec6be491414d7d39df767a113e4fb6efbc3178c7b73ea14ff15e failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:09 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:09.470564310Z" level=info msg="Container d5677abe86893968f12f1d3db15ef0097dbd218d59c52e4c06d809d8c0205664 failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:09 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:09.470599085Z" level=info msg="Container 58b3ac90b2fef33b7cb76731c2df1d56d5459be0e7f0312169e6d47b599f6d1d failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:09 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:09.470564635Z" level=info msg="Container 85d9338bddc0553f079651c737fa3c165df53ffe35a852e579a11eb1732316c0 failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:10 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:10.627024947Z" level=error msg="Handler for POST /containers/36b70ef81bd78b765f58245a49883b4e11b3ab291ce4e67535c4e9314921af10/stop returned error: Container already stopped"
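The "failed to exit within 30 seconds of SIGTERM - using the force" lines above correspond to the usual stop pattern: SIGTERM, a grace period, then SIGKILL. A generic illustration of that pattern (not docker's actual code):

```python
import subprocess

def stop_with_grace(proc, grace_s):
    """Send SIGTERM; if the process ignores it for grace_s seconds, SIGKILL it."""
    proc.terminate()                      # SIGTERM
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()                       # "using the force": SIGKILL
        proc.wait()
    return proc.returncode
```

A container whose init process ignores or never receives SIGTERM always eats the full grace period, which is why these log lines show up in batches when many containers are stopped at once.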
This just happened to me again, however at the same time I'm experiencing https://github.com/docker/docker/issues/5618 which I suspect is the cause @feelobot @zihaoyu perhaps you are experiencing the same thing?
@rwky Not sure if it is related, but I did see
Aug 04 11:10:00 ip-10-72-16-158.us-west-2.compute.internal kernel: unregister_netdevice: waiting for veth9c879be to become free. Usage count = 1
Yep that looks like it's the problem.
This is happening for me too.
Docker version 1.10.3, build 3cd164c
CoreOS stable (1068.8.0)
Kernel Version: 4.6.3-coreos
OS Image: CoreOS 1068.8.0 (MoreOS)
Container Runtime Version: docker://1.10.3
Kubelet Version: v1.2.4.7+cb08580860a02a-dirty
Kube-Proxy Version: v1.2.4.7+cb08580860a02a-dirty
docker info
Containers: 131
Running: 90
Paused: 0
Stopped: 41
Images: 17
Server Version: 1.10.3
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
Volume: local
Network: null host bridge
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.8.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 13.7 GiB
dmesg shows the following:
[163194.652062] unregister_netdevice: waiting for vethdd22a32 to become free. Usage count = 1
We are using the Flannel driver.
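Since the `unregister_netdevice` kernel message keeps coming up as the telltale sign, it is easy to scan for programmatically. A small helper (the function name is mine) that extracts the stuck veth names from dmesg or journal text:

```python
import re

# Matches kernel lines like:
#   unregister_netdevice: waiting for veth9c879be to become free. Usage count = 1
STUCK_RE = re.compile(r"unregister_netdevice: waiting for (\S+) to become free")

def stuck_interfaces(dmesg_text):
    """Return the sorted, de-duplicated interface names the kernel is waiting on."""
    return sorted(set(STUCK_RE.findall(dmesg_text)))
```

Feeding it the output of `dmesg` gives a quick yes/no on whether a hung host is hitting this particular failure mode or a different one.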
Echoing @rwky: I reported details here: https://github.com/docker/docker/issues/5618#issuecomment-243846064 . I'm probably going to try @sercand's suggestion (https://github.com/docker/docker/issues/5618#issuecomment-229878598).
It should probably also be mentioned that this of course affects, by default, clusters built by the `kube-aws` project.
sounds like a dup of https://github.com/coreos/bugs/issues/254
@mischief, it doesn't look like a dup, because we run containers the same way as @rwky (every minute with --rm), and I can see that docker hangs on exactly that container, long before the `unregistered lo` message happens. On some machines there's no `unregistered lo` message at all, but `docker ps` still hangs.
docker -D info
Containers: 6
Running: 2
Paused: 0
Stopped: 4
Images: 6
Server Version: 1.11.2
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: host bridge null
Kernel Version: 4.7.3-coreos-r2
Operating System: CoreOS 1185.3.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.8 GiB
Name: <hostname>
ID: GFHQ:Y6IV:XCQI:Y2NA:PCQN:DEII:OSPZ:LENL:OCFU:6CBI:MDFV:EMD7
Docker Root Dir: /var/lib/docker
Debug mode (client): true
Debug mode (server): false
Username: <username>
Registry: https://index.docker.io/v1/
I don't have `unregister_netdevice` in dmesg, but `docker ps` hangs like this:
$ docker ps -n 17
# works
$ docker ps -n 18
# hangs
strace:
$ strace -s 128 docker ps -n 18
clock_gettime(CLOCK_MONOTONIC, {484793, 504192514}) = 0
futex(0xc820064108, FUTEX_WAKE, 1) = 1
read(5, 0xc820400000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
write(5, "GET /v1.23/containers/json?limit=18 HTTP/1.1\r\nHost: \r\nUser-Agent: Docker-Client/1.11.2 (linux)\r\n\r\n", 98) = 98
futex(0xc820064108, FUTEX_WAKE, 1) = 1
futex(0x22509c8, FUTEX_WAIT, 0, NULL
So `curl --unix-socket /var/run/docker.sock http://localhost/v1.23/containers/json?limit=17` works, but `curl --unix-socket /var/run/docker.sock http://dolantrump/v1.23/containers/json?limit=18` hangs.
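The curl probes above can be reproduced from Python by speaking HTTP directly over the docker unix socket; with a short timeout, the "hangs forever" failure mode becomes a catchable `socket.timeout` instead. A sketch; the `UnixHTTPConnection` helper is mine, not part of any docker SDK:

```python
import http.client
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """http.client connection tunneled over a unix domain socket."""
    def __init__(self, socket_path, timeout=5):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock

def list_containers(socket_path="/var/run/docker.sock", limit=17, timeout=5):
    # Same request the curl probe makes: GET /v1.23/containers/json?limit=N
    conn = UnixHTTPConnection(socket_path, timeout=timeout)
    conn.request("GET", "/v1.23/containers/json?limit=%d" % limit)
    resp = conn.getresponse()
    return resp.status, resp.read()
```

Against a hung daemon, `list_containers` raises `socket.timeout` after `timeout` seconds rather than blocking indefinitely, which makes it usable for bisecting the failing `limit` value automatically.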
CoreOS: VERSION=1185.3.0 BUILD_ID=2016-11-01-0605
$ docker info
Containers: 52
Running: 28
Paused: 0
Stopped: 24
Images: 110
Server Version: 1.11.2
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge null host
Kernel Version: 4.7.3-coreos-r2
Operating System: CoreOS 1185.3.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 29.46 GiB
Name: gcp-1-5.c.kontena-1162.internal
ID: CTTM:L5IQ:K24M:GNOZ:JOMO:CZEG:Z756:EI3M:ZGPS:HYPO:SXVO:GIO5
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
running in google cloud platform.
Since this is a new bug, let's move this conversation over to https://github.com/coreos/bugs/issues/1654.
systemctl restart docker totally worked for me (which is definitely nicer than a reboot).
After a while of running various containers, `docker ps` hangs for no apparent reason. All other docker commands work.
Docker info:
Server:
Version: 1.8.3
API version: 1.20
Go version: go1.5.1
Git commit: cedd534-dirty
Built: Fri Oct 16 04:20:25 UTC 2015
OS/Arch: linux/amd64

Containers: 23
Images: 520
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.2-coreos
Operating System: CoreOS 835.1.0
CPUs: 8
Total Memory: 31.46 GiB
ID: KOW4:OCYI:PTET:G2F7:BPHH:Y5GU:XMXT:U2D3:5SKA:MO63:SE2Y:I42C
I ran `strace docker ps` via toolbox to see if there's a clue about what's going on, and it produced:
So it looks like the client is writing the GET request but the server isn't responding.