coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

docker ps hangs #965

Closed rwky closed 8 years ago

rwky commented 8 years ago

After running various containers for a while, docker ps hangs for no apparent reason. All other docker commands work.

Docker info:

Server:
 Version: 1.8.3
 API version: 1.20
 Go version: go1.5.1
 Git commit: cedd534-dirty
 Built: Fri Oct 16 04:20:25 UTC 2015
 OS/Arch: linux/amd64

Containers: 23
Images: 520
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.2-coreos
Operating System: CoreOS 835.1.0
CPUs: 8
Total Memory: 31.46 GiB
ID: KOW4:OCYI:PTET:G2F7:BPHH:Y5GU:XMXT:U2D3:5SKA:MO63:SE2Y:I42C

I ran strace docker ps via toolbox to see if there was a clue about what's going on, and it produced:

socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(3, {sa_family=AF_LOCAL, sun_path="/run/systemd/journal/socket"}, 30) = -1 ENOENT (No such file or directory)
close(3)                                = 0
ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
stat("/root/.docker/config.json", 0xc82008a240) = -1 ENOENT (No such file or directory)
stat("/root/.dockercfg", 0xc82008a2d0)  = -1 ENOENT (No such file or directory)
futex(0xc8201a2110, FUTEX_WAKE, 1)      = 1
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(3, {sa_family=AF_LOCAL, sun_path="/media/root/var/run/docker.sock"}, 34) = 0
epoll_create1(EPOLL_CLOEXEC)            = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=172969648, u64=140205085380272}}) = 0
getsockname(3, {sa_family=AF_LOCAL, NULL}, [2]) = 0
getpeername(3, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
futex(0xc8201a2110, FUTEX_WAKE, 1)      = 1
read(3, 0xc8200ab000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(3, "GET /v1.20/containers/json HTTP/"..., 93) = 93
epoll_wait(4, {{EPOLLOUT, {u32=172969648, u64=140205085380272}}}, 128, 0) = 1
futex(0x181f250, FUTEX_WAIT, 0, NULL

So it looks like the client is writing the GET request but the server isn't responding.
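
One way to confirm that the daemon itself is stuck, rather than the CLI or the socket path, is to hit the API directly over the unix socket. A minimal check, assuming a curl build with --unix-socket support (7.40+) is available, e.g. from toolbox:

# Bypass the docker CLI and talk to the daemon directly.
# If this also hangs, the daemon (not the client) is stuck.
curl --unix-socket /var/run/docker.sock http://localhost/v1.20/containers/json

# Other endpoints usually still answer, matching the observation
# that only docker ps hangs.
curl --unix-socket /var/run/docker.sock http://localhost/v1.20/version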

crawford commented 8 years ago

@yifan-gu you were showing me this behavior the other day, but I forget what was causing it. Can you refresh my memory?

yifan-gu commented 8 years ago

@crawford @rwky I reported this in this docker issue: https://github.com/docker/docker/issues/13885#issuecomment-149720279

spiffxp commented 8 years ago

Running into this as well, see kubernetes/kubernetes#17525

mischief commented 8 years ago

@rwky have you experienced this after your initial report?

rwky commented 8 years ago

@mischief yep at least once a week.

mischief commented 8 years ago

@rwky any tips on reproducing it? is it intermittent, or are you doing something to work around it?

rwky commented 8 years ago

I don't know how to predictably replicate it; however, the server in question spawns and destroys (using the --rm flag) several docker containers every minute. I suspect that just running a CoreOS server that does that for a few days will trigger the hang.

The only way to resolve it is to reboot the server (you could probably kill and restart docker, but rebooting does the job).
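
For anyone trying to reproduce this, a rough sketch of the kind of churn described above, i.e. short-lived containers started and removed every minute; the image, count, and interval are illustrative, not the actual workload on the affected server:

#!/bin/bash
# Continuously spawn and destroy short-lived containers with --rm,
# mimicking the workload that appears to trigger the hang after a few days.
while true; do
    for i in $(seq 1 5); do
        docker run --rm busybox sh -c 'sleep 1' &
    done
    wait
    sleep 60
done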

pierreozoux commented 8 years ago

I'm seeing the same problem now (I'm on IRC and will post the content of the IRC discussion here). I'll have to reboot the server in 2 hours; if in the meantime you want me to run commands, let me know! (I run CoreOS stable.)

Ok, I'm rebooting now, here is the IRC discussion: https://framabin.org/?3b8e73977c597b06#7wIlBFnWRaGO0+o34QkfauXwhm2kSvbk+NInpxe6kes=

leonty commented 8 years ago

Same for me on docker 1.10.1 after intensive use of the dockerui web app (https://hub.docker.com/r/crosbymichael/dockerui/).

vpal commented 8 years ago

I experienced a similar hang with CoreOS stable (835.12.0). I could not kill -9 the docker daemon or any of its processes. Had to reboot the machine.

Okay, I have no idea if this is the same issue, but it is reproducible with this command:

/usr/bin/docker run --pid=host --privileged -v /etc/kubernetes/manifests:/etc/kubernetes/manifests -v /var/run/docker.sock:/var/run/docker.sock --rm --name kubelet gcr.io/google_containers/hyperkube:v1.1.8 /hyperkube kubelet --v=2 --config=/etc/kubernetes/manifests

gopinatht commented 8 years ago

+1 for a rapid resolution of this issue on CoreOS 970.1.0 with Docker 1.10.1 and Kubernetes 1.1.7. Is there an active investigation of the problem? There was a suggestion that it might be a kernel issue (https://github.com/docker/docker/issues/13885#issuecomment-190147123), but I am not sure how it is, or should be, addressed in CoreOS.

crawford commented 8 years ago

@gopinatht can you give the Beta channel a shot? That has the upstream fix for device mapper.
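
For reference, switching an existing machine to the Beta channel is usually just a matter of changing the update group and letting update-engine pull the new image. A sketch using the standard CoreOS update.conf mechanism; adjust to your provisioning setup:

# Point update-engine at the Beta channel.
sudo sh -c 'echo "GROUP=beta" > /etc/coreos/update.conf'

# Restart update-engine and force an update check.
sudo systemctl restart update-engine
sudo update_engine_client -update

# Reboot into the new image once the update has been applied.
sudo reboot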

gopinatht commented 8 years ago

@crawford Thanks for the info. I will give it a try.

Just to confirm, the alpha channel does not have the fix (I am using the absolute latest version of Alpha)? Also, docker info output says the storage driver is overlay:

Containers: 26
 Running: 10
 Paused: 0
 Stopped: 16
Images: 12
Server Version: 1.10.1
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: host bridge null
Kernel Version: 4.4.1-coreos
Operating System: CoreOS 970.1.0 (Coeur Rouge)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.862 GiB
Name: ip-1x-xx-xx-xx.ec2.internal
ID: UVON:7XXW:5AZD:QJTG:2DAW:6Y52:CKN3:EVZZ:WESO:COSQ:4UGC:VKNO

crawford commented 8 years ago

Sorry, I missed the CoreOS version before. That kernel also has the patch. I doubt it will behave any differently.

gopinatht commented 8 years ago

@crawford No worries. Any other insights on where I can investigate? I am kind of stuck on what I should do next. This issue has become a blocker for my team using a CoreOS/kubernetes/docker solution.

gopinatht commented 8 years ago

Guys, I logged an issue with the Docker team, and they are indicating that the docker build on CoreOS is different enough (different Go version and a custom build on CoreOS) (https://github.com/docker/docker/issues/20871#issuecomment-191397335) for this to be more of a CoreOS issue.

Help or pointers from CoreOS team would be appreciated.

crawford commented 8 years ago

I really don't know where to start on this one. We are going to bump to Docker 1.10.2 in the next Alpha and then bump to Go 1.5.3 in the following one. So, by this time next week, we will have a mostly vanilla Docker shipping in CoreOS. Given how long this bug has been open, I don't expect the version bump to change anything, but at least we'll be able to debug with the Docker folks a bit easier. Let's reevaluate the situation next week.

vpal commented 8 years ago

To help with debugging: if I add the -v /sys:/sys:ro option to the command I mentioned above, the kubelet container and docker daemon do not hang. So the full command that does not seem to hang:

/usr/bin/docker run --pid=host --privileged -v /sys:/sys:ro -v /etc/kubernetes/manifests:/etc/kubernetes/manifests -v /var/run/docker.sock:/var/run/docker.sock --rm --name kubelet gcr.io/google_containers/hyperkube:v1.1.8 /hyperkube kubelet --v=2 --config=/etc/kubernetes/manifests

gopinatht commented 8 years ago

@vpal Thanks for this info. My system is a little different in that I am not using hyperkube on my AWS cluster. But I wonder why mounting /sys as a read-only mount should solve the docker hang issue; I am not able to make the connection.

gopinatht commented 8 years ago

@crawford I investigated the issue further and posted the results of my investigation there: https://github.com/docker/docker/issues/20871#issuecomment-192081335

FYI.

gopinatht commented 8 years ago

I asked this question on the docker bug page but thought it would be relevant here as well: could this issue be related in any way to the hairpin NAT setting? Is it possible there is some sort of thread deadlock on a device? When I ran systemctl on my CoreOS distribution, it did not show any device with the name vethbfe8f8f. Could that be the source of the issue?
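
(Side note: systemctl lists systemd units rather than network interfaces, so it would not show a veth device either way. A quick way to check whether the interface actually exists, and whether the kernel is stuck tearing one down, assuming iproute2 is available on the host:)

# List the veth interfaces the kernel currently knows about.
ip -o link show | grep veth

# Check whether the kernel is stuck unregistering a network device,
# which would show up as repeated "unregister_netdevice" messages.
dmesg | grep unregister_netdevice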

crawford commented 8 years ago

@gopinatht as of 983.0.0, we are shipping Docker built with Go 1.5.3, so that should eliminate a few variables.

gopinatht commented 8 years ago

@crawford Thanks for the heads-up. I will check it out.

abhisheyke commented 8 years ago

I came across the same issue with docker 1.8.3. I noticed docker was built with Go version 1.4.2. Is it a known bug in docker 1.8.3? @gopinatht any luck using docker built with a higher Go version? Thanks.

[root@abhi ]# docker version
Client:
 Version: 1.8.3
 API version: 1.20
 Go version: go1.4.2
 Git commit: f4bf5c7
 Built: Mon Oct 12 06:06:01 UTC 2015
 OS/Arch: linux/amd64

Server:
 Version: 1.8.3
 API version: 1.20
 Go version: go1.4.2
 Git commit: f4bf5c7
 Built: Mon Oct 12 06:06:01 UTC 2015
 OS/Arch: linux/amd64

[root@localhost abhi]# docker info
Containers: 20
Images: 324
Storage Driver: devicemapper
 Pool Name: docker-8:7-268591717-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: xfs
 Data file: /dev/vg-abhi/data
 Metadata file: /dev/vg-abhi/metadata
 Data Space Used: 16.12 GB
 Data Space Total: 107.4 GB
 Data Space Available: 91.25 GB
 Metadata Space Used: 26.04 MB
 Metadata Space Total: 5.369 GB
 Metadata Space Available: 5.343 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-229.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 48
Total Memory: 125.6 GiB
Name: localhost.localdomain
ID: K76J:2RP6:HKQX:NMJS:AA5D:EXY2:VGGD:5YB4:KRFI:3HW6:WWG4:6QDY
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

crawford commented 8 years ago

Is this still an issue with 1.10.3? I haven't seen this one in a while (in fact, I forgot about this bug).

rwky commented 8 years ago

@crawford I've not seen this on 1.10.3, as far as I can remember; the server I normally see it on runs the beta channel, so it has been on 1.10.3 for a while.

Just checked my logs: I haven't seen it since May 3rd.

crawford commented 8 years ago

@rwky Thank you for checking. Let's go ahead and close this one. We can reopen it if it turns out to still be a problem.

feelobot commented 8 years ago

seeing this still:

core@ip-172-20-13-165 ~ $ docker info
Containers: 50
 Running: 42
 Paused: 0
 Stopped: 8
Images: 241
Server Version: 1.10.3
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.5.7-coreos
Operating System: CoreOS 1010.6.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 29.44 GiB
Name: ip-172-20-13-165.ec2.internal
ID: SS63:BOHR:326B:NZUQ:BOMB:43U2:XOLL:UCT7:BDOS:LBD3:DQJK:O52K

zihaoyu commented 8 years ago

We see this quite often on 1068.6.0 and Kubernetes 1.3.3. Please consider reopening this issue.

core@ip-10-72-16-158 ~ $ docker info
Containers: 135
 Running: 77
 Paused: 0
 Stopped: 58
Images: 59
Server Version: 1.10.3
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.6.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 40
Total Memory: 157.4 GiB
Name: ip-10-72-16-158.us-west-2.compute.internal
ID: F6BA:V7OB:IYIQ:QGDZ:GQCP:2ZUW:A7S7:GE73:AG6Z:SLND:CFKU:A7R3

Also seeing these a lot in journalctl -u docker.service:

Jul 30 00:40:29 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:40:29.198751686Z" level=info msg="Container bb8b577470e1ec6be491414d7d39df767a113e4fb6efbc3178c7b73ea14ff15e failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:09 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:09.470564310Z" level=info msg="Container d5677abe86893968f12f1d3db15ef0097dbd218d59c52e4c06d809d8c0205664 failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:09 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:09.470599085Z" level=info msg="Container 58b3ac90b2fef33b7cb76731c2df1d56d5459be0e7f0312169e6d47b599f6d1d failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:09 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:09.470564635Z" level=info msg="Container 85d9338bddc0553f079651c737fa3c165df53ffe35a852e579a11eb1732316c0 failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:10 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:10.627024947Z" level=error msg="Handler for POST /containers/36b70ef81bd78b765f58245a49883b4e11b3ab291ce4e67535c4e9314921af10/stop returned error: Container already stopped"

rwky commented 8 years ago

This just happened to me again; however, at the same time I'm experiencing https://github.com/docker/docker/issues/5618, which I suspect is the cause. @feelobot @zihaoyu perhaps you are experiencing the same thing?
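
For anyone who catches the daemon in this state, a goroutine dump can help pin down where it is blocked; recent docker daemons dump all goroutine stack traces to their log on SIGUSR1. A sketch, assuming the daemon still handles signals and logs to the journal:

# Find the daemon's main PID from the docker.service unit.
DOCKER_PID=$(systemctl show -p MainPID docker.service | cut -d= -f2)

# SIGUSR1 asks the daemon to write all goroutine stack traces to its log.
sudo kill -USR1 "$DOCKER_PID"

# The traces show up in the daemon's journal output.
journalctl -u docker.service --since "10 minutes ago"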

zihaoyu commented 8 years ago

@rwky Not sure if it is related, but I did see

Aug 04 11:10:00 ip-10-72-16-158.us-west-2.compute.internal kernel: unregister_netdevice: waiting for veth9c879be to become free. Usage count = 1

rwky commented 8 years ago

Yep that looks like it's the problem.

nizar-m commented 8 years ago

This is happening for me too.

Docker version 1.10.3, build 3cd164c

CoreOS stable (1068.8.0)

Kernel Version: 4.6.3-coreos
OS Image: CoreOS 1068.8.0 (MoreOS)
Container Runtime Version: docker://1.10.3
Kubelet Version: v1.2.4.7+cb08580860a02a-dirty
Kube-Proxy Version: v1.2.4.7+cb08580860a02a-dirty

docker info

Containers: 131
 Running: 90
 Paused: 0
 Stopped: 41
Images: 17
Server Version: 1.10.3
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: null host bridge
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.8.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 13.7 GiB

dmesg shows the following:

[163194.652062] unregister_netdevice: waiting for vethdd22a32 to become free. Usage count = 1

We are using the flannel driver.

alph486 commented 8 years ago

Echoing @rwky: I reported details here: https://github.com/docker/docker/issues/5618#issuecomment-243846064. I'm probably going to try @sercand's suggestion (https://github.com/docker/docker/issues/5618#issuecomment-229878598).

It should probably also be mentioned that this of course affects, by default, clusters built by the kube-aws project.

mischief commented 8 years ago

sounds like a dup of https://github.com/coreos/bugs/issues/254

liquid-sky commented 7 years ago

@mischief, it doesn't look like a dup, because we run containers the same way as @rwky (every minute with --rm) and I can see that docker hangs on exactly that container, long before the "unregistered lo" message appears. On some machines there's no "unregistered lo" message at all, but docker ps still hangs.

docker -D info
Containers: 6
 Running: 2
 Paused: 0
 Stopped: 4
Images: 6
Server Version: 1.11.2
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host bridge null
Kernel Version: 4.7.3-coreos-r2
Operating System: CoreOS 1185.3.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.8 GiB
Name: <hostname>
ID: GFHQ:Y6IV:XCQI:Y2NA:PCQN:DEII:OSPZ:LENL:OCFU:6CBI:MDFV:EMD7
Docker Root Dir: /var/lib/docker
Debug mode (client): true
Debug mode (server): false
Username: <username>
Registry: https://index.docker.io/v1/

matti commented 7 years ago

I don't have unregister_netdevice in dmesg, but docker ps hangs like this:

$ docker ps -n 17
# works

$ docker ps -n 18
# hangs

strace:

$ strace -s  128 docker ps -n 18
clock_gettime(CLOCK_MONOTONIC, {484793, 504192514}) = 0
futex(0xc820064108, FUTEX_WAKE, 1)      = 1
read(5, 0xc820400000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(5, "GET /v1.23/containers/json?limit=18 HTTP/1.1\r\nHost: \r\nUser-Agent: Docker-Client/1.11.2 (linux)\r\n\r\n", 98) = 98
futex(0xc820064108, FUTEX_WAKE, 1)      = 1
futex(0x22509c8, FUTEX_WAIT, 0, NULL

so $ curl --unix-socket /var/run/docker.sock http://localhost/v1.23/containers/json?limit=17 works, but curl --unix-socket /var/run/docker.sock http://dolantrump/v1.23/containers/json?limit=18 hangs.
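
That limit-dependent behaviour suggests the daemon blocks while gathering data for one specific container. A small sketch to find the cutoff automatically by walking the limit up over the unix socket, using the same API call with a timeout so the hanging request does not block the loop (the upper bound of 50 and the 5-second timeout are arbitrary):

#!/bin/bash
# Raise the container-list limit until the API stops answering;
# the container added at that point is the one the daemon hangs on.
for n in $(seq 1 50); do
    if ! curl -s -m 5 --unix-socket /var/run/docker.sock \
        "http://localhost/v1.23/containers/json?limit=$n" > /dev/null; then
        echo "API stops responding at limit=$n"
        break
    fi
done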

CoreOS: VERSION=1185.3.0 BUILD_ID=2016-11-01-0605

$ docker info
Containers: 52
 Running: 28
 Paused: 0
 Stopped: 24
Images: 110
Server Version: 1.11.2
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.7.3-coreos-r2
Operating System: CoreOS 1185.3.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 29.46 GiB
Name: gcp-1-5.c.kontena-1162.internal
ID: CTTM:L5IQ:K24M:GNOZ:JOMO:CZEG:Z756:EI3M:ZGPS:HYPO:SXVO:GIO5
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/

Running on Google Cloud Platform.

crawford commented 7 years ago

Since this is a new bug, let's move this conversation over to https://github.com/coreos/bugs/issues/1654.

klawed commented 7 years ago

systemctl restart docker totally worked for me (which is definitely nicer than a reboot).
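
As a stopgap until the underlying issue is fixed, a simple watchdog along these lines can detect the hang and restart the daemon automatically. A sketch only; the 30-second timeout and 60-second interval are arbitrary, it needs to run as root, and restarting docker will of course disrupt running containers depending on your setup:

#!/bin/bash
# Restart docker.service whenever `docker ps` stops answering.
while true; do
    if ! timeout 30 docker ps > /dev/null 2>&1; then
        echo "docker ps hung for 30s, restarting docker.service"
        systemctl restart docker.service
    fi
    sleep 60
done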