We are also seeing this. We wrapped phantomjs with a timeout script ( http://www.bashcookbook.com/bashinfo/source/bash-4.0/examples/scripts/timeout3 ), which seemed to help, but we can reproduce this easily on deploys when lots of containers are running (300+). It looks like the container does not shut down because phantomjs is locked, and docker gets stuck trying to unregister the net device. Even after the container processes are long gone, something in docker hangs on to the device.
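For reference, the wrapper idea boils down to something like this with coreutils timeout instead of the linked script (a sketch; the 300-second limit and the script path are placeholders, not our exact values):

# kill phantomjs after 5 minutes; escalate to SIGKILL 10s later if it ignores SIGTERM
timeout --kill-after=10 300 phantomjs /path/to/render.js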
I saw a bug upstream somewhere but I can't seem to find it now, and I have seen this behavior on Ubuntu and Debian as well, so I am assuming that this is in Docker's hands, but someone may be able to follow up on that better than me.
Upstream ticket : https://github.com/docker/docker/issues/5618
The issue is widespread across Docker installs on multiple OSes.
This issue requires a reboot of a node, which messes everything up.
@rkononov are you running ubuntu within your containers? There is a long-running Ubuntu bug report with activity here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1403152
@popsikle yep we're using ubuntu
+1, using ubuntu. Very critical bug unfortunately
Is this still an issue with the later versions of CoreOS and Docker?
Closing due to inactivity.
It looks like this may be a kernel bug with a patch released upstream. Though I haven't tested it myself.
I'm having this bug with a home-compiled kernel 4.3 running on KVM + Debian 8. May I help in testing/finding a solution? (If so, how?)
Also experiencing this issue.
CentOS 7.2 running 4.3.3 (elrepo) kernel, guest is Debian running nginx (nginx:latest official stock container).
Please let me know if there is any way I can help.
Having this problem on CoreOS alpha 921.0.0, on a cluster brought up by your very own kube-aws tool.
@iameli Can you give 928.0.0 a shot? This has the 4.4 kernel with the patch mentioned by @binarybana.
kudos!!! I have been seeing this all the time!
@crawford Sure. The cluster is up and running, I'll check back in here if it breaks.
@crawford I took over for @iameli. FYI, 928.0.0 still exhibits the problem. Happened to one node after about 6 days of uptime. The other node in the cluster is fine, strangely.
I did notice influxdb and heapster running out of their allocated cgroup memory and getting OOM-killed, so I scaled both of those down to zero (since we don't need them as far as I know).
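(Scaling them down looked roughly like this; a sketch where the rc names are assumptions based on the default kube-system addons:)

kubectl --namespace=kube-system scale rc heapster --replicas=0
kubectl --namespace=kube-system scale rc monitoring-influxdb --replicas=0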
The kernel message is a bit different this time... I don't see a stack trace, but docker does go zombie.
Here's the tail end of dmesg:
[ 102.608365] eth0: renamed from veth1ed5641
[ 102.642752] docker0: port 3(veth249bda1) entered forwarding state
[ 102.646182] docker0: port 3(veth249bda1) entered forwarding state
[ 103.078115] IPv6: eth0: IPv6 duplicate address fe80::42:aff:fe02:3904 detected!
[ 103.160124] IPv6: eth0: IPv6 duplicate address fe80::42:aff:fe02:3903 detected!
[ 116.767074] docker0: port 1(veth836fe1f) entered forwarding state
[ 117.567062] docker0: port 2(veth3d78f53) entered forwarding state
[ 117.663072] docker0: port 3(veth249bda1) entered forwarding state
[543431.489079] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543441.543100] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543451.593092] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543461.648116] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543471.693077] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543481.738097] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543491.793081] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543501.839081] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543511.884079] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543521.935076] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543531.980110] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543542.025107] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543552.070089] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543562.115107] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543572.160077] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543582.218090] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543592.263117] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543602.309105] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
[543612.354076] unregister_netdevice: waiting for vethe142506 to become free. Usage count = 1
Having this issue on Debian running a CentOS 7 image.
hadoop@jin:~$ uname -a
Linux jin 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.5-1~bpo8+1 (2016-02-23) x86_64 GNU/Linux
hadoop@jin:~$ docker version
Client:
Version: 1.10.2
API version: 1.22
Go version: go1.5.3
Git commit: c3959b1
Built: Mon Feb 22 21:22:53 2016
OS/Arch: linux/amd64
Server:
Version: 1.10.2
API version: 1.22
Go version: go1.5.3
Git commit: c3959b1
Built: Mon Feb 22 21:22:53 2016
OS/Arch: linux/amd64
hadoop@jin:~$ sudo tail -n 100 /var/log/syslog
Mar 9 13:45:16 jin NetworkManager[624]:
This issue happens regularly on my CoreOS machines
$ docker info
Containers: 140
Images: 284
Server Version: 1.9.1
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.3.6-coreos
Operating System: CoreOS 899.13.0
CPUs: 4
Total Memory: 2.864 GiB
Name: dev-coreos-k8s_16
ID: ******
Username: *****
Registry: https://index.docker.io/v1/
$ dmesg
[ 3454.288789] veth15fe3b7: renamed from eth0
[ 3454.428263] docker0: port 8(veth9bc56d3) entered disabled state
[ 3455.853996] docker0: port 8(veth9bc56d3) entered forwarding state
[ 3456.006591] docker0: port 8(veth9bc56d3) entered forwarding state
[ 3456.306205] IPv6: veth15fe3b7: IPv6 duplicate address fe80::42:aff:fe02:4509 detected!
[ 3456.430095] docker0: port 8(veth9bc56d3) entered disabled state
[ 3456.486652] device veth9bc56d3 left promiscuous mode
[ 3456.531678] docker0: port 8(veth9bc56d3) entered disabled state
[ 3466.662046] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3476.940070] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3487.144036] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3497.360033] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3507.498082] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3517.621075] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3527.798045] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3538.028077] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3548.158067] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3558.345041] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3568.542075] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3578.749063] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3588.956034] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3599.095094] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3609.231037] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
[ 3619.422043] unregister_netdevice: waiting for veth15fe3b7 to become free. Usage count = 1
....
Is there a workaround for this issue?
Running into this on CoreOS 1010.1.0 with kubernetes 1.2.2 and docker 1.10.3
@spiffxp Are you using promiscuous mode or veth hairpin mode for the container bridge devices? cf: https://github.com/kubernetes/kubernetes/issues/20096
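For anyone unsure which mode their bridge ports are in, the kernel exposes it in sysfs; a quick check (a sketch, assuming the default docker0 bridge):

# 1 = hairpin mode enabled on that bridge port, 0 = disabled
for port in /sys/class/net/docker0/brif/*; do
  echo "$(basename "$port"): $(cat "$port"/hairpin_mode)"
done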
I'm seeing this on 1010.1.0 via EC2. Everything is at default. I'm pulling and running containers via cloud-config and they're stalling with this issue.
Still seeing this as well here in AWS
CoreOS stable (1010.1.0) Docker version 1.10.3, build 8acee1b
@binarybana whatever the defaults are in 1.2.2, using flags from https://github.com/samsung-cnct/kraken/blob/master/ansible/roles/k8s-kubelet/defaults/main.yaml#L2-L23
Been waiting for a fix on this for the last six months now :(
Host: ubuntu 14.04
Docker Image: centOS7
$ dmesg
[60244.438117] device vethdb2b4ec entered promiscuous mode
[60244.438227] IPv6: ADDRCONF(NETDEV_UP): vethdb2b4ec: link is not ready
[60244.438232] docker0: port 2(vethdb2b4ec) entered forwarding state
[60244.438236] docker0: port 2(vethdb2b4ec) entered forwarding state
[60244.438456] docker0: port 2(vethdb2b4ec) entered disabled state
[60246.744069] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60256.800077] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60266.856081] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60276.892073] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60286.948071] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60297.000070] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60307.080072] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60317.140069] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60327.192074] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60337.236076] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60347.300075] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60357.368074] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60367.432071] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60377.496073] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60387.560073] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60397.620072] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60407.692074] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60417.752076] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60427.816072] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60437.888072] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60447.956075] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60458.016072] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60468.076072] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60478.132072] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60481.220084] INFO: task docker:1526 blocked for more than 120 seconds.
[60481.223477] Not tainted 3.13.0-74-generic #118-Ubuntu
[60481.226271] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[60481.230196] docker D ffff88078fcb3180 0 1526 1 0x00000000
[60481.230202] ffff8807635e9d40 0000000000000082 ffff880763ba1800 ffff8807635e9fd8
[60481.230204] 0000000000013180 0000000000013180 ffff880763ba1800 ffffffff81cdbce0
[60481.230207] ffffffff81cdbce4 ffff880763ba1800 00000000ffffffff ffffffff81cdbce8
[60481.230211] Call Trace:
[60481.230218] [<ffffffff81729499>] schedule_preempt_disabled+0x29/0x70
[60481.230221] [<ffffffff8172b305>] __mutex_lock_slowpath+0x135/0x1b0
[60481.230225] [<ffffffff811a31f9>] ? __kmalloc+0x1e9/0x230
[60481.230227] [<ffffffff8172b39f>] mutex_lock+0x1f/0x2f
[60481.230230] [<ffffffff81620e61>] copy_net_ns+0x71/0x130
[60481.230234] [<ffffffff8108fbc9>] create_new_namespaces+0xf9/0x180
[60481.230236] [<ffffffff8108fcc3>] copy_namespaces+0x73/0xa0
[60481.230239] [<ffffffff81065b96>] copy_process.part.26+0x9a6/0x16b0
[60481.230241] [<ffffffff81066a75>] do_fork+0xd5/0x340
[60481.230243] [<ffffffff81066d66>] SyS_clone+0x16/0x20
[60481.230245] [<ffffffff81735ab9>] stub_clone+0x69/0x90
[60481.230247] [<ffffffff8173575d>] ? system_call_fastpath+0x1a/0x1f
[60488.200069] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60498.256071] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60508.320074] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60518.392070] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60528.452067] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60538.508066] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60548.572067] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60558.624060] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60568.672071] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60578.720071] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60588.788074] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60598.848085] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60601.228081] INFO: task docker:1526 blocked for more than 120 seconds.
[60601.231498] Not tainted 3.13.0-74-generic #118-Ubuntu
[60601.234917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[60601.239668] docker D ffff88078fcb3180 0 1526 1 0x00000000
[60601.239672] ffff8807635e9d40 0000000000000082 ffff880763ba1800 ffff8807635e9fd8
[60601.239675] 0000000000013180 0000000000013180 ffff880763ba1800 ffffffff81cdbce0
[60601.239677] ffffffff81cdbce4 ffff880763ba1800 00000000ffffffff ffffffff81cdbce8
[60601.239680] Call Trace:
[60601.239687] [<ffffffff81729499>] schedule_preempt_disabled+0x29/0x70
[60601.239690] [<ffffffff8172b305>] __mutex_lock_slowpath+0x135/0x1b0
[60601.239694] [<ffffffff811a31f9>] ? __kmalloc+0x1e9/0x230
[60601.239696] [<ffffffff8172b39f>] mutex_lock+0x1f/0x2f
[60601.239699] [<ffffffff81620e61>] copy_net_ns+0x71/0x130
[60601.239703] [<ffffffff8108fbc9>] create_new_namespaces+0xf9/0x180
[60601.239706] [<ffffffff8108fcc3>] copy_namespaces+0x73/0xa0
[60601.239709] [<ffffffff81065b96>] copy_process.part.26+0x9a6/0x16b0
[60601.239711] [<ffffffff81066a75>] do_fork+0xd5/0x340
[60601.239712] [<ffffffff81066d66>] SyS_clone+0x16/0x20
[60601.239715] [<ffffffff81735ab9>] stub_clone+0x69/0x90
[60601.239717] [<ffffffff8173575d>] ? system_call_fastpath+0x1a/0x1f
[60608.904068] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60618.960068] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60629.024094] unregister_netdevice: waiting for lo to become free. Usage count = 1
[60639.088078] unregister_netdevice: waiting for lo to become free. Usage count = 1
Does anyone have a workaround? An update on this would be appreciated.
does anyone have a reliable way to reproduce this, or can share their configurations on top of CoreOS for testing? also, if you are running into this regularly, can you reproduce it on CoreOS alpha 1032.1.0 with linux 4.5.2?
I'm experiencing the same issue with K8s on CoreOS Alpha. Docker on the k8s nodes stops responding when deleting a big deployment (100 pods or more). It happens to one node at a time.
NAME=CoreOS
ID=coreos
VERSION=1032.1.0
VERSION_ID=1032.1.0
BUILD_ID=2016-05-05-1941
PRETTY_NAME="CoreOS 1032.1.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
May 10 07:28:53 tectonic-mgmt2 kernel: unregister_netdevice: waiting for veth19618b9 to become free. Usage count = 1
veth pair veth19618b9 doesn't exist on the system.
I'm seeing similar results as @lsztachanski on 899.15.0 using K8s. I've had this happen twice now when I'm using Jenkins Kubernetes to build about 15 containers on a single node and shut them down quickly on build finish.
May 13 00:01:41 ip-10-21-40-154.ec2.internal kernel: unregister_netdevice: waiting for veth7091e93 to become free. Usage count = 1
@mischief I've had good luck reproducing this with a specific kubernetes Job mentioned here: https://github.com/docker/docker/pull/23175 . I can confirm that it's still happening with that job on what I believe is a relatively stock kubernetes cluster on a CoreOS 1053.2.0 node running 4.6.0 – I'm happy to share any other information you might find helpful too!
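For anyone without a kubernetes cluster handy, a docker-only churn loop in the same spirit may also trigger it (a sketch, not the exact Job from that PR; the image and counts are arbitrary):

# start containers that generate traffic, then force-remove them in parallel
for i in $(seq 1 50); do
  docker run -d --name "churn-$i" busybox ping 8.8.8.8 >/dev/null
done
docker ps -q --filter name=churn- | xargs -P 10 -n 1 docker rm -f
dmesg | tail -n 20 | grep unregister_netdevice && echo "possible hit"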
I'm seeing the same issue on Linux centos72_base 3.10.0-327.10.1.el7.x86_64. docker daemon processes are defunct.
I saw this with alpha 1081; I can reproduce it unfortunately frequently (but not totally reliably) by running the Kubernetes node_e2e suite (export IMAGE_PROJECT=coreos-cloud; export IMAGES=coreos-alpha-1081-1-0-v20160617; make test_e2e_node FOCUS="" REMOTE=true will run it on GCE if you have appropriate credentials configured).
Ctrl+C-ing the test partway through quite frequently results in hitting the issue (unregister_netdevice: waiting for veth7531386 to become free. Usage count = 1).
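For convenience, those repro commands as a runnable sequence (requires GCE credentials and a kubernetes source checkout):

export IMAGE_PROJECT=coreos-cloud
export IMAGES=coreos-alpha-1081-1-0-v20160617
make test_e2e_node FOCUS="" REMOTE=true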
Again. CoreOS stable (1068.8.0)
This is my workaround for now, saved as coreos-issue-254-WR.service:
[Unit]
Description=CoreOS issue 254 (unregister_netdevice) workaround

[Service]
TimeoutStartSec=0
Restart=always
RestartSec=1min
# Block until the kernel logs the telltale message, then take the locksmith
# cluster lock (so only one node reboots at a time) and reboot.
ExecStart=/usr/bin/bash -c ' \
journalctl -k -f | grep -m 1 "unregister_netdevice: waiting for veth"; \
until locksmithctl unlock; locksmithctl lock; do \
sleep 5; \
done; \
reboot'

[X-Fleet]
Global=true
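Since the unit is Global, submitting it cluster-wide is just (assuming fleet is in use):

fleetctl start coreos-issue-254-WR.service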
Got the same problem on CoreOS 1010.6.0 (MoreOS) with kubernetes 1.2.4 and docker 1.10.3.
Seeing this issue as well. CoreOS 1068.8.0, Docker 1.10.3, kernel 4.6.3. I pulled some of the system logs if anybody is interested. I've been following at https://github.com/docker/docker/issues/5618.
Same issue here, CoreOS 1068.8.0, Docker 1.10.3, kernel 4.6.3. CoreOS sets veth hairpin mode when using flannel. Any workaround to set it to promiscuous mode, where the problem doesn't seem to happen?
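Until there's a proper knob, something like this might approximate promiscuous-bridge mode by hand (a sketch, run as root; the docker0 bridge name is an assumption, and docker/kubelet may re-enable hairpin on newly created ports):

# put the bridge in promiscuous mode and turn hairpin off on existing ports
ip link set docker0 promisc on
for port in /sys/class/net/docker0/brif/*; do
  echo 0 > "$port"/hairpin_mode
done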
Same problem too:
[Fri Aug 19 21:57:24 2016] unregister_netdevice: waiting for lo to become free. Usage count = 1
[Fri Aug 19 21:57:34 2016] unregister_netdevice: waiting for lo to become free. Usage count = 1
[Fri Aug 19 21:57:44 2016] unregister_netdevice: waiting for lo to become free. Usage count = 1
[Fri Aug 19 21:57:55 2016] unregister_netdevice: waiting for lo to become free. Usage count = 1
cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1068.6.0
VERSION_ID=1068.6.0
BUILD_ID=2016-07-12-1826
PRETTY_NAME="CoreOS 1068.6.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
same problem:
Message from syslogd@localhost at Aug 26 10:51:21 ... kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1
[root@localhost single_vm_stuff]# uname -a
Linux localhost.localdomain 4.5.2-1.el7.elrepo.x86_64 #1 SMP Wed Apr 20 15:17:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost single_vm_stuff]# docker version
Client:
Version: 1.10.3
API version: 1.22
Go version: go1.5.3
Git commit: 20f81dd
Built: Thu Mar 10 15:39:25 2016
OS/Arch: linux/amd64
Server:
Version: 1.10.3
API version: 1.22
Go version: go1.5.3
Git commit: 20f81dd
Built: Thu Mar 10 15:39:25 2016
OS/Arch: linux/amd64
Also experiencing this issue. CentOS Linux 7 (Core) 4.7.2-1.el7.elrepo.x86_64 #1 SMP Sat Aug 20 18:40:06 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
docker --version
Docker version 1.11.2, build b9f10c9
I have only been able to reproduce this when using --userland-proxy=false. I more regularly experience the "unable to set hairpin mode: operation not supported" error as well, again only with --userland-proxy=false.
We must consider --userland-proxy=false unsupported at this time, and upstream is aware of the issue.
Is anyone seeing these problems without --userland-proxy=false?
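If you aren't sure whether your daemon runs with the flag, something like this should show it (a sketch; assumes the daemon process is named docker, as in 1.10/1.11):

# print the daemon's command line and extract the flag, if present
ps -ww -C docker -o args= | grep -o -- '--userland-proxy=[a-z]*'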
@dm0 identified a lack of bridge state change synchronization in docker in https://github.com/coreos/bugs/issues/1554 which is quite likely related to the problems mentioned in this bug.
Testing with --userland-proxy=false by starting a few thousand containers using the patch from the issue mentioned above (dm0-/libnetwork@4343ba4c21f1a121f9e867efda3231a61dc5565e) resulted in no network problems or error messages. See docker/docker#26492 for tracking the issue upstream.
Fixed via https://github.com/coreos/docker/pull/29.
We are also experiencing this issue on CoreOS, while using the official nginx:latest container (Debian).
VERSION=1122.2.0
VERSION_ID=1122.2.0
BUILD_ID=2016-09-06-1449
PRETTY_NAME="CoreOS 1122.2.0 (MoreOS)"
@crawford, does this fix also target the latest stable version of CoreOS?
@pietervisser this fix is not present in Stable yet. What is now in Beta (1185.1.0) has the fix and should be promoted to Stable in two weeks. It would be great if you could take that for a spin and let us know if you run into any trouble.
I'm seeing issues which I believe are related:
Arch: Bare Metal (HP)
CoreOS: 1185.1.0
Docker: 1.11.2
I'm handling this with a fleet job that greps for "unregister_netdevice: waiting for veth" and reboots (using locksmithctl to get a cluster lock first)...
Here are some of my log entries: https://gist.github.com/stensonb/5e54282c0d351f746bf738590464e342
Notice that the first detection of "unregister_netdevice: waiting for veth" happens at 21:28:27, but the machine isn't rebooted until at least 21:34:51 (taking almost 6 minutes to "nicely" reboot due to this issue).
I'm happy to provide additional help/logs if that'll help identify this issue.
@crawford thanks for your reply. I would really like to help here, but we don't know the exact steps to reproduce this issue. Restarting the nginx container on our production setup does sometimes trigger the issue.
If you need additional info, please let me know!
Oct 18 13:10:20 coreos-node-01 kernel: unregister_netdevice: waiting for veth3b967c3 to become free. Usage count = 4
Oct 18 13:10:30 coreos-node-01 kernel: unregister_netdevice: waiting for veth3b967c3 to become free. Usage count = 4
Oct 18 13:10:40 coreos-node-01 kernel: unregister_netdevice: waiting for veth3b967c3 to become free. Usage count = 4
Oct 18 13:10:50 coreos-node-01 kernel: unregister_netdevice: waiting for veth3b967c3 to become free. Usage count = 4
Oct 18 13:11:00 coreos-node-01 kernel: unregister_netdevice: waiting for veth3b967c3 to become free. Usage count = 4
Oct 18 13:11:10 coreos-node-01 kernel: unregister_netdevice: waiting for veth3b967c3 to become free. Usage count = 4
Oct 18 13:11:20 coreos-node-01 kernel: unregister_netdevice: waiting for veth3b967c3 to become free. Usage count = 4
Oct 18 13:11:30 coreos-node-01 kernel: unregister_netdevice: waiting for veth3b967c3 to become free. Usage count = 4
Oct 18 13:11:40 coreos-node-01 kernel: unregister_netdevice: waiting for veth3b967c3 to become free. Usage count = 4
Oct 18 13:11:50 coreos-node-01 kernel: unregister_netdevice: waiting for veth3b967c3 to become free. Usage count = 4
Oct 18 13:12:00 coreos-node-01 kernel: unregister_netdevice: waiting for veth3b967c3 to become free. Usage count = 4
Oct 18 13:12:10 coreos-node-01 kernel: unregister_netdevice: waiting for veth3b967c3 to become free. Usage count = 4
Oct 18 13:12:21 coreos-node-01 kernel: unregister_netdevice: waiting for veth3b967c3 to become free. Usage count = 4
@crawford as per https://github.com/docker/docker/issues/5618#issuecomment-252420204 it seems the fix for this specific issue would be updating the version of the Linux kernel in CoreOS to 4.4.22 (the issue isn't specific to docker, it seems concurrent veth calls from any userland software could trigger it).
The workaround in coreos/docker#29 may help here by reducing the frequency with which docker calls into the kernel concurrently (esp. with systemd), but ultimately this lockup is an issue deep inside the IP stack in the kernel (and docker blocking forever on stuck system calls).
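A quick way to check whether a host's kernel predates that fix (a sketch; assumes the fix landed in 4.4.22 as the linked comment says, and that sort -V is available):

# compare the running kernel version against the one carrying the fix
fixed=4.4.22
running=$(uname -r | cut -d- -f1)
if [ "$running" != "$fixed" ] && [ "$(printf '%s\n' "$fixed" "$running" | sort -V | head -n1)" = "$running" ]; then
  echo "kernel $running predates $fixed; upgrade recommended"
fi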
@sethp-jive the relevant upstream commit (torvalds/linux@751eb6b6042a596b0080967c1a529a9fe98dac1d) is apparently now in 4.8+, which is in alpha 1214.0.0.
@crawford We're still receiving this on CoreOS-alpha-1262.0.0-hvm (ami-8c76789b), which includes coreos/coreos-overlay#2237.
Dec 22 12:30:10 ip-10-65-73-60.ec2.internal kernel: Memory cgroup out of memory: Kill process 8771 (phantomjs) score 962 or sacrifice child
Dec 22 12:30:10 ip-10-65-73-60.ec2.internal kernel: Killed process 8771 (phantomjs) total-vm:2566388kB, anon-rss:389016kB, file-rss:3956kB, shmem-rss:0kB
Dec 22 12:30:10 ip-10-65-73-60.ec2.internal kernel: oom_reaper: reaped process 8771 (phantomjs), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Dec 22 12:30:15 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:30:21 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:30:25 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:30:31 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:30:35 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:30:41 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:30:45 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:30:51 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:30:55 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:31:01 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:31:04 ip-10-0-176-241.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:31:05 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:31:11 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:31:15 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:31:21 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:31:25 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:31:31 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:31:35 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:31:41 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:31:45 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:31:51 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:31:55 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:32:02 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:32:05 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:32:12 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:32:15 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:32:22 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:32:25 ip-10-216-196-165.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:32:32 ip-10-65-73-60.ec2.internal kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: INFO: task exe:11686 blocked for more than 120 seconds.
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: Not tainted 4.9.0-coreos #1
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: exe D 0 11686 1978 0x00000080
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: ffff9a5b8b3b7800 ffff9a5a383df080 ffff9a59994a8000 ffff9a5ad5e6d7c0
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: ffff9a5ba0258300 ffffadd80225fc68 ffffffffb55c52ed 0000000000000000
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: ffffadd80225fc38 00000000b55c5a6e ffffadd80225fc90 ffff9a5ad5e6d7c0
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: Call Trace:
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb55c52ed>] ? __schedule+0x23d/0x6e0
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb55c57c6>] schedule+0x36/0x80
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb55c5a6e>] schedule_preempt_disabled+0xe/0x10
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb55c7749>] __mutex_lock_slowpath+0xb9/0x130
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb55c77df>] mutex_lock+0x1f/0x30
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb54b51cf>] copy_net_ns+0x9f/0x170
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb509b56b>] create_new_namespaces+0x11b/0x1e0
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb509b6a3>] copy_namespaces+0x73/0xa0
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb5077cf4>] copy_process.part.32+0x974/0x1be0
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb5079147>] ? _do_fork+0xd7/0x390
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb5079147>] _do_fork+0xd7/0x390
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb50794a9>] SyS_clone+0x19/0x20
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb5003b3a>] do_syscall_64+0x5a/0x160
Dec 22 12:32:35 ip-10-65-73-60.ec2.internal kernel: [<ffffffffb55ca7eb>] entry_SYSCALL64_slow_path+0x25/0x25
After a few hours of running phantomjs inside docker containers (we constantly rotate them), we started getting "unregister_netdevice: waiting for lo to become free. Usage count = 1", and systemd shows that the docker.service unit became unresponsive; restarting docker.service doesn't help.
callstack:
CoreOS version:
uname -a:
docker -H 0.0.0.0:2375 version:
Also, I've noticed that the docker process became a zombie:
699 ? Zsl 36:30 [docker] <defunct>
Not sure there is an easy way to reproduce :(
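For anyone else checking whether a node is in this state, the zombie daemon shows up like this (a sketch; per the earlier comments, only a reboot recovers the node once it does):

# list zombie (defunct) processes named docker
ps -eo pid,stat,comm | awk '$2 ~ /Z/ && $3 == "docker"'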