Eshakk opened this issue 7 years ago
This looks like a Go runtime bug (fatal error: free list corrupted).
I'm updating containerd to be built with go1.7, which should hopefully fix this.
Thank you very much for the response, @euank. We are frequently hitting this scenario. Could you please let us know approximately when we can expect this in the stable channel?
@euank Another instance of this issue: containerd_crash.txt
That crash shows fatal error: bad g->status in ready, which is probably fixed by https://github.com/golang/go/commit/b902a63ade47cf69218c9b38c9a783fc8cfc3041
That fix is included in Go 1.8.x and was also backported to 1.7.5 (https://github.com/golang/go/issues/18708).
We should probably backport a go1.7.5 update to stable and beta for our next releases of them.
This is also a serious issue for us. We need to manually restart all affected containerds and clean up the resulting inconsistencies. :(
We are seeing these crashes more frequently now. Every time, as a workaround, we have to restart the node and drop all the workloads currently running on it :(. We have monitoring services in place to do that, but sometimes, after a containerd crash, docker keeps running cleanly as if nothing has changed. It takes a while for docker to become unresponsive.
@BugRoger Are you also seeing any CPU/memory spike before the containerd crash?
We're not monitoring containerd directly. Our monitoring just picks up the resulting fallout: Kubelet unresponsive, node gets evicted... So no, I haven't seen that. If it's a bug in Golang, a memory spike is likely a coincidence or a last hurrah.
Anecdotally, this just resolves itself by restarting containerd. I'm wondering why the containerd.service unit doesn't restart on its own. I added /etc/systemd/system/containerd.service.d/10-restart.conf:
[Service]
Restart=always
Whether there's a deeper reason behind this omission, I don't know, but it helps with a segfaulting process until a proper fix lands in stable... :)
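For anyone copying this workaround: the drop-in only takes effect after systemd reloads its unit files, so roughly (standard systemd commands, nothing containerd-specific):

sudo systemctl daemon-reload
systemctl show containerd -p Restart   # should now print Restart=always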
Closing, since Go 1.7.5 (with this fix) was used to build containerd in the stable release that should start rolling out tomorrow. Beta and alpha will be updated later in the week. Please reopen if the problem persists.
@dm0- Can we reopen this? I just got another crash on 1353.6.0, which contains the above changes.
This gist contains the full beauty: https://gist.github.com/databus23/e2a4e2e5ec12cc07667d1a58c43cd0ff
It starts with this:
Apr 26 18:52:16 containerd[2286]: runtime: newstack sp=0xc420087710 stack=[0xc420087000, 0xc4200877e0]
Apr 26 18:52:16 containerd[2286]: morebuf={pc:0xc9cfd8 sp:0xc420087718 lr:0x0}
Apr 26 18:52:16 containerd[2286]: sched={pc:0x444e93 sp:0xc420087710 lr:0x0 ctxt:0x0}
Apr 26 18:52:16 containerd[2286]: fatal error: runtime: stack split at bad time
Apr 26 18:52:16 containerd[2286]: fatal error: workbuf is not empty
Apr 26 18:52:16 containerd[2286]: fatal error: unexpected signal during runtime execution
containerd just crashed on one of the minions in our production Kubernetes cluster:
Apr 29 00:57:11 ip-10-72-34-115.us-west-2.compute.internal containerd[1738]: acquirep: p->m=859539461120(126) p->status=1
Apr 29 00:57:11 ip-10-72-34-115.us-west-2.compute.internal containerd[1738]: fatal error: acquirep: invalid p state
Apr 29 00:57:11 ip-10-72-34-115.us-west-2.compute.internal containerd[1738]: runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
Apr 29 00:57:11 ip-10-72-34-115.us-west-2.compute.internal containerd[1738]: fatal error: workbuf is empty
This is causing catastrophic errors for us. We need to keep manually recovering node after node. Some of our applications (especially database clients) have a hard time recovering from this problem. There must also be a conntrack issue in Kubernetes related to this.
Any chance we can roll back to a stable containerd/docker until this can be resolved?
Just logged another crash. Interestingly enough, with a new error this time:
May 03 16:11:34 containerd[19317]: fatal error: exitsyscall: syscall frame is no longer valid
May 03 16:11:34 containerd[19317]: runtime stack:
May 03 16:11:34 containerd[19317]: runtime.throw(0x8925d6, 0x2d)
May 03 16:11:34 containerd[19317]: /usr/lib/go1.7/src/runtime/panic.go:566 +0x95
May 03 16:11:34 containerd[19317]: runtime.exitsyscall.func1()
May 03 16:11:34 containerd[19317]: /usr/lib/go1.7/src/runtime/proc.go:2470 +0x36
May 03 16:11:34 containerd[19317]: runtime.systemstack(0xc420b1c380)
May 03 16:11:34 containerd[19317]: /usr/lib/go1.7/src/runtime/asm_amd64.s:298 +0x79
May 03 16:11:34 containerd[19317]: runtime.mstart()
May 03 16:11:34 containerd[19317]: /usr/lib/go1.7/src/runtime/proc.go:1079
Full trace: https://gist.github.com/BugRoger/d4735f2390274f9bbda68d1a503bb43d
I've been unable to reproduce this, and the stack traces don't make it obvious to me what's actually causing these crashes.
If anyone has a reproduction or more information about what sort of setup leads to this, it would be appreciated!
@BugRoger I don't think rolling back docker/containerd helps here necessarily. If these started recently, well, docker/containerd haven't changed from 1.12.x for a bit so there's not much to roll back to.
@euank We have only 10 containers running. Among these, there was one container logging aggressively to stdout (around 150k messages/30 seconds). We thought this might have led to the containerd crash, so we restricted the logging to a local log file and reduced the number of messages logged too. This only reduced the frequency of containerd crashes. Below are the most recent crashes we have seen:
If you need any further details or anything specific (like debug logs) from just before the crash, please let me know. I'll do the required configuration.
@euank Another instance of a containerd crash: containerd_crash_may10.txt
Could you please let me know if you need any specific info?
@Eshakk that crash references go1.6 compilation strings, which means it happened on a previous Container Linux version, right? If you're seeing these across different Container Linux versions, please include the version along with the reports.
Containerd does support a --debug flag (which can be added via a drop-in for the containerd.service unit), and it's possible that will contain more helpful information.
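A rough sketch of such a drop-in (the file name and the ExecStart path/arguments below are placeholders, not the real unit definition; copy the actual ExecStart line from systemctl cat containerd.service and append --debug to it):

# /etc/systemd/system/containerd.service.d/20-debug.conf
[Service]
# Clear the stock ExecStart, then re-declare it with --debug appended.
ExecStart=
ExecStart=/usr/bin/containerd --debug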
If there's anything in dmesg (e.g. OOM messages, filesystem-related events, etc.), that would be useful to know too!
I did try some heavily logging containers, but still couldn't reproduce any crashes like these.
Is there anything else unusual about your environment? Any other information which might help reproduce this would be welcome.
@euank Sorry for not including the details.
I'm attaching details of the crash which happened this morning (May 16 02:44 UTC).
docker info:
Containers: 10
Running: 10
Paused: 0
Stopped: 0
Images: 16
Server Version: 1.12.6
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: null bridge overlay host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.24-coreos
Operating System: Container Linux by CoreOS 1353.6.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 46
Total Memory: 58.98 GiB
Name: Server01
ID: TMCK:VRAG:6YYY:EKUK:EVF6:FOC5:DFD5:2XDR:ASSW:PXRT:5YC6:6LPJ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
127.0.0.0/8
docker version:
Client:
Version: 1.12.6
API version: 1.24
Go version: go1.6.3
Git commit: d5236f0
Built: Tue Apr 25 02:09:35 2017
OS/Arch: linux/amd64
Server:
Version: 1.12.6
API version: 1.24
Go version: go1.6.3
Git commit: d5236f0
Built: Tue Apr 25 02:09:35 2017
OS/Arch: linux/amd64
In the stack trace, however, I'm seeing go1.7??
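(One way to double-check which Go toolchain the containerd binary itself was built with, independent of what docker version reports for the docker binaries, is to grep the version string embedded in the binary; a rough sketch, assuming the binary lives at /usr/bin/containerd:

grep -aoE 'go1\.[0-9]+(\.[0-9]+)?' /usr/bin/containerd | sort -u

The -a flag makes grep treat the binary as text, and the output should include the Go runtime version it was compiled with.)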
dmesg: dmesg.txt
journalctl: journalctl.txt
dockerd config: dockerd_config.txt
We do not have anything unusual in our network config. We are planning to enable debug for containerd and will post the logs with debug enabled as soon as we get them.
From the release notes this bug is supposedly fixed in 1395.0.0.
From this issue and all the referenced ones it's not clear to me how this was fixed. If this was fixed, why is this bug still open?
@euank Could you shine some light on this?
Ah sorry. Now I see this was mentioned as fixed in all channels when bumping to Go 1.7, which didn't fix it. Disregard my previous comment.
@euank We have enabled the debug flag for the containerd service. Below are the logs captured from the most recent containerd crash.
docker info:
Containers: 11
Running: 11
Paused: 0
Stopped: 0
Images: 17
Server Version: 1.12.6
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: null host overlay bridge
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.24-coreos
Operating System: Container Linux by CoreOS 1353.8.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 46
Total Memory: 58.98 GiB
Name: Server01
ID: XW6K:MQM5:SLJ5:57JF:C52N:G6J4:FT3R:NGAP:7NBC:KVF3:MHYM:GEWH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
127.0.0.0/8
docker version:
Client:
Version: 1.12.6
API version: 1.24
Go version: go1.6.3
Git commit: d5236f0
Built: Tue May 30 23:15:08 2017
OS/Arch: linux/amd64
Server:
Version: 1.12.6
API version: 1.24
Go version: go1.6.3
Git commit: d5236f0
Built: Tue May 30 23:15:08 2017
OS/Arch: linux/amd64
journalctl:
containerd_crash_jun01.txt
On another node, we have also seen a problem with containerd while rebooting CoreOS. This might help:
docker info:
Containers: 11
Running: 11
Paused: 0
Stopped: 0
Images: 17
Server Version: 1.12.6
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: host bridge overlay null
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.24-coreos
Operating System: Container Linux by CoreOS 1353.8.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 70
Total Memory: 58.98 GiB
Name: Server02
ID: PYQ3:JC2K:TNDY:QT6J:5RCU:6V64:TRPC:E44I:5WUB:B3RW:V3VF:GLGI
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
127.0.0.0/8
docker version:
Client:
Version: 1.12.6
API version: 1.24
Go version: go1.6.3
Git commit: d5236f0
Built: Tue May 30 23:15:08 2017
OS/Arch: linux/amd64
Server:
Version: 1.12.6
API version: 1.24
Go version: go1.6.3
Git commit: d5236f0
Built: Tue May 30 23:15:08 2017
OS/Arch: linux/amd64
containerd_stop_sigterm_timedout.txt
Could you please point me to documentation, if any, for understanding containerd's debug information?
Please let me know if you need any further information.
@euank Could you please let me know if there is any update on this?
@Eshakk As reported by @BugRoger, we are also seeing this bug frequently. I happened to notice that your machines have a lot of CPUs. We also have a lot of CPUs (80) on our bare-metal nodes.
The fact that this seems to be a Go runtime bug, yet not many people seem to suffer from it, suggests it may only be triggered by a sufficiently large GOMAXPROCS setting, which defaults to the number of CPUs AFAIK.
It might be worth reducing GOMAXPROCS for the containerd process to see if that improves things.
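Something like the following drop-in would do it (just a sketch; the file name and the value 8 are arbitrary choices, not tested values from this thread):

# /etc/systemd/system/containerd.service.d/30-gomaxprocs.conf
[Service]
# Limit the number of OS threads running Go code simultaneously
# (GOMAXPROCS defaults to the number of CPUs).
Environment=GOMAXPROCS=8

followed by a systemctl daemon-reload and a restart of containerd.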
@zihaoyu Do you also have a lot of CPUs in the nodes where you are experiencing those crashes?
I am also seeing this on 1.12.6 on CentOS 7. Has this issue been resolved, and is the fix available in 1.13.1?
I am also experiencing this crash across nodes running docker version 1.12.6 on CoreOS 1353.7.0.
@boardcode Please check my last comment in this issue: https://github.com/moby/moby/issues/32537.
Unfortunately, this issue has not been successfully reproduced by anyone.
We had this issue and used @databus23's suggestion about limiting GOMAXPROCS. After creating a systemd drop-in for containerd which sets the GOMAXPROCS environment variable to a lower value, we have not seen this issue occur again (2w+).
I also encountered this issue on a machine with 16 cores, and setting GOMAXPROCS resolved it.
I think all the evidence here still points to this being a go runtime bug. It's possible it's https://github.com/golang/go/issues/20427, which resulted in these sorts of random crashes; there are certainly some similarities here.
Our next step should be to start building containerd with a newer Go version and see if anyone still hits this.
Issue Report
Bug
We are observing containerd crashes randomly across multiple nodes. Each node has around 8 to 10 containers. containerd spikes to ~100% CPU usage and, after some time, crashes, killing all containerd-shim processes. Our nodes are able to work as expected while containerd is down, but as a result docker becomes unresponsive after a couple of days.
We have no idea how to reproduce this since it is happening on random nodes. All the nodes in our network have the same configuration.
Container Linux Version
docker version:
docker info:
Docker and containerd journalctl logs (sudo journalctl --no-pager -b -u docker -u containerd): journal_b_docker.txt
Docker daemon journalctl logs captured using the SIGUSR1 signal (sudo kill -s SIGUSR1) at the time of the containerd crash: journal_dockerd_sigusr1.txt
If you need any specific info, I'll be happy to provide.