coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

Random containerd crashes #1909

Open Eshakk opened 7 years ago

Eshakk commented 7 years ago

Issue Report

Bug

We are observing containerd crashes across multiple nodes, seemingly at random. Each node has around 8 to 10 containers. containerd spikes to ~100% CPU usage and after some time crashes, killing all containerd-shim processes. The node is able to keep working as expected while containerd is down, but as a result docker becomes unresponsive after a couple of days.

We have no idea how to reproduce this, since it happens on random nodes. All the nodes in our network have the same configuration.

Container Linux Version

$ cat /etc/os-release 
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1298.6.0
VERSION_ID=1298.6.0
BUILD_ID=2017-03-14-2119
PRETTY_NAME="Container Linux by CoreOS 1298.6.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

docker version:

Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue Mar 14 21:10:49 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue Mar 14 21:10:49 2017
 OS/Arch:      linux/amd64

docker info:

Containers: 9
 Running: 9
 Paused: 0
 Stopped: 0
Images: 14
Server Version: 1.12.6
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.9-coreos-r1
Operating System: Container Linux by CoreOS 1298.6.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 70
Total Memory: 58.98 GiB
Name: Server01
ID: PYQ3:JC2K:TNDY:QT6J:5RCU:6V64:TRPC:E44I:5WUB:B3RW:V3VF:GLGI
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

Docker and containerd journalctl logs (sudo journalctl --no-pager -b -u docker -u containerd): journal_b_docker.txt

Docker daemon journalctl logs captured by sending SIGUSR1 (sudo kill -s SIGUSR1) to dockerd at the time of the containerd crash: journal_dockerd_sigusr1.txt
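
For reference, the dump was captured roughly like this (a sketch; SIGUSR1 makes the daemon write its goroutine stack traces into its own log, and pidof assumes a single dockerd process):

# pidof dockerd is just one way to find the daemon's PID (assumes a single dockerd process)
sudo kill -s SIGUSR1 $(pidof dockerd)
# the stack dump lands in the daemon's journal output
sudo journalctl --no-pager -b -u docker > journal_dockerd_sigusr1.txt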

If you need any specific info, I'll be happy to provide.

euank commented 7 years ago

This looks like a go runtime bug (fatal error: free list corrupted).

I'm updating containerd to be built with go1.7 which should hopefully fix this.

Eshakk commented 7 years ago

Thank you very much for the response @euank. We are hitting this scenario frequently. Could you please let us know approximately when we can expect this in the stable channel?

Eshakk commented 7 years ago

@euank Another instance of this issue: containerd_crash.txt

dockerd_with_sigusr1.txt

euank commented 7 years ago

fatal error: bad g->status in ready, which is probably fixed by https://github.com/golang/go/commit/b902a63ade47cf69218c9b38c9a783fc8cfc3041

That fix is included in 1.8.x, and was also backported to 1.7.5 https://github.com/golang/go/issues/18708

We should probably backport a go1.7.5 update to stable and beta for our next releases of them.

BugRoger commented 7 years ago

This is also a serious issue for us. We need to manually restart all affected containerds and clean up the resulting inconsistencies. :(

Eshakk commented 7 years ago

We are seeing these crashes more frequently now. As a workaround, every time we have to restart a node we have to drop all the activity currently running on it :(. We have monitoring services in place to do that, but sometimes, after a containerd crash, docker keeps running cleanly as if nothing has changed. It takes a while for docker to become unresponsive.

@BugRoger Are you also seeing any CPU/memory spike before the containerd crash?

BugRoger commented 7 years ago

We're not monitoring containerd directly. Our monitoring just picks up the resulting fallout: Kubelet unresponsive, node gets evicted... So no, I haven't seen that. If it's a bug in the Go runtime, a memory spike is likely a coincidence or a last hurrah.

Anecdotally, this just resolves itself by restarting containerd. I'm wondering why containerd.service doesn't restart on its own. I added /etc/systemd/system/containerd.service.d/10-restart.conf:

[Service]
Restart=always

Whether there's a deeper reason behind this omission, I don't know, but it helps with a segfaulting process until a proper fix lands in stable... :)
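
For completeness, after dropping the file in place it takes the usual systemd reload to pick it up (nothing containerd-specific here, just a sketch):

# reload unit files and restart containerd so the drop-in takes effect
sudo systemctl daemon-reload
sudo systemctl restart containerd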

dm0- commented 7 years ago

Closing, since Go 1.7.5 (with this fix) was used to build containerd in the stable release that should start rolling out tomorrow. Beta and alpha will be updated later in the week. Please reopen if the problem persists.

databus23 commented 7 years ago

@dm0- Can we reopen this? I just got another crash on 1353.6.0, which contains the above changes.

This gist contains the full beauty: https://gist.github.com/databus23/e2a4e2e5ec12cc07667d1a58c43cd0ff

It starts with this:

Apr 26 18:52:16 containerd[2286]: runtime: newstack sp=0xc420087710 stack=[0xc420087000, 0xc4200877e0]
Apr 26 18:52:16 containerd[2286]:         morebuf={pc:0xc9cfd8 sp:0xc420087718 lr:0x0}
Apr 26 18:52:16 containerd[2286]:         sched={pc:0x444e93 sp:0xc420087710 lr:0x0 ctxt:0x0}
Apr 26 18:52:16 containerd[2286]: fatal error: runtime: stack split at bad time
Apr 26 18:52:16 containerd[2286]: fatal error: workbuf is not empty
Apr 26 18:52:16 containerd[2286]: fatal error: unexpected signal during runtime execution

zihaoyu commented 7 years ago

containerd just crashed on one of the minions in our production kubernetes cluster:

Apr 29 00:57:11 ip-10-72-34-115.us-west-2.compute.internal containerd[1738]: acquirep: p->m=859539461120(126) p->status=1
Apr 29 00:57:11 ip-10-72-34-115.us-west-2.compute.internal containerd[1738]: fatal error: acquirep: invalid p state
Apr 29 00:57:11 ip-10-72-34-115.us-west-2.compute.internal containerd[1738]: runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
Apr 29 00:57:11 ip-10-72-34-115.us-west-2.compute.internal containerd[1738]: fatal error: workbuf is empty

BugRoger commented 7 years ago

This is causing catastrophic errors for us. We have to keep manually recovering node after node. Some of our applications (especially database clients) have a hard time recovering from this problem. There must also be a conntrack issue in Kubernetes related to this.

Any chance we can roll back to a stable containerd/docker until this can be resolved?

BugRoger commented 7 years ago

Just logged another crash. Interestingly enough, with a new error this time:

May 03 16:11:34 containerd[19317]: fatal error: exitsyscall: syscall frame is no longer valid
May 03 16:11:34 containerd[19317]: runtime stack:
May 03 16:11:34 containerd[19317]: runtime.throw(0x8925d6, 0x2d)
May 03 16:11:34 containerd[19317]:         /usr/lib/go1.7/src/runtime/panic.go:566 +0x95
May 03 16:11:34 containerd[19317]: runtime.exitsyscall.func1()
May 03 16:11:34 containerd[19317]:         /usr/lib/go1.7/src/runtime/proc.go:2470 +0x36
May 03 16:11:34 containerd[19317]: runtime.systemstack(0xc420b1c380)
May 03 16:11:34 containerd[19317]:         /usr/lib/go1.7/src/runtime/asm_amd64.s:298 +0x79
May 03 16:11:34 containerd[19317]: runtime.mstart()
May 03 16:11:34 containerd[19317]:         /usr/lib/go1.7/src/runtime/proc.go:1079

Full trace: https://gist.github.com/BugRoger/d4735f2390274f9bbda68d1a503bb43d

euank commented 7 years ago

I've been unable to reproduce this, and the stack traces don't make it obvious to me what's actually causing these crashes.

If anyone has a reproduction or more information about what sort of setup leads to this, it would be appreciated!

@BugRoger I don't think rolling back docker/containerd helps here necessarily. If these started recently, well, docker/containerd haven't changed from 1.12.x for a bit so there's not much to roll back to.

Eshakk commented 7 years ago

@euank We have only 10 containers running. Among these, there was one container logging aggressively to stdout (around 150k messages per 30 seconds). We thought this might have led to the containerd crash, so we restricted the logging to a local log file and also reduced the number of messages logged. This only reduced the frequency of the containerd crashes. Below are the most recent crashes we have seen:

containerd_stack_trace.txt

containerd_stack_trace2.txt

If you need any further details or anything specific (like the debug log) from just before the crash, please let me know and I'll set up the required config.
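
For what it's worth, the logging change above was made at the application level. A related but different way to bound per-container log pressure would be to cap and rotate the json-file driver's logs, something like the sketch below (illustrative only, not something we have tried here; <image> is a placeholder):

# rotate the container's json-file log at 10MB, keeping at most 3 files
docker run -d --log-opt max-size=10m --log-opt max-file=3 <image>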

Eshakk commented 7 years ago

@euank Another instance of containerd crash: containerd_crash_may10.txt

Could you please let me know if you need any specific info?

euank commented 7 years ago

@Eshakk that crash references go1.6 compilation strings which means it happened on a previous Container Linux version, right? If you're seeing these across different Container Linux versions, please include the version along with the reports.

Containerd does support a --debug flag (which can be added via a drop-in for the containerd.service unit), and it's possible that will contain more helpful information. If there's anything in dmesg (e.g. OOM messages, filesystem-related events, etc.), that would be useful to know too!
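
A drop-in along these lines should do it (a sketch only: the second ExecStart line has to match whatever ExecStart your containerd.service already uses, with --debug appended; the binary path below is illustrative):

# /etc/systemd/system/containerd.service.d/10-debug.conf
[Service]
# clear the unit's original ExecStart, then relaunch containerd with --debug
ExecStart=
ExecStart=/usr/bin/containerd --debug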

I did try some heavily-logging containers, but still couldn't reproduce any crashes like these.

Is there anything else unusual about your environment? Any other information which might help reproduce this would be welcome.

Eshakk commented 7 years ago

@euank Sorry for not including the details. I'm attaching details of the crash which happened this morning (May 16 02:44 UTC). docker info:

Containers: 10
 Running: 10
 Paused: 0
 Stopped: 0
Images: 16
Server Version: 1.12.6
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null bridge overlay host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.24-coreos
Operating System: Container Linux by CoreOS 1353.6.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 46
Total Memory: 58.98 GiB
Name: Server01
ID: TMCK:VRAG:6YYY:EKUK:EVF6:FOC5:DFD5:2XDR:ASSW:PXRT:5YC6:6LPJ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

docker version:

Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue Apr 25 02:09:35 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue Apr 25 02:09:35 2017
 OS/Arch:      linux/amd64

In the stack trace, however, I'm seeing go1.7 version strings??

dmesg: dmesg.txt

journalctl: journalctl.txt

dockerd config: dockerd_config.txt

We do not have anything unusual in our network config. We are planning to enable debug for containerd and will post the debug logs as soon as I get them.

databus23 commented 7 years ago

From the release notes, this bug is supposedly fixed in 1395.0.0. From this issue and all the referenced ones it's not clear to me how it was fixed. If it was fixed, why is this bug still open? @euank Could you shed some light on this?

databus23 commented 7 years ago

Ah sorry. Now I see this was marked as fixed in all channels when bumping to Go 1.7, which didn't fix it. Disregard my previous comment.

Eshakk commented 7 years ago

@euank We have enabled the debug flag for the containerd service. Below are the logs captured from the most recent containerd crash. docker info:

Containers: 11
 Running: 11
 Paused: 0
 Stopped: 0
Images: 17
Server Version: 1.12.6
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host overlay bridge
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.24-coreos
Operating System: Container Linux by CoreOS 1353.8.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 46
Total Memory: 58.98 GiB
Name: Server01
ID: XW6K:MQM5:SLJ5:57JF:C52N:G6J4:FT3R:NGAP:7NBC:KVF3:MHYM:GEWH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

docker version:

Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue May 30 23:15:08 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue May 30 23:15:08 2017
 OS/Arch:      linux/amd64

journalctl: containerd_crash_jun01.txt

On another node, we have also seen a problem with containerd while rebooting the cores. This might help. docker info:

Containers: 11
 Running: 11
 Paused: 0
 Stopped: 0
Images: 17
Server Version: 1.12.6
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host bridge overlay null
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.24-coreos
Operating System: Container Linux by CoreOS 1353.8.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 70
Total Memory: 58.98 GiB
Name: Server02
ID: PYQ3:JC2K:TNDY:QT6J:5RCU:6V64:TRPC:E44I:5WUB:B3RW:V3VF:GLGI
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

docker version:

Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue May 30 23:15:08 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue May 30 23:15:08 2017
 OS/Arch:      linux/amd64

containerd_stop_sigterm_timedout.txt

Could you please point me to documentation, if any, for understanding containerd's debug output?

Please let me know if you need any further information.

Eshakk commented 7 years ago

@euank Could you please let me know if there is any update on this?

databus23 commented 7 years ago

@Eshakk As reported by @BugRoger, we are also seeing this bug frequently. I happened to notice that your machines have a lot of CPUs. We also have a lot of CPUs (80) on our bare-metal nodes.

Given that this seems to be a Go runtime bug and not many people seem to suffer from it, maybe it is only triggered by a sufficiently large GOMAXPROCS setting, which defaults to the number of CPUs AFAIK.

It might be worth reducing GOMAXPROCS for the containerd process and seeing if that improves things.
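
Something like this drop-in would do it (a sketch; the value 8 is an arbitrary example, and the unit name assumes the stock containerd.service):

# /etc/systemd/system/containerd.service.d/10-gomaxprocs.conf
[Service]
# cap the Go scheduler at 8 OS threads instead of one per CPU
Environment=GOMAXPROCS=8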

@zihaoyu Do you also have a lot of CPUs in the nodes where you are experiencing those crashes?

krmayankk commented 7 years ago

I am also seeing this on 1.12.6 on CentOS 7. Has this issue been resolved, and is the fix available in 1.13.1?

boardcode commented 7 years ago

I am also experiencing this crash across nodes running docker version 1.12.6 on CoreOS 1353.7.0.

Eshakk commented 7 years ago

@boardcode Please check my last comment in this issue: https://github.com/moby/moby/issues/32537. Unfortunately, this issue has not been successfully reproduced by anyone.

brenix commented 7 years ago

We had this issue and followed @databus23's suggestion about limiting GOMAXPROCS. After creating a systemd drop-in for containerd which sets the GOMAXPROCS environment variable to a lower value, we have not seen this issue occur again (2w+).

pfhayes commented 7 years ago

I also encountered this issue on a machine with 16 cores, and setting GOMAXPROCS resolved it.

euank commented 6 years ago

I think all the evidence here still points to this being a go runtime bug. It's possible it's https://github.com/golang/go/issues/20427, which resulted in these sorts of random crashes; there are certainly some similarities here.

Our next step should be to start building containerd with a newer Go version and see if anyone still hits this.
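
In the meantime, a rough way to double-check which Go version a given containerd binary was actually built with is to grep its embedded runtime strings (a sketch; /usr/bin/containerd is an assumed path, use whatever your system actually runs):

# the Go runtime version string is embedded in the binary
strings /usr/bin/containerd | grep -o 'go1\.[0-9][0-9.]*' | sort -u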