coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

Pods in Kubernetes end up in a crash loop because of the rkt garbage collector after the 12-hour mark #1965

Open oba11 opened 7 years ago

oba11 commented 7 years ago

Issue Report

Bug

Container Linux Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1353.7.0
VERSION_ID=1353.7.0
BUILD_ID=2017-04-26-2154
PRETTY_NAME="Container Linux by CoreOS 1353.7.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"

Environment

Kubernetes Version: 1.6.3 AWS HVM

Expected Behavior

Kubernetes pods keep running indefinitely while relying on flannel as the network backend.

Actual Behavior

After around 12 hours, I noticed that all the pods on the node ended up in a crash loop. After a couple of days of debugging, I found that it was caused by the rkt garbage collector timer.

Not sure why, but the rkt pod created by flannel-docker-opts.service only stops after successful execution, leaving an exited pod behind on the node.

Even though I'm using Docker as the container runtime, the kubelet tries to clean up the rkt pods and fails. At exactly the 12-hour mark after the node was started, all containers on the kubelet start ending up in a crash loop until the kubelet is restarted.

A hack fix was to delete the exited rkt pods created by flannel-docker-opts.service, and the pods on the kubelet have been running without a hiccup for a while now:

coreos:
  units:
    - name: flannel-docker-opts.service
      drop-ins:
        - name: 10-custom.conf
          content: |
            [Service]
            Environment="FLANNEL_IMAGE_TAG=v0.7.1"
            TimeoutStartSec=120
            ExecStop=-/usr/bin/rkt rm --uuid-file=/var/lib/coreos/flannel-wrapper2.uuid
rkt list --full
UUID          APP IMAGE NAME      IMAGE ID    STATE   CREATED       STARTED       NETWORKS
0beb4683-7c3f-4b30-89c3-1673dc6d19dd  flannel quay.io/coreos/flannel:v0.7.1 sha512-f5a1ca0ed2aa running   2017-05-12 13:35:45 +0000 UTC 2017-05-12 13:35:45 +0000 UTC
1ef13c00-f574-44b9-a125-c18f6c36e146  flannel quay.io/coreos/flannel:v0.7.1 sha512-f5a1ca0ed2aa exited garbage  2017-05-12 13:36:06 +0000 UTC 2017-05-12 13:36:06 +0000 UTC
systemctl list-timers --all
NEXT                         LEFT     LAST                         PASSED       UNIT                         ACTIVATES
Sat 2017-05-13 13:35:31 UTC  7h left  Sat 2017-05-13 01:35:31 UTC  4h 15min ago rkt-gc.timer                 rkt-gc.service
Sat 2017-05-13 13:50:10 UTC  7h left  Fri 2017-05-12 13:50:10 UTC  16h ago      systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Sun 2017-05-14 00:00:00 UTC  18h left Sat 2017-05-13 00:00:31 UTC  5h 50min ago logrotate.timer              logrotate.service
Sun 2017-05-14 04:00:00 UTC  22h left Sat 2017-05-13 04:00:28 UTC  1h 50min ago docker-gc.timer              docker-gc.service
May 12 13:35:02 localhost systemd[1]: Starting Garbage Collection for rkt...
May 12 13:35:04 ip-10-1-1-1.compute.internal systemd[1]: Started Garbage Collection for rkt.
May 12 13:35:10 ip-10-1-1-1.compute.internal systemd[1]: Starting Garbage Collection for rkt...
May 12 13:35:10 ip-10-1-1-1.compute.internal systemd[1]: Started Garbage Collection for rkt.
May 12 13:36:13 ip-10-1-1-1.compute.internal systemd[1]: Starting Kubelet...
May 13 01:35:31 ip-10-1-1-1.compute.internal systemd[1]: Starting Garbage Collection for rkt...
May 13 01:35:31 ip-10-1-1-1.compute.internal rkt[16441]: gc: moving pod "1ef13c00-f574-44b9-a125-c18f6c36e146" to garbage
May 13 01:35:32 ip-10-1-1-1.compute.internal rkt[16441]: gc: pod "1ef13c00-f574-44b9-a125-c18f6c36e146" not removed: still within grace period (24h0m0s)
May 13 01:35:32 ip-10-1-1-1.compute.internal systemd[1]: Started Garbage Collection for rkt.
May 13 01:35:50 ip-10-1-1-1.compute.internal hyperkube[1444]: W0513 01:35:50.885035    1444 raw.go:87] Error while processing event ("/var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/cpuset/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/77e9c2bd58c14c34c2b1a7c09147ea565b4976f59e361d18052440d006640d4b": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/cpuset/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/77e9c2bd58c14c34c2b1a7c09147ea565b4976f59e361d18052440d006640d4b: no such file or directory
May 13 01:35:50 ip-10-1-1-1.compute.internal hyperkube[1444]: W0513 01:35:50.887364    1444 raw.go:87] Error while processing event ("/var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/memory/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/77e9c2bd58c14c34c2b1a7c09147ea565b4976f59e361d18052440d006640d4b": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/memory/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/77e9c2bd58c14c34c2b1a7c09147ea565b4976f59e361d18052440d006640d4b: no such file or directory
May 13 01:35:50 ip-10-1-1-1.compute.internal hyperkube[1444]: W0513 01:35:50.888770    1444 raw.go:87] Error while processing event ("/var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/cpu,cpuacct/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/77e9c2bd58c14c34c2b1a7c09147ea565b4976f59e361d18052440d006640d4b": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/cpu,cpuacct/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/77e9c2bd58c14c34c2b1a7c09147ea565b4976f59e361d18052440d006640d4b: no such file or directory
May 13 01:35:54 ip-10-1-1-1.compute.internal hyperkube[1444]: I0513 01:35:54.938519    1444 kuberuntime_manager.go:458] Container {Name:newrelic-agent Image:newrelic/nrsysmond:latest Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:NRSYSMOND_license_key Value:13ba6285bc99e43b4553217023d50d9aca2ca5eb ValueFrom:nil} {Name:NRSYSMOND_logfile Value:/var/log/nrsysmond.log ValueFrom:nil}] Resources:{Limits:map[cpu:{i:{value:100 scale:-3} d:{Dec:<nil>} s:100m Format:DecimalSI} memory:{i:{value:268435456 scale:0} d:{Dec:<nil>} s: Format:BinarySI}] Requests:map[cpu:{i:{value:100 scale:-3} d:{Dec:<nil>} s:100m Format:DecimalSI} memory:{i:{value:268435456 scale:0} d:{Dec:<nil>} s: Format:BinarySI}]} VolumeMounts:[{Name:dev-vol ReadOnly:false MountPath:/dev SubPath:} {Name:docker-sock ReadOnly:false MountPath:/var/run/docker.sock SubPath:} {Name:sys-vol ReadOnly:true MountPath:/sys SubPath:} {Name:default-token-csw5g ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath:}] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:Always SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
May 13 01:35:55 ip-10-1-1-1.compute.internal hyperkube[1444]: W0513 01:35:55.042657    1444 raw.go:87] Error while processing event ("/var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/cpuset/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/172fc49caad3491df8b63df5d992f85190b7c08071cd0327bd3bdb2105d5a072": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/cpuset/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/172fc49caad3491df8b63df5d992f85190b7c08071cd0327bd3bdb2105d5a072: no such file or directory
May 13 01:35:55 ip-10-1-1-1.compute.internal hyperkube[1444]: W0513 01:35:55.044934    1444 raw.go:87] Error while processing event ("/var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/memory/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/172fc49caad3491df8b63df5d992f85190b7c08071cd0327bd3bdb2105d5a072": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/memory/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/172fc49caad3491df8b63df5d992f85190b7c08071cd0327bd3bdb2105d5a072: no such file or directory
May 13 01:35:55 ip-10-1-1-1.compute.internal hyperkube[1444]: W0513 01:35:55.047028    1444 raw.go:87] Error while processing event ("/var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/cpu,cpuacct/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/172fc49caad3491df8b63df5d992f85190b7c08071cd0327bd3bdb2105d5a072": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/cpu,cpuacct/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/172fc49caad3491df8b63df5d992f85190b7c08071cd0327bd3bdb2105d5a072: no such file or directory
May 13 01:35:55 ip-10-1-1-1.compute.internal hyperkube[1444]: W0513 01:35:55.049112    1444 raw.go:87] Error while processing event ("/var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/blkio/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/172fc49caad3491df8b63df5d992f85190b7c08071cd0327bd3bdb2105d5a072": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /var/lib/rkt/pods/run/1ef13c00-f574-44b9-a125-c18f6c36e146/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/blkio/kubepods/podfc94a0b2-3717-11e7-b0e3-0aba428f7c7f/172fc49caad3491df8b63df5d992f85190b7c08071cd0327bd3bdb2105d5a072: no such file or directory
May 13 01:35:55 ip-10-1-1-1.compute.internal hyperkube[1444]: I0513 01:35:55.141759    1444 kuberuntime_manager.go:742] checking backoff for container "newrelic-agent" in pod "newrelic-agent-rwwsm_default(fc94a0b2-3717-11e7-b0e3-0aba428f7c7f)"
May 13 01:35:55 ip-10-1-1-1.compute.internal hyperkube[1444]: I0513 01:35:55.141872    1444 kuberuntime_manager.go:752] Back-off 10s restarting failed container=newrelic-agent pod=newrelic-agent-rwwsm_default(fc94a0b2-3717-11e7-b0e3-0aba428f7c7f)
May 13 01:35:55 ip-10-1-1-1.compute.internal hyperkube[1444]: E0513 01:35:55.141935    1444 pod_workers.go:182] Error syncing pod fc94a0b2-3717-11e7-b0e3-0aba428f7c7f ("newrelic-agent-rwwsm_default(fc94a0b2-3717-11e7-b0e3-0aba428f7c7f)"), skipping: failed to "StartContainer" for "newrelic-agent" with CrashLoopBackOff: "Back-off 10s restarting failed container=newrelic-agent pod=newrelic-agent-rwwsm_default(fc94a0b2-3717-11e7-b0e3-0aba428f7c7f)"

Reproduction Steps

  1. Start Container Linux with version 1353.7.0
  2. Start flannel backed by etcd
  3. Start the kubelet after flannel with the unit below

    [Unit]
    Description=Kubelet
    Documentation=https://github.com/kubernetes/kubernetes
    Requires=flanneld.service
    After=flanneld.service
    [Service]
    EnvironmentFile=-/etc/etcd-environment
    ExecStartPre=/usr/bin/wget -N -P /opt/bin https://storage.googleapis.com/kubernetes-release/release/v1.6.3/bin/linux/amd64/hyperkube
    ExecStartPre=/usr/bin/chmod +x /opt/bin/hyperkube
    ExecStartPre=/usr/bin/systemctl is-active flanneld.service
    ExecStartPre=/usr/bin/mkdir -p /var/log/containers
    ExecStart=/opt/bin/hyperkube \
      kubelet \
      --allow-privileged=true \
      --api-servers=https://k8s-api:443 \
      --cloud-provider=aws \
      --cluster-dns=10.100.0.30 \
      --cluster-domain=cluster.local \
      --container-runtime=docker \
      --register-node=true \
      --kubeconfig=/var/lib/kubelet/kubeconfig \
      --tls-cert-file=/etc/kubernetes/ssl/ca.pem \
      --tls-private-key-file=/etc/kubernetes/ssl/key.pem
    Restart=always
    RestartSec=5

Other Information

The temporary fix was to add ExecStop=-/usr/bin/rkt rm --uuid-file=/var/lib/coreos/flannel-wrapper2.uuid as a drop-in to flannel-docker-opts.service.
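
For reference, a minimal sketch of such a drop-in, reusing the ExecStop line above; the file name 10-rkt-rm.conf is only an example, and the uuid-file path is the one flannel-wrapper writes on Container Linux:

    # /etc/systemd/system/flannel-docker-opts.service.d/10-rkt-rm.conf  (example name)
    # Remove the exited rkt pod when the unit stops, so the rkt garbage
    # collector never flags it as "garbage" while the kubelet is still
    # watching cgroup paths underneath it.
    [Service]
    ExecStop=-/usr/bin/rkt rm --uuid-file=/var/lib/coreos/flannel-wrapper2.uuid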

stibi commented 7 years ago

Hi, I just ran into the very same issue. I applied your workaround, 12 hours have passed, and so far everything is running fine. Thanks a million for your effort and awesome debugging skills!

Btw, I'd like to understand the root cause. Does the garbage collector service somehow break flanneld, which in turn kills all pods on a node? How is it fixed by restarting the kubelet? Does that also reload flanneld? I just started digging into it…

matt-duch commented 7 years ago

I can trigger this issue at will, even with this fix enabled. More information at https://github.com/kubernetes/kubernetes/issues/49835

cflee commented 7 years ago

I'm observing similar problems with Container Linux 1465.6.0 on a KVM environment, Kubernetes coreos/hyperkube:v1.6.4_coreos.0, and Flannel coreos/flannel:v0.7.1. I see that at the 12-hour mark the rkt garbage collector kicks in, and the end state is that all non-/pause Docker containers get repeatedly terminated.

However, one config difference is that my kubelet.service does not specify Requires= or After= flanneld.service, and does not have ExecStartPre=/usr/bin/systemctl is-active flanneld.service.

State of the system while things are stuck:

# cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1465.6.0
VERSION_ID=1465.6.0
BUILD_ID=2017-08-16-0012
PRETTY_NAME="Container Linux by CoreOS 1465.6.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
# systemctl list-timers --all
NEXT                         LEFT          LAST                         PASSED  UNIT                         ACTIVATES
Fri 2017-09-08 16:16:06 SGT  1h 27min left Fri 2017-09-08 04:16:06 SGT  10h ago rkt-gc.timer                 rkt-gc.service
Fri 2017-09-08 16:31:06 SGT  1h 42min left Thu 2017-09-07 16:31:06 SGT  22h ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Sat 2017-09-09 00:00:00 SGT  9h left       Fri 2017-09-08 00:00:12 SGT  14h ago logrotate.timer              logrotate.service
n/a                          n/a           n/a                          n/a     update-engine-stub.timer     update-engine-stub.service

4 timers listed.
# rkt list --full
UUID                    APP     IMAGE NAME                  IMAGE ID        STATE       CREATED     STARTED             NETWORKS
15c22907-7307-4270-a260-c2fbd4b60f3d    hyperkube   quay.io/coreos/hyperkube:v1.6.4_coreos.0    sha512-3b5de58b2f2d exited garbage  2017-09-07 16:09:45 +0800 SGT   2017-09-07 16:09:45 +0800 SGT
4ee5856a-a8f8-4a61-8c5d-305382f2e888    hyperkube   quay.io/coreos/hyperkube:v1.6.4_coreos.0    sha512-3b5de58b2f2d running     2017-09-07 16:16:06 +0800 SGT   2017-09-07 16:16:06 +0800 SGT
929aa77f-0e48-4c57-8ee6-d1a86164911a    flannel     quay.io/coreos/flannel:v0.7.1           sha512-f5a1ca0ed2aa running     2017-09-07 16:16:08 +0800 SGT   2017-09-07 16:16:08 +0800 SGT
95b6db3e-cb76-4bf2-aec4-a8e99cccb74d    flannel     quay.io/coreos/flannel:v0.7.1           sha512-f5a1ca0ed2aa exited garbage  2017-09-07 16:16:09 +0800 SGT   2017-09-07 16:16:09 +0800 SGT
c6233d72-6b24-4cf9-a39f-d42d9f69503b    etcd        quay.io/coreos/etcd:v3.1.8          sha512-b3659e21a58d running     2017-09-07 16:16:06 +0800 SGT   2017-09-07 16:16:06 +0800 SGT

Also, I do not see many QoS-related errors; only this one appears, very infrequently:

W0906 20:35:12.290371     737 qos_container_manager_linux.go:133] [ContainerManager] Failed to reserve QoS requests: failed to set supported cgroup subsystems for cgroup /kubepods/burstable: Failed to set config for supported subsystems : failed to write 2032 to cpu.shares: open /var/lib/rkt/pods/run/53f9dbbf-c59e-40df-8baa-d9387f09f0c0/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/cpu.shares: no such file or directory
cflee commented 7 years ago

Update: I'm able to replicate this by rebooting my Container Linux nodes, waiting for them to come up, and then running rkt gc. This merely moves the exited flannel pod to exited garbage without actually removing it, since the grace period of 12h hasn't been reached, but that's enough to trigger all the Docker containers to start crashing continuously. Trying to run rkt gc --grace-period=0s at this point fails.

If I just systemctl restart flanneld at this point, nothing gets fixed. After systemctl restart kubelet, the Docker containers come back up (and manage to stay up?), and then rkt gc --grace-period=0s can complete.

When I roll back Container Linux to 1353.7.0 (these nodes were recently upgraded from that to 1465.6.0, just before 1465.7.0 was released), I can't replicate this problem using the above steps. I would try to bisect the stable releases to determine when this started, but it seems extremely inconvenient to reverse-engineer update_engine just to do that.
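
For anyone trying to reproduce this, the sequence above as shell commands (run as root on a node; this is just a transcript of the steps described, not a new procedure):

    rkt gc                       # moves the exited flannel pod to "exited garbage"; Docker containers start crash-looping
    systemctl restart flanneld   # does not fix anything on its own
    systemctl restart kubelet    # Docker containers come back up
    rkt gc --grace-period=0s     # only now completes and removes the garbage pod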

spunon commented 7 years ago

I can confirm this exact issue exists in CoreOS 1409.7.0 as well

cflee commented 7 years ago

After some further testing, I think our case is a bit different from @oba11's. We run the kubelet inside rkt and mount /var/lib/rkt into the kubelet rkt pod.

The workaround we found was to remove that mount (it's unnecessary since we're using --container-runtime=docker); with it gone, the rkt gc run that moves the exited flannel-docker-opts rkt pod from exited to exited garbage completes normally without causing the CrashLoop.
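
For illustration only, a sketch of what removing that mount might look like, assuming the kubelet is launched via the CoreOS kubelet-wrapper and the mount was passed in through the RKT_RUN_ARGS environment variable; the drop-in name and the exact flags are assumptions, not details from this setup:

    # /etc/systemd/system/kubelet.service.d/10-rkt-args.conf  (hypothetical name)
    # Before: RKT_RUN_ARGS contained a /var/lib/rkt volume/mount pair, e.g.
    #   --volume rkt,kind=host,source=/var/lib/rkt --mount volume=rkt,target=/var/lib/rkt
    # After: drop that pair entirely; with --container-runtime=docker the kubelet
    # does not need visibility into rkt's own pod directories.
    [Service]
    Environment="RKT_RUN_ARGS="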

wladh commented 6 years ago

I found the likely root cause of this and I've updated kubernetes/kubernetes#49835 .

euank commented 6 years ago

I'm hopeful that https://github.com/google/cadvisor/pull/1792 will fix this. I haven't done manual testing yet to be more confident.

davide-talesco commented 6 years ago

Still experiencing the same issue as @oba11 with the following environment:

staging-worker-0 Ready 3m v1.9.0 192.168.2.21 Container Linux by CoreOS 1520.9.0 (Ladybug) 4.13.16-coreos-r1

The workaround of deleting the exited rkt pods created by flannel-docker-opts.service seems to work, but I was wondering whether there has been a proper upstream fix for this?