coreos / tectonic-forum


failed to open log file "/var/log/pods/***/" no such file or directory #183

Open lwolf opened 7 years ago

lwolf commented 7 years ago

Issue Report Template

Tectonic Version

Tectonic 1.7.1-tectonic.2
Kubernetes 1.7.1+tectonic.1
CoreOS 1506.0.0 (Alpha channel)

Environment

Bare-metal cluster on multiple ESXi hosts

Expected Behavior

kubectl logs -f POD should tail the pod's logs

Actual Behavior

kubectl logs -f kube-proxy-d9zs0 --namespace=kube-system
failed to open log file "/var/log/pods/e3aa23d9-7f24-11e7-afb8-000c29adc3ab/kube-proxy_0.log": open /var/log/pods/e3aa23d9-7f24-11e7-afb8-000c29adc3ab/kube-proxy_0.log: no such file or directory%

After a node reboot, a lot of containers have this problem. Killing the container or restarting docker itself fixes it until the next reboot.
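For anyone hitting the same symptom, a quick way to confirm this failure mode is to look for dangling symlinks under /var/log/pods on the affected node. The sketch below is illustrative only (it is not part of Tectonic or the kubelet), assumes the default /var/log/pods layout, and has to run on the node itself.

// danglingsyms.go: list symlinks under /var/log/pods whose targets no longer
// exist. A dangling symlink here is exactly what makes kubectl logs fail with
// "no such file or directory".
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	root := "/var/log/pods"
	filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return nil // skip unreadable entries
		}
		if info.Mode()&os.ModeSymlink == 0 {
			return nil // only interested in symlinks
		}
		// os.Stat follows the link; an error means the target is gone.
		if _, statErr := os.Stat(path); statErr != nil {
			target, _ := os.Readlink(path)
			fmt.Printf("dangling: %s -> %s\n", path, target)
		}
		return nil
	})
}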

kbrwn commented 7 years ago

Are there any scheduling failures shown in kubectl get events? It is likely that docker is not creating the symlinks needed for kubectl logs to function properly. This snippet has a comment explaining why the failure to create this symlink is ignored:

// StartContainer starts the container.
func (ds *dockerService) StartContainer(containerID string) error {
    err := ds.client.StartContainer(containerID)
    if err != nil {
        return fmt.Errorf("failed to start container %q: %v", containerID, err)
    }
    // Create container log symlink.
    if err := ds.createContainerLogSymlink(containerID); err != nil {
        // Do not stop the container if we failed to create symlink because:
        //   1. This is not a critical failure.
        //   2. We don't have enough information to properly stop container here.
        // Kubelet will surface this error to user via an event.
        return err
    }
    return nil
}
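The path in the kubectl error at the top of this issue follows the pattern /var/log/pods/<podUID>/<containerName>_<restartCount>.log, and kubectl logs simply opens that symlink on the node. A minimal illustration of how such a path is assembled (logSymlinkPath is a made-up helper for this example, not the kubelet's own function):

package main

import "fmt"

// logSymlinkPath shows how the /var/log/pods paths seen in the errors above
// are put together from the pod UID, container name, and restart count.
func logSymlinkPath(podUID, containerName string, restartCount int) string {
	return fmt.Sprintf("/var/log/pods/%s/%s_%d.log", podUID, containerName, restartCount)
}

func main() {
	// Reproduces the path from the kubectl logs error in the issue description.
	fmt.Println(logSymlinkPath("e3aa23d9-7f24-11e7-afb8-000c29adc3ab", "kube-proxy", 0))
}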
lwolf commented 7 years ago

You are right, that is exactly the problem. I can see these errors on the nodes.

Aug 31 12:33:31 master-01 kubelet-wrapper[792]: E0831 12:33:31.572371     792 pod_workers.go:182] Error syncing pod 0a051756-8d6c-11e7-b698-000c29bb38fc ("kube-apiserver-vswlm_kube-system(0a051756-8d6c-11e7-b698-000c29bb38fc)"), skipping: failed to "StartContainer" for "kube-apiserver" with rpc error: code = 2 desc = failed to create symbolic link "/var/log/pods/0a051756-8d6c-11e7-b698-000c29bb38fc/kube-apiserver_0.log" to the container log file "/var/lib/docker/containers/2ae7f627ec78bd4160a9b16d88b7985dc72b78421e53c05b471c7735adbb7b02/2ae7f627ec78bd4160a9b16d88b7985dc72b78421e53c05b471c7735adbb7b02-json.log" for container "2ae7f627ec78bd4160a9b16d88b7985dc72b78421e53c05b471c7735adbb7b02": symlink /var/lib/docker/containers/2ae7f627ec78bd4160a9b16d88b7985dc72b78421e53c05b471c7735adbb7b02/2ae7f627ec78bd4160a9b16d88b7985dc72b78421e53c05b471c7735adbb7b02-json.log /var/log/pods/0a051756-8d6c-11e7-b698-000c29bb38fc/kube-apiserver_0.log: file exists: "Start Container Failed"

I didn't see any events about scheduling errors, but I sometimes see events about timeouts for random pods. I don't know whether that is related.

Liveness probe failed: Get http://10.2.4.112:80/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Is there any way to understand why this is happening?

kbrwn commented 7 years ago

@lwolf We should try to narrow down whether this is an issue with k8s or with the docker daemon. Does the docker logs command work on a container where kubectl logs does not?

lwolf commented 7 years ago

@kbrwn Yes, it does

kubectl logs -f kube-apiserver-vswlm --namespace=kube-system
failed to open log file "/var/log/pods/0a051756-8d6c-11e7-b698-000c29bb38fc/kube-apiserver_0.log": open /var/log/pods/0a051756-8d6c-11e7-b698-000c29bb38fc/kube-apiserver_0.log: no such file or directory%
# docker ps | grep kube-apiserver-vswlm
2ae7f627ec78        quay.io/coreos/hyperkube                         "/usr/bin/flock --..."   5 days ago          Up 5 days                               k8s_kube-apiserver_kube-apiserver-vswlm_kube-system_0a051756-8d6c-11e7-b698-000c29bb38fc_0
f2ede603ddcf        gcr.io/google_containers/pause-amd64:3.0         "/pause"                 5 days ago          Up 5 days                               k8s_POD_kube-apiserver-vswlm_kube-system_0a051756-8d6c-11e7-b698-000c29bb38fc_0

# docker logs --tail 5 2ae7f627ec78
W0906 10:35:51.806632       9 controller.go:386] Resetting endpoints for master service "kubernetes" to &{{ } {kubernetes  default /api/v1/namespaces/default/endpoints/kubernetes 8bd53a59-6808-11e7-807d-000c29adc3ab 69524234 0 2017-07-13 20:19:17.312366819 +0000 UTC <nil> <nil> map[] map[] [] nil [] } [{[{192.168.1.2  <nil> <nil>}] [] [{https 443 TCP}]}]}
W0906 10:36:01.821348       9 controller.go:386] Resetting endpoints for master service "kubernetes" to &{{ } {kubernetes  default /api/v1/namespaces/default/endpoints/kubernetes 8bd53a59-6808-11e7-807d-000c29adc3ab 69524266 0 2017-07-13 20:19:17.312366819 +0000 UTC <nil> <nil> map[] map[] [] nil [] } [{[{192.168.1.2  <nil> <nil>}] [] [{https 443 TCP}]}]}
W0906 10:36:11.834061       9 controller.go:386] Resetting endpoints for master service "kubernetes" to &{{ } {kubernetes  default /api/v1/namespaces/default/endpoints/kubernetes 8bd53a59-6808-11e7-807d-000c29adc3ab 69524301 0 2017-07-13 20:19:17.312366819 +0000 UTC <nil> <nil> map[] map[] [] nil [] } [{[{192.168.1.2  <nil> <nil>}] [] [{https 443 TCP}]}]}
W0906 10:36:21.853665       9 controller.go:386] Resetting endpoints for master service "kubernetes" to &{{ } {kubernetes  default /api/v1/namespaces/default/endpoints/kubernetes 8bd53a59-6808-11e7-807d-000c29adc3ab 69524336 0 2017-07-13 20:19:17.312366819 +0000 UTC <nil> <nil> map[] map[] [] nil [] } [{[{192.168.1.2  <nil> <nil>}] [] [{https 443 TCP}]}]}
W0906 10:36:31.868002       9 controller.go:386] Resetting endpoints for master service "kubernetes" to &{{ } {kubernetes  default /api/v1/namespaces/default/endpoints/kubernetes 8bd53a59-6808-11e7-807d-000c29adc3ab 69524367 0 2017-07-13 20:19:17.312366819 +0000 UTC <nil> <nil> map[] map[] [] nil [] } [{[{192.168.1.2  <nil> <nil>}] [] [{https 443 TCP}]}]}
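So the docker json log exists, but the /var/log/pods symlink is stale (left over from before the reboot) and points at a log file that is gone. Below is a rough manual-repair sketch, illustrative only: it assumes docker's json-file log driver and the default /var/lib/docker root, reuses the container ID and symlink path from the kube-apiserver errors above, and needs to run as root on the node. Restarting docker or killing the container has the same effect, since the kubelet recreates the symlink when it starts the container (see the StartContainer snippet above).

// relink.go: remove a stale /var/log/pods symlink and point it at the running
// container's json log so kubectl logs can read it again.
package main

import (
	"fmt"
	"log"
	"os"
)

func relink(containerID, symlinkPath string) error {
	// Default json-file log location for a docker container.
	target := fmt.Sprintf("/var/lib/docker/containers/%s/%s-json.log", containerID, containerID)
	if _, err := os.Stat(target); err != nil {
		return fmt.Errorf("container log %s not readable: %v", target, err)
	}
	// Remove a stale link if present; ignore "does not exist".
	if err := os.Remove(symlinkPath); err != nil && !os.IsNotExist(err) {
		return err
	}
	return os.Symlink(target, symlinkPath)
}

func main() {
	// Container ID and symlink path taken from the errors earlier in this thread.
	err := relink(
		"2ae7f627ec78bd4160a9b16d88b7985dc72b78421e53c05b471c7735adbb7b02",
		"/var/log/pods/0a051756-8d6c-11e7-b698-000c29bb38fc/kube-apiserver_0.log",
	)
	if err != nil {
		log.Fatal(err)
	}
}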
derekperkins commented 6 years ago

see https://github.com/kubernetes/kubernetes/issues/45911

lrepolho commented 6 years ago

Maybe you're here because you made the same mistake as I did: my image's ENTRYPOINT pointed at a file that didn't exist. I discovered that by running kubectl describe pod <podname>.

seubert commented 6 years ago

@lrepolho THANK YOU. Saved me a lot of troubleshooting.