Closed: Bregor closed this issue 7 years ago.
Same issue here. GKE 1.6.0
Additional context from Slack, which on first impression appears to be a Kubernetes issue:
deis 2017-04-07 12:11:36 +0300 MSK 2017-04-07 12:11:36 +0300 MSK 1 slugbuild-portby-staging-ba3e805b-411a9f50 Pod Warning FailedMount kubelet, gke-usa-production-himem-f4630381-06z4 Unable to mount volumes for pod "slugbuild-portby-staging-ba3e805b-411a9f50_deis(d873a412-1b71-11e7-9a69-42010a800009)": timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-portby-staging-ba3e805b-411a9f50". list of unattached/unmounted volumes=[objectstorage-keyfile portby-staging-build-env default-token-60nnj]
@Bregor can you check and see if those secrets exist? Not sure what default-token-60nnj is.
@bacongobbler this message appears only after the deploy has timed out (if it matters).
@bacongobbler default-token-60nnj is present.
BTW, is there any intersection with this issue: https://github.com/deis/workflow/issues/372?
It's possible, yes. Neither @mboersma nor I have had issues pushing apps when testing v2.13.0 on Azure or Minikube with k8s v1.6.1.
Any updates on this issue? It's preventing me from trying out Deis, as our cluster is on 1.6 and, due to separate bugs, downgrading Kubernetes from 1.6 to 1.5.6 is also broken.
If it's something on the Kubernetes side of things, there is nothing we can do other than file an issue with GKE. Have you tried other cloud platforms?
Same here on bare metal. BTW, @bacongobbler, I can set up an empty bare-metal cluster with 1.6 and Workflow and give you root access, if you want to poke at it personally.
Just confirming: I was eventually able to reproduce this issue on Kubernetes 1.6.0 on GKE and on minikube, with both Workflow v2.12.0 and v2.13.0. It doesn't happen on Kubernetes v1.5.3.
I'm seeing the same timeout behavior with Kubernetes 1.6.1 on GKE as well. 😞
6m 6m 1 kubelet, gke-mb-test-160-default-pool-184f6e14-zp8s spec.containers{deis-slugbuilder} Normal Started Started container with id 41ecbdd3e8f9610f1ae1269ba0f892347f5a11898a380b11ccbfa11fc4c48d56
4m 4m 1 kubelet, gke-mb-test-160-default-pool-184f6e14-zp8s Warning FailedMount Unable to mount volumes for pod "slugbuild-fabled-elephant-e91bdc46-064c886d_deis(6713ce2a-2462-11e7-8fd4-42010af001d4)": timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-fabled-elephant-e91bdc46-064c886d". list of unattached/unmounted volumes=[objectstorage-keyfile fabled-elephant-build-env default-token-tcl18]
4m 4m 1 kubelet, gke-mb-test-160-default-pool-184f6e14-zp8s Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-fabled-elephant-e91bdc46-064c886d". list of unattached/unmounted volumes=[objectstorage-keyfile fabled-elephant-build-env default-token-tcl18]
Sadly, this looks like a regression in Kubernetes similar to deis/workflow#372. (Does Kubernetes even have tests involving real-world charts?) For now, I'm going to warn users against installing Kubernetes 1.6.x until this bug is fixed, but if anyone arrives at a workaround we would love to hear about it.
It appears that kubectl logs --follow is not terminating when the pod exits, and so the builder is hanging here: https://github.com/deis/builder/blob/85725b2a8cdee3ad3a9967279150b85c1904ab25/pkg/gitreceive/build.go#L268
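Roughly, the code in question boils down to this pattern (a simplified sketch, not the exact builder code; streamBuildLogs and getLogStream are illustrative names):

package gitreceive

import "io"

// streamBuildLogs copies the slugbuilder pod's log output to w. getLogStream
// is an illustrative stand-in for however the follow-mode stream is obtained
// (the stream behind kubectl logs --follow, or its API equivalent).
func streamBuildLogs(w io.Writer, getLogStream func() (io.ReadCloser, error)) error {
	rc, err := getLogStream()
	if err != nil {
		return err
	}
	defer rc.Close()
	// On k8s 1.6.0/1.6.1 the stream apparently never reaches io.EOF after
	// the pod exits, so this io.Copy blocks and the deploy eventually times out.
	_, err = io.Copy(w, rc)
	return err
}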
https://github.com/kubernetes/kubernetes/issues/43515 appears somewhat relevant, though that is more geared towards PV mounts, not secrets.
Would it be possible to run waitForPodEnd in a goroutine and close rc when it finishes? That ought to cause io.Copy to return.
Or, simpler yet, just run io.Copy in a goroutine.
The pod uploads a slug and other metadata to the object store after it is finished, which further steps below rely upon, so I don't think that's a solution. Additionally, running io.Copy in a goroutine would jumble build logs with other steps occurring during the build stage, and it would still leak memory because the goroutine would never finish (assuming that's the issue).
I think we should identify and confirm the root issue first, then come up with a good workaround or file an issue upstream. We can then patch the builder ourselves, or patch Kubernetes (if we find that we can't work around it ourselves) and wait for an upstream release with the fix, asking users to stick with 1.5.3 in the meantime.
Also to confirm, I hit this bug as well on minikube with k8s v1.6.0 and Workflow v2.13.0. From minikube logs:
Apr 19 17:28:11 minikube localkube[3779]: E0419 17:28:11.242706 3779 kubelet.go:1549] Unable to mount volumes for pod "slugbuild-go-e91bdc46-86c1ffb7_deis(fe13c4a4-2524-11e7-b1e2-080027ce5f69)": timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-go-e91bdc46-86c1ffb7". list of unattached/unmounted volumes=[objectstorage-keyfile go-build-env default-token-pxkb7]; skipping pod
Apr 19 17:28:11 minikube localkube[3779]: E0419 17:28:11.242887 3779 pod_workers.go:182] Error syncing pod fe13c4a4-2524-11e7-b1e2-080027ce5f69 ("slugbuild-go-e91bdc46-86c1ffb7_deis(fe13c4a4-2524-11e7-b1e2-080027ce5f69)"), skipping: timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-go-e91bdc46-86c1ffb7". list of unattached/unmounted volumes=[objectstorage-keyfile go-build-env default-token-pxkb7]
Some users have reported that it's good to enable more verbose logging for the kubelet to see why the volumes are failing to attach, so I'll try that next.
A bit more verbose information. Notice the docker_sandbox warnings indicating that it can't find the network status for the pod:
Apr 19 17:24:06 minikube localkube[3779]: I0419 17:24:06.991874 3779 event.go:217] Event(v1.ObjectReference{Kind:"Pod", Namespace:"deis", Name:"slugbuild-go-e91bdc46-86c1ffb7", UID:"fe13c4a4-2524-11e7-b1e2-080027ce5f69", APIVersion:"v1", ResourceVersion:"6619", FieldPath:""}): type: 'Normal' reason: 'Scheduled' Successfully assigned slugbuild-go-e91bdc46-86c1ffb7 to minikube
Apr 19 17:24:07 minikube localkube[3779]: W0419 17:24:07.472401 3779 docker_sandbox.go:263] Couldn't find network status for deis/slugbuild-go-e91bdc46-86c1ffb7 through plugin: invalid network status for
Apr 19 17:24:08 minikube localkube[3779]: W0419 17:24:08.204843 3779 docker_sandbox.go:263] Couldn't find network status for deis/slugbuild-go-e91bdc46-86c1ffb7 through plugin: invalid network status for
Apr 19 17:25:40 minikube localkube[3779]: W0419 17:25:40.035329 3779 docker_sandbox.go:263] Couldn't find network status for deis/slugbuild-go-e91bdc46-86c1ffb7 through plugin: invalid network status for
Apr 19 17:26:11 minikube localkube[3779]: W0419 17:26:11.239871 3779 docker_sandbox.go:263] Couldn't find network status for deis/slugbuild-go-e91bdc46-86c1ffb7 through plugin: invalid network status for
Apr 19 17:26:12 minikube localkube[3779]: W0419 17:26:12.108681 3779 docker_sandbox.go:263] Couldn't find network status for deis/slugbuild-go-e91bdc46-86c1ffb7 through plugin: invalid network status for
Apr 19 17:26:12 minikube localkube[3779]: W0419 17:26:12.247901 3779 docker_sandbox.go:263] Couldn't find network status for deis/slugbuild-go-e91bdc46-86c1ffb7 through plugin: invalid network status for
Apr 19 17:28:11 minikube localkube[3779]: E0419 17:28:11.242706 3779 kubelet.go:1549] Unable to mount volumes for pod "slugbuild-go-e91bdc46-86c1ffb7_deis(fe13c4a4-2524-11e7-b1e2-080027ce5f69)": timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-go-e91bdc46-86c1ffb7". list of unattached/unmounted volumes=[objectstorage-keyfile go-build-env default-token-pxkb7]; skipping pod
Apr 19 17:28:11 minikube localkube[3779]: E0419 17:28:11.242887 3779 pod_workers.go:182] Error syncing pod fe13c4a4-2524-11e7-b1e2-080027ce5f69 ("slugbuild-go-e91bdc46-86c1ffb7_deis(fe13c4a4-2524-11e7-b1e2-080027ce5f69)"), skipping: timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-go-e91bdc46-86c1ffb7". list of unattached/unmounted volumes=[objectstorage-keyfile go-build-env default-token-pxkb7]
related: https://github.com/kubernetes/kubernetes/issues/43988
The only parts that would run simultaneously are the io.Copy and the waitForPodEnd. They are actually triggering on the same condition - the container exiting - but right now the io.Copy isn't automatically aware of that condition. Once the pod is detected as having exited, it's safe to stop reading from rc because there's nothing writing to it.
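A minimal sketch of that idea, assuming a helper along the lines of the builder's waitForPodEnd (the names here are illustrative, not the exact builder API):

package gitreceive

import "io"

// streamUntilPodEnd copies the log stream rc to w and closes rc once
// waitForPodEnd (an illustrative stand-in for the builder's existing helper)
// reports that the build pod has terminated. Closing rc forces the blocked
// io.Copy to return even if the apiserver never sends io.EOF.
func streamUntilPodEnd(w io.Writer, rc io.ReadCloser, waitForPodEnd func()) error {
	go func() {
		waitForPodEnd()
		rc.Close() // unblocks the io.Copy below
	}()
	_, err := io.Copy(w, rc)
	// A "read on closed" style error here is the expected way the copy ends
	// when rc is closed from the goroutine, not a real build failure.
	return err
}

Whether that close error should be swallowed or surfaced depends on how the surrounding builder code treats log-streaming failures.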
Relatedly, I just refactored the entire k8s client library usage to use k8s.io/client-go, bumping us up to the latest and greatest client libs; we were using v1.2.4 before. That may or may not help with the root issue, but it certainly doesn't hurt to try.
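For reference, with client-go the follow-mode log stream is obtained roughly like this (a sketch only; import paths and the Stream signature vary between client-go versions, and older releases expose Stream without a context argument):

package gitreceive

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// podLogStream returns a follow-mode log stream for the named pod; the
// returned io.ReadCloser is what the builder's io.Copy reads from.
func podLogStream(ctx context.Context, cs kubernetes.Interface, namespace, pod string) (io.ReadCloser, error) {
	req := cs.CoreV1().Pods(namespace).GetLogs(pod, &corev1.PodLogOptions{Follow: true})
	return req.Stream(ctx)
}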
If io.Copy isn't seeing io.EOF when fetching pod logs on k8s v1.6, then I'm going to assume it's a client regression. This code works on v1.5.3, as @mboersma previously pointed out, so there's a regression somewhere upstream... it's just a matter of figuring out why io.Copy isn't seeing io.EOF.
I still want to figure out the root cause before we start refactoring the original code's behaviour and risk introducing more regressions.
https://github.com/kubernetes/kubernetes/pull/44406 seems to be the relevant fix which made it into v1.6.2. If anyone has time to test v1.6.2, that would be helpful.
I was able to verify that this has been fixed in Kubernetes v1.6.2. I'd highly suggest that everyone on Kubernetes 1.6 upgrade to the latest patch release.
-----> Compiled slug size is 1.9M
Build complete.
Launching App...
...
Done, go:v2 deployed to Workflow
$ curl go.fishr.pw
Powered by Deis
Release v2 on go-web-1961603935-xz07w
Not sure which repo is appropriate for this issue, so I'll leave it here ;)
Environment:
DC:
Logs:
Nothing happens after it. A timeout is indicated after five minutes of waiting with:
As a possible solution, I tried granting extra rights to all deis-related ServiceAccounts with the following ClusterRoleBinding, without any success: