kubernetes-1.6: build is never ends

Bregor commented 7 years ago

Not sure which repo is appropriate for this issue, so will leave it here ;)

Environment:

$ kubectl version --short 
Client Version: v1.6.1
Server Version: v1.6.1

$ helm list
NAME                    REVISION        UPDATED                         STATUS          CHART                           NAMESPACE
deis                    2               Fri Apr  7 12:00:58 2017        DEPLOYED        workflow-v2.13.0                deis

DC:

GKE
Bare Metal

Logs:

$ stern -t -n deis 'builder|controller'

deis-builder 2017-04-07T14:04:31.785381665Z Accepted connection.
deis-builder 2017-04-07T14:04:32.948655499Z Starting ssh authentication
deis-controller 2017-04-07T14:04:33.011059142Z 10.104.4.62 "GET /v2/hooks/key/f5:99:9d:d8:61:44:d4:7b:df:e2:34:a1:c7:c9:7e:26 HTTP/1.1" 200 88 "deis-builder"
deis-builder 2017-04-07T14:04:33.361007425Z Channel type: session
deis-builder 2017-04-07T14:04:33.361058635Z
deis-builder 2017-04-07T14:04:33.529354005Z Received request of type auth-agent-req@openssh.com
deis-builder 2017-04-07T14:04:33.529402743Z
deis-builder 2017-04-07T14:04:33.529466321Z Key='LANG', Value='ru_RU.UTF-8'
deis-builder 2017-04-07T14:04:33.529517263Z
deis-builder 2017-04-07T14:04:33.529884481Z receiving git repo name: portby-staging-test.git, operation: git-receive-pack, fingerprint: f5:99:9d:d8:61:44:d4:7b:df:e2:3:a1:c7:c9:7e:26, user: vkfont
deis-builder 2017-04-07T14:04:33.529921485Z creating repo directory /home/git/portby-staging-test.git
deis-builder 2017-04-07T14:04:33.569422888Z writing pre-receive hook under /home/git/portby-staging-test.git
deis-builder 2017-04-07T14:04:33.5697572Z git-shell -c git-receive-pack 'portby-staging-test.git'
deis-builder 2017-04-07T14:04:35.678336614Z Waiting for git-receive to run.
deis-builder 2017-04-07T14:04:35.678382852Z Waiting for deploy.
deis-controller 2017-04-07T14:04:36.22369714Z 10.104.4.62 "POST /v2/hooks/config/ HTTP/1.1" 200 526 "deis-builder"

$ git push deis master

-----> Build succeeded!
-----> Discovering process types
       Default process types for Node.js -> web
-----> Checking for changes inside the cache directory...
       Files inside cache folder changed, uploading new cache...
       Done: Uploaded cache (15M)
-----> Compiled slug size is 33M

Nothing happens after it. Timeout indicates with

---> ---> ---> ---> ---> ---> ---> ---> ---> [ERROR] Failed git receive: portby-staging lock exceeded timeout

after five minutes of waiting.

As possible solution I tried to add exclusive rights for all deis-related ServiceAccounts with following clusterrolebinding without any success:

kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: deis:destroy-all-the-things
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: default
  namespace: deis
- kind: ServiceAccount
  name: deis-builder
  namespace: deis
- kind: ServiceAccount
  name: deis-controller
  namespace: deis
- kind: ServiceAccount
  name: deis-database
  namespace: deis
- kind: ServiceAccount
  name: deis-logger
  namespace: deis
- kind: ServiceAccount
  name: deis-logger-fluentd
  namespace: deis
- kind: ServiceAccount
  name: deis-monitor-telegraf
  namespace: deis
- kind: ServiceAccount
  name: deis-nsqd
  namespace: deis
- kind: ServiceAccount
  name: deis-registry
  namespace: deis
- kind: ServiceAccount
  name: deis-workflow-manager
  namespace: deis

afoninsky commented 7 years ago

Same issue here. GKE 1.6.0

bacongobbler commented 7 years ago

Additional context from slack, which on first impression appears to be a kubernetes issue:

deis      2017-04-07 12:11:36 +0300 MSK   2017-04-07 12:11:36 +0300 MSK   1         slugbuild-portby-staging-ba3e805b-411a9f50   Pod                 Warning   FailedMount   kubelet, gke-usa-production-himem-f4630381-06z4   Unable to mount volumes for pod "slugbuild-portby-staging-ba3e805b-411a9f50_deis(d873a412-1b71-11e7-9a69-42010a800009)": timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-portby-staging-ba3e805b-411a9f50". list of unattached/unmounted volumes=[objectstorage-keyfile portby-staging-build-env default-token-60nnj]

@Bregor can you check and see if those secrets exist? Not sure what default-token-60nnj is

Bregor commented 7 years ago

@bacongobbler this message appears only after deploy timed out (if it matters).

Bregor commented 7 years ago

@bacongobbler default-token-60nnj is present.

Bregor commented 7 years ago

BTW, is there any intersections with this issue: https://github.com/deis/workflow/issues/372?

bacongobbler commented 7 years ago

It's possible, yes. Both @mboersma and myself haven't had issues pushing apps when testing v2.13.0 on Azure or Minikube on k8s v1.6.1.

jordan-thoms commented 7 years ago

Any updates on this issue? It's preventing me from trying out deis, as our cluster is on 1.6 and due to separate bugs the downgrade of kubernetes from 1.6 to 1.5.6 is also broken.

bacongobbler commented 7 years ago

If it's something on the kubernetes side of things, there is nothing we can do other than supply an issue to GKE. Have you tried on other cloud platforms?

Bregor commented 7 years ago

Same here on bare metal. BTW, @bacongobbler I can make empty bare metal setup with 1.6 and workflow and provide root access for you, if you want to touch it personally.

mboersma commented 7 years ago

Just confirming: I was eventually able to reproduce this issue on Kubernetes 1.6.0 on GKE and on minikube with both Workflow v2.12.0 and v2.13.0. Doesn't happen on Kubernetes v1.5.3.

mboersma commented 7 years ago

I'm seeing the same timeout behavior with Kubernetes 1.6.1 on GKE as well. 😞

mboersma commented 7 years ago

  6m        6m      1   kubelet, gke-mb-test-160-default-pool-184f6e14-zp8s spec.containers{deis-slugbuilder}   Normal      Started     Started container with id 41ecbdd3e8f9610f1ae1269ba0f892347f5a11898a380b11ccbfa11fc4c48d56
  4m        4m      1   kubelet, gke-mb-test-160-default-pool-184f6e14-zp8s                     Warning     FailedMount Unable to mount volumes for pod "slugbuild-fabled-elephant-e91bdc46-064c886d_deis(6713ce2a-2462-11e7-8fd4-42010af001d4)": timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-fabled-elephant-e91bdc46-064c886d". list of unattached/unmounted volumes=[objectstorage-keyfile fabled-elephant-build-env default-token-tcl18]
  4m        4m      1   kubelet, gke-mb-test-160-default-pool-184f6e14-zp8s                     Warning     FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-fabled-elephant-e91bdc46-064c886d". list of unattached/unmounted volumes=[objectstorage-keyfile fabled-elephant-build-env default-token-tcl18]

Sadly, this looks like a regression in Kubernetes similar to deis/workflow#372. (Does Kubernetes even have tests involving real-world charts?) For now, I'm going to warn users against installing Kubernetes 1.6.x until this bug is fixed, but if anyone arrives at a workaround we would love to hear about it.

sionide21 commented 7 years ago

It appears that kubectl logs --follow is not terminating when the pod exits and so the builder is hanging here: https://github.com/deis/builder/blob/85725b2a8cdee3ad3a9967279150b85c1904ab25/pkg/gitreceive/build.go#L268

bacongobbler commented 7 years ago

https://github.com/kubernetes/kubernetes/issues/43515 appears somewhat relevant, though that is more geared towards PV mounts, not secrets.

sionide21 commented 7 years ago

Would it be possible to run waitForPodEnd in a goroutine and close rc when it finishes?

That ought to cause io.Copy to return.

sionide21 commented 7 years ago

Or, simpler yet, just run io.Copy in a goroutine

bacongobbler commented 7 years ago

The pod uploads a slug and other metadata to the object store after it is finished, which further steps down below rely upon so I don't think that's a solution. Additionally, running io.Copy in a goroutine would jumble build logs with other steps occurring during the build stage, and even still would cause memory leaks because the goroutine would never finish (assuming that's the issue).

I think we should identify and confirm the root issue first, then come up with a good workaround or file an issue upstream. We then can come up with a patch ourselves for the builder or patch kubernetes (if it's found that we can't work around it ourselves) and wait for an upstream release with the patch, asking users to stick with 1.5.3 for the time being.

bacongobbler commented 7 years ago

Also to confirm I hit this bug as well on minikube w/ k8s v1.6.0 and v2.13.0. Using minikube logs:

Apr 19 17:28:11 minikube localkube[3779]: E0419 17:28:11.242706    3779 kubelet.go:1549] Unable to mount volumes for pod "slugbuild-go-e91bdc46-86c1ffb7_deis(fe13c4a4-2524-11e7-b1e2-080027ce5f69)": timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-go-e91bdc46-86c1ffb7". list of unattached/unmounted volumes=[objectstorage-keyfile go-build-env default-token-pxkb7]; skipping pod
Apr 19 17:28:11 minikube localkube[3779]: E0419 17:28:11.242887    3779 pod_workers.go:182] Error syncing pod fe13c4a4-2524-11e7-b1e2-080027ce5f69 ("slugbuild-go-e91bdc46-86c1ffb7_deis(fe13c4a4-2524-11e7-b1e2-080027ce5f69)"), skipping: timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-go-e91bdc46-86c1ffb7". list of unattached/unmounted volumes=[objectstorage-keyfile go-build-env default-token-pxkb7]

Some users have reported that it's good to enable more verbose logging for the kubelet to see why the volumes are failing to attach, so I'll try that next.

bacongobbler commented 7 years ago

Bit more verbose information. Notice the docker_sandbox errors indicating that it can't find the network status for the pod:

Apr 19 17:24:06 minikube localkube[3779]: I0419 17:24:06.991874    3779 event.go:217] Event(v1.ObjectReference{Kind:"Pod", Namespace:"deis", Name:"slugbuild-go-e91bdc46-86c1ffb7", UID:"fe13c4a4-2524-11e7-b1e2-080027ce5f69", APIVersion:"v1", ResourceVersion:"6619", FieldPath:""}): type: 'Normal' reason: 'Scheduled' Successfully assigned slugbuild-go-e91bdc46-86c1ffb7 to minikube
Apr 19 17:24:07 minikube localkube[3779]: W0419 17:24:07.472401    3779 docker_sandbox.go:263] Couldn't find network status for deis/slugbuild-go-e91bdc46-86c1ffb7 through plugin: invalid network status for
Apr 19 17:24:08 minikube localkube[3779]: W0419 17:24:08.204843    3779 docker_sandbox.go:263] Couldn't find network status for deis/slugbuild-go-e91bdc46-86c1ffb7 through plugin: invalid network status for
Apr 19 17:25:40 minikube localkube[3779]: W0419 17:25:40.035329    3779 docker_sandbox.go:263] Couldn't find network status for deis/slugbuild-go-e91bdc46-86c1ffb7 through plugin: invalid network status for
Apr 19 17:26:11 minikube localkube[3779]: W0419 17:26:11.239871    3779 docker_sandbox.go:263] Couldn't find network status for deis/slugbuild-go-e91bdc46-86c1ffb7 through plugin: invalid network status for
Apr 19 17:26:12 minikube localkube[3779]: W0419 17:26:12.108681    3779 docker_sandbox.go:263] Couldn't find network status for deis/slugbuild-go-e91bdc46-86c1ffb7 through plugin: invalid network status for
Apr 19 17:26:12 minikube localkube[3779]: W0419 17:26:12.247901    3779 docker_sandbox.go:263] Couldn't find network status for deis/slugbuild-go-e91bdc46-86c1ffb7 through plugin: invalid network status for
Apr 19 17:28:11 minikube localkube[3779]: E0419 17:28:11.242706    3779 kubelet.go:1549] Unable to mount volumes for pod "slugbuild-go-e91bdc46-86c1ffb7_deis(fe13c4a4-2524-11e7-b1e2-080027ce5f69)": timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-go-e91bdc46-86c1ffb7". list of unattached/unmounted volumes=[objectstorage-keyfile go-build-env default-token-pxkb7]; skipping pod
Apr 19 17:28:11 minikube localkube[3779]: E0419 17:28:11.242887    3779 pod_workers.go:182] Error syncing pod fe13c4a4-2524-11e7-b1e2-080027ce5f69 ("slugbuild-go-e91bdc46-86c1ffb7_deis(fe13c4a4-2524-11e7-b1e2-080027ce5f69)"), skipping: timeout expired waiting for volumes to attach/mount for pod "deis"/"slugbuild-g-e91bdc46-86c1ffb7". list of unattached/unmounted volumes=[objectstorage-keyfile go-build-env default-token-pxkb7]

sionide21 commented 7 years ago

The only parts that would run simultaneously are the io.Copy and the waitForPodEnd. They are actually triggering on the same condition - the container exiting - but now the io.Copy isn't automatically aware of that condition.

Once the pod is detected as having exited, it's safe to stop reading from rc because there's nothing writing to it.

bacongobbler commented 7 years ago

Relatedly, I just refactored the entire k8s client library to use k8s.io/client-go, bumping us up to the latest and greatest client libs. We were using v1.2.4 before. That may or may not help assist with the root issue, but it certainly doesn't hurt to try.

bacongobbler commented 7 years ago

PR: https://github.com/deis/builder/pull/509

bacongobbler commented 7 years ago

If io.Copy isn't seeing io.EOF when fetching pod logs on k8s v1.6 then I'm going to to assume it's a client regression. This code works on v1.5.3 as @mboersma previously pointed out so there's a code regression somewhere upstream... Just a matter of figuring out why io.Copy isn't seeing io.EOF.

I still want to figure out the root cause before we start refactoring the original code's behaviour and introduce more regressions.

bacongobbler commented 7 years ago

https://github.com/kubernetes/kubernetes/pull/44406 seems to be the relevant fix which made it into v1.6.2. If anyone has time to test v1.6.2, that would be helpful.

bacongobbler commented 7 years ago

I was able to verify that this has been fixed in Kubernetes v1.6.2. I'd highly suggest everyone on Kubernetes 1.6 to upgrade to the latest patch release.

-----> Compiled slug size is 1.9M
Build complete.
Launching App...
...
Done, go:v2 deployed to Workflow
$ curl go.fishr.pw
Powered by Deis
Release v2 on go-web-1961603935-xz07w

deis / builder

kubernetes-1.6: build is never ends #507