deis / builder

Git server and application builder for Deis Workflow
https://deis.com
MIT License

[Meta] Race condition for when builder sometimes misses build pod runs #298

Closed arschles closed 8 years ago

arschles commented 8 years ago

Summary

In cases where the slugbuilder or dockerbuilder pod launches, runs, and finishes before the builder starts looking for it, the builder will hang forever.
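To make the race concrete, here is a minimal sketch, not the builder's actual code. It assumes the builder waits by polling the pod's phase; the `Phase` type, both function names, and the fixed slice of observations are hypothetical stand-ins for real Kubernetes API calls. A check that accepts only the Running phase never succeeds if polling begins after the pod has already finished, whereas a check that also accepts terminal phases returns immediately.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase is a hypothetical stand-in for a pod's lifecycle phase.
type Phase string

const (
	Pending   Phase = "Pending"
	Running   Phase = "Running"
	Succeeded Phase = "Succeeded"
	Failed    Phase = "Failed"
)

// errTimeout models giving up; without a bound, the racy wait would hang instead.
var errTimeout = errors.New("gave up waiting for build pod")

// waitRunningOnly models the racy check: it is satisfied only by Running,
// so a pod that has already reached a terminal phase is never noticed.
func waitRunningOnly(observed []Phase) error {
	for _, p := range observed {
		if p == Running {
			return nil
		}
	}
	return errTimeout
}

// waitStartedOrDone also accepts terminal phases, so a pod that finished
// before polling began is still detected on the first observation.
func waitStartedOrDone(observed []Phase) (Phase, error) {
	for _, p := range observed {
		switch p {
		case Running, Succeeded, Failed:
			return p, nil
		}
	}
	return "", errTimeout
}

func main() {
	// The builder starts polling only after the pod has already finished.
	late := []Phase{Succeeded, Succeeded, Succeeded}

	fmt.Println(waitRunningOnly(late))

	phase, err := waitStartedOrDone(late)
	fmt.Println(phase, err)
}
```

The finite slice stands in for a polling loop; in a loop with no upper bound, the first function's failure mode is the hang described above rather than an error.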

If You Are Reporting a Bug

Please see the description above.

Replication Steps

As this is a race condition, it is non-deterministic. It has been reported in a variety of different scenarios, but here are some tips to increase the likelihood that it arises:

  1. Ensure that the slugbuilder or dockerbuilder image (depending on the type of your app) is already pulled on all nodes in your cluster.
  2. Push a repository that will build quickly.
  3. Use fast object storage for your k8s deployment. For example, if you're running on GKE, use GCS.

When you've met the above conditions, simply follow these steps:

  1. Create a deis cluster
  2. Create a deis app (deis create myapp)
  3. git push deis master

Desired result:

The builder hangs. The output immediately before the hang looks similar to the following:

Pod spec: {
  "metadata": {
    "name": "slugbuild-gotest-e96335c6-899b3d78",
    "namespace": "deis",
    "creationTimestamp": null,
    "labels": {
      "heritage": "slugbuild-gotest-e96335c6-899b3d78"
    }
  },
  "spec": {
    "volumes": [
      {
        "name": "objectstorage-keyfile",
        "secret": {
          "secretName": "objectstorage-keyfile"
        }
      }
    ],
    "containers": [
      {
        "name": "deis-slugbuilder",
        "image": "quay.io/deisci/slugbuilder:git-c55ef21",
        "env": [
          {
            "name": "DEBUG",
            "value": "1"
          },
          {
            "name": "TAR_PATH",
            "value": "home/gotest:git-e96335c6/tar"
          },
          {
            "name": "PUT_PATH",
            "value": "home/gotest:git-e96335c6/push"
          },
          {
            "name": "BUILDER_STORAGE",
            "value": "gcs"
          }
        ],
        "resources": {},
        "volumeMounts": [
          {
            "name": "objectstorage-keyfile",
            "readOnly": true,
            "mountPath": "/var/run/secrets/deis/objectstore/creds"
          }
        ],
        "imagePullPolicy": "IfNotPresent"
      }
    ],
    "restartPolicy": "Never",
    "serviceAccountName": ""
  },
  "status": {}
}

After a long period of time, the git push should fail with text similar to the below, and no app should be deployed.

remote: 2016/04/15 17:45:26 Error running git receive hook [attempting to stream logs (Get https://gke-aaron-0520c439-node-z665:10250/containerLogs/deis/slugbuild-gotest-e96335c6-899b3d78/deis-slugbuilder?follow=true: EOF)]

Related Issues

NOTE: all of the above issues should be closed (with the possible exception of the last one, if jobs aren't chosen as the solution) when this is resolved

kmala commented 8 years ago

I think we should be using jobs as it would make the solution simpler, cleaner and easy to maintain.

arschles commented 8 years ago

Punting to beta4, as a partial fix (at least) for this has been completed in #304

arschles commented 8 years ago

@kmala @smothiki has #304 completely fixed this issue?

arschles commented 8 years ago

Punting to RC1 for now

kmala commented 8 years ago

57 hasn't been done yet

bacongobbler commented 8 years ago

sounds like it's been fixed. closing!