deis / controller

Deis Workflow Controller (API)
https://deis.com
MIT License

Deployment errors can lead to wrong pods being deployed #989

Closed · felixbuenemann closed this issue 8 years ago

felixbuenemann commented 8 years ago

Due to the problem fixed by #946, deployments on our Deis v2.3.0 cluster were failing most of the time, but not always.

Because version numbers are incremented sequentially and a failed deploy does not skip its version number but re-uses it for the next deploy, a new deployment could end up referencing pods from a previous failed deployment with the same version number.
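
In other words, everything the controller creates for a release is keyed on the app name plus the version number, so a re-used version collides with leftovers from the failed attempt. An illustrative sketch (not the controller's actual code):

# Illustrative only: shows how the release secret name and deployment labels
# seen later in this issue are derived from app + version.
def release_artifacts(app, version):
    return {
        "secret": "{}-v{}-env".format(app, version),  # e.g. "cms-v50-env"
        "labels": {"app": app, "heritage": "deis", "version": "v{}".format(version)},
    }

print(release_artifacts("cms", 50))
# {'secret': 'cms-v50-env', 'labels': {'app': 'cms', 'heritage': 'deis', 'version': 'v50'}}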

Example:

Release | Description | Comment
v49 | buenemann deployed 2aead2a | successful release
v50 | gitlab deployed 7c1d5e02 | this release failed with a 409 Conflict error when writing the secrets and is not listed in deis releases
v50 | gitlab deployed 4e0a908 | this deployment went through, but actually used the pod with image localhost:5555/cms:git-7c1d5e02 from the previous failed release, which was also tagged with version=v50 and therefore matched the selectors of the new deployment
v51 | buenemann deployed 4e0a908 | successful deploy of 4e0a908 after upgrading to v2.4.0

This ended up with the following replica set on v50:

apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata: 
  annotations: 
    deployment.kubernetes.io/revision: "3"
    kubernetes.io/change-cause: gitlab deployed 7c1d5e0
  creationTimestamp: 2016-08-18T11:58:43Z
  generation: 1
  labels: 
    app: cms
    heritage: deis
    pod-template-hash: "3228967599"
    type: cmd
    version: v50
  name: cms-cmd-3228967599
  namespace: cms
  resourceVersion: "5041061"
  selfLink: /apis/extensions/v1beta1/namespaces/cms/replicasets/cms-cmd-3228967599
  uid: 1c3d409c-653b-11e6-bedf-0274740f37d1
spec: 
  replicas: 1
  selector: 
    matchLabels: 
      app: cms
      heritage: deis
      pod-template-hash: "3228967599"
      type: cmd
  template: 
    metadata: 
      creationTimestamp: null
      labels: 
        app: cms
        heritage: deis
        pod-template-hash: "3228967599"
        type: cmd
        version: v50
      name: cms-cmd
    spec: 
      containers: 
        - env: 
            - name: DATABASE_URL
              valueFrom: 
                secretKeyRef: 
                  key: database-url
                  name: cms-v50-env
            - name: DEIS_APP
              valueFrom: 
                secretKeyRef: 
                  key: deis-app
                  name: cms-v50-env
            - name: DEIS_KUBERNETES_DEPLOYMENTS
              valueFrom: 
                secretKeyRef: 
                  key: deis-kubernetes-deployments
                  name: cms-v50-env
            - name: INSTALL_TOOL_PASSWORD
              valueFrom: 
                secretKeyRef: 
                  key: install-tool-password
                  name: cms-v50-env
            - name: PORT
              valueFrom: 
                secretKeyRef: 
                  key: port
                  name: cms-v50-env
            - name: SSH_KEY
              valueFrom: 
                secretKeyRef: 
                  key: ssh-key
                  name: cms-v50-env
            - name: TRUSTED_HOSTS_PATTERN
              valueFrom: 
                secretKeyRef: 
                  key: trusted-hosts-pattern
                  name: cms-v50-env
            - name: WORKFLOW_RELEASE
              valueFrom: 
                secretKeyRef: 
                  key: workflow-release
                  name: cms-v50-env
          image: localhost:5555/cms:git-7c1d5e02
          imagePullPolicy: Always
          name: cms-cmd
          readinessProbe: 
            failureThreshold: 1
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            tcpSocket: 
              port: 80
            timeoutSeconds: 5
          resources: {}
          terminationMessagePath: /dev/termination-log
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
status: 
  fullyLabeledReplicas: 1
  observedGeneration: 1
  replicas: 1

Controller Log on deis-controller v2.3.0:

Failed v50 release (gitlab deployed 7c1d5e0):

10.2.50.8 "POST /v2/hooks/build HTTP/1.1" 200 26 "deis-builder"
10.2.50.8 "GET /v2/hooks/key/bf:67:1d:7d:0a:c3:64:79:7b:e3:9d:9e:ad:ff:f6:4a HTTP/1.1" 200 131 "deis-builder"
10.2.50.8 "POST /v2/hooks/push HTTP/1.1" 201 357 "deis-builder"
10.2.50.8 "POST /v2/hooks/config HTTP/1.1" 200 2775 "deis-builder"
INFO [cms]: build cms-7f7d7bf created
INFO:api.models.app:[cms]: build cms-7f7d7bf created
INFO [cms]: gitlab deployed 7c1d5e0
INFO:api.models.app:[cms]: gitlab deployed 7c1d5e0
INFO Pulling Docker image localhost:5555/cms:git-7c1d5e02
INFO:registry.dockerclient:Pulling Docker image localhost:5555/cms:git-7c1d5e02
INFO [cms]: adding 5s on to the original 120s timeout to account for the initial delay specified in the liveness / readiness probe
INFO:scheduler:[cms]: adding 5s on to the original 120s timeout to account for the initial delay specified in the liveness / readiness probe
INFO [cms]: This deployments overall timeout is 125s - batch timout is 125s and there are 1 batches to deploy with a total of 1 pods
INFO:scheduler:[cms]: This deployments overall timeout is 125s - batch timout is 125s and there are 1 batches to deploy with a total of 1 pods
ERROR [cms]: (app::deploy): There was a problem while deploying v50 of cms-web. Going back to the previous release
ERROR:api.models.app:[cms]: (app::deploy): There was a problem while deploying v50 of cms-web. Going back to the previous release
ERROR:root:(app::deploy): There was a problem while deploying v50 of cms-web. Going back to the previous release
Traceback (most recent call last):
  File "/app/scheduler/__init__.py", line 560, in set_application_config
    self.get_secret(namespace, secret_name)
  File "/app/scheduler/__init__.py", line 1206, in get_secret
    'get Secret "{}" in Namespace "{}"', name, namespace
scheduler.KubeHTTPException: ('failed to get Secret "cms-v50-env" in Namespace "cms": 404 Not Found\n{\'reason\': \'NotFound\', \'details\': {\'name\': \'cms-v50-env\', \'kind\': \'secrets\'}, \'metadata\': {}, \'code\': 404, \'status\': \'Failure\', \'kind\': \'Status\', \'message\': \'secrets "cms-v50-env" not found\', \'apiVersion\': \'v1\'}', 'cms-v50-env', 'cms')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/app/scheduler/__init__.py", line 155, in deploy_deployment
    **kwargs)
  File "/app/scheduler/__init__.py", line 1692, in update_deployment
    **kwargs)
  File "/app/scheduler/__init__.py", line 1854, in _build_deployment_manifest
    manifest['spec']['template'] = self._build_pod_manifest(namespace, name, image, **kwargs)
  File "/app/scheduler/__init__.py", line 461, in _build_pod_manifest
    self._set_container(namespace, container_name, container, **kwargs)
  File "/app/scheduler/__init__.py", line 587, in _set_container
    self.set_application_config(namespace, env, kwargs.get('version'))
  File "/app/scheduler/__init__.py", line 562, in set_application_config
    self.create_secret(namespace, secret_name, secrets_env, labels=labels)
  File "/app/scheduler/__init__.py", line 1269, in create_secret
    'failed to create Secret "{}" in Namespace "{}"', name, namespace
scheduler.KubeHTTPException: ('failed to failed to create Secret "cms-v50-env" in Namespace "cms": 409 Conflict\n{\'reason\': \'AlreadyExists\', \'details\': {\'name\': \'cms-v50-env\', \'kind\': \'secrets\'}, \'metadata\': {}, \'code\': 409, \'status\': \'Failure\', \'kind\': \'Status\', \'message\': \'secrets "cms-v50-env" already exists\', \'apiVersion\': \'v1\'}', 'cms-v50-env', 'cms')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/app/api/models/app.py", line 560, in deploy
    async_run(tasks)
  File "/app/api/utils.py", line 161, in async_run
    raise error
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/app/scheduler/__init__.py", line 114, in deploy
    self.deploy_deployment(namespace, name, image, entrypoint, command, **kwargs)
  File "/app/scheduler/__init__.py", line 169, in deploy_deployment
    ) from e
scheduler.KubeException: There was a problem while deploying v50 of cms-web. Going back to the previous release
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/app/api/models/build.py", line 64, in create
    self.app.deploy(new_release)
  File "/app/api/models/app.py", line 564, in deploy
    raise ServiceUnavailable(err) from e
api.exceptions.ServiceUnavailable: (app::deploy): There was a problem while deploying v50 of cms-web. Going back to the previous release
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/rest_framework/views.py", line 463, in dispatch
    response = handler(request, *args, **kwargs)
  File "/app/api/views.py", line 474, in create
    super(BuildHookViewSet, self).create(request, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/rest_framework/mixins.py", line 21, in create
    self.perform_create(serializer)
  File "/app/api/viewsets.py", line 21, in perform_create
    self.post_save(obj)
  File "/app/api/views.py", line 480, in post_save
    build.create(self.user)
  File "/app/api/models/build.py", line 71, in create
    raise DeisException(str(e)) from e
api.exceptions.DeisException: (app::deploy): There was a problem while deploying v50 of cms-web. Going back to the previous release
10.2.50.8 "POST /v2/hooks/build HTTP/1.1" 400 114 "deis-builder"

Successful v50 release (gitlab deployed 4e0a908):

10.2.50.8 "GET /v2/hooks/key/bf:67:1d:7d:0a:c3:64:79:7b:e3:9d:9e:ad:ff:f6:4a HTTP/1.1" 200 131 "deis-builder"
10.2.50.8 "POST /v2/hooks/push HTTP/1.1" 201 357 "deis-builder"
10.2.50.8 "POST /v2/hooks/config HTTP/1.1" 200 2775 "deis-builder"
INFO [cms]: build cms-db04ef1 created
INFO:api.models.app:[cms]: build cms-db04ef1 created
INFO [cms]: gitlab deployed 4e0a908
INFO:api.models.app:[cms]: gitlab deployed 4e0a908
INFO Pulling Docker image localhost:5555/cms:git-4e0a9088
INFO:registry.dockerclient:Pulling Docker image localhost:5555/cms:git-4e0a9088
INFO [cms]: Deployment cms-web with release v50 already exists. Stopping deploy
INFO:scheduler:[cms]: Deployment cms-web with release v50 already exists. Stopping deploy
INFO [cms]: Deployment cms-cmd with release v50 already exists. Stopping deploy
INFO:scheduler:[cms]: Deployment cms-cmd with release v50 already exists. Stopping deploy
10.2.50.8 "POST /v2/hooks/build HTTP/1.1" 200 26 "deis-builder"
10.2.50.13 "GET /v2/apps/cms/releases/?limit=100 HTTP/1.1" 200 14609 "Deis Client v2.3.0"
10.2.50.13 "GET /v2/apps/cms/releases/?limit=100 HTTP/1.1" 200 14609 "Deis Client v2.3.0"
10.2.50.8 "GET /v2/hooks/key/b4:8b:80:d0:e8:d4:12:14:7b:6e:5c:ad:c2:e9:74:43 HTTP/1.1" 200 134 "deis-builder"

The above log shows that the second deploy re-used the previously failed v50 release.
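
The "already exists. Stopping deploy" lines in the second log suggest a guard that short-circuits when a Deployment for the requested release version is already present, even if that Deployment was left behind by the failed attempt and still points at the old image. A rough sketch (helper names are hypothetical, not the actual controller code):

import logging

log = logging.getLogger(__name__)

def deploy_if_needed(scheduler, namespace, name, image, version):
    # Hypothetical sketch of the guard implied by the log above.
    existing = scheduler.get_deployment(namespace, name)  # assumed helper
    if existing and existing["metadata"]["labels"].get("version") == version:
        log.info("Deployment %s with release %s already exists. Stopping deploy",
                 name, version)
        return  # keeps whatever pod template (and image) the failed deploy left behind
    scheduler.update_deployment(namespace, name, image, version=version)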

Output from deis releases:

v51 2016-08-18T14:33:26Z    buenemann deployed 4e0a908
v50 2016-08-18T12:12:43Z    gitlab deployed 4e0a908
v49 2016-08-18T11:27:29Z    buenemann deployed 2aead2a
v48 2016-08-18T11:06:15Z    buenemann added DEIS_KUBERNETES_DEPLOYMENTS

Git history:

4e0a9088175a2664602f243937fc557466106247
bf362e11b53675bb3d4cdbaa14478dd267f10f37
7c1d5e02c55c3869a1b940bacfb8668789b5b2cc
2aead2aaeebb54c685a53e81f211fd3deaa202ac

seanknox commented 8 years ago

I can confirm this on my v2.4.1 cluster running a v2.4.2 controller image. Ultimately the wrong pods are deployed, and they fail to start while trying to find secrets that presumably no longer exist:

{kubelet ip-172-20-123-232.ec2.internal}  Warning   FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "acceptance-web" with RunContainerError: "GenerateRunContainerOptions: secrets \"acceptance-v54-env\" not found"

In v2.4.1, processes had to be manually scaled down and back up to recover, but v2.4.2 seems to do the right thing on subsequent (successful) deploys.

helgi commented 8 years ago

Was the app already in a broken state due to missing secret envs, or were you hitting 409s prior to doing the upgrade?

seanknox commented 8 years ago

@helgi Personally, I was seeing 409 errors but I'm no longer seeing them on v2.4.2; I did, however, see the wrong pod release deployed. Here I had just deployed v28, but a v27 pod was spun up:

$ kubectl get pods --namespace=demo                                   (master)
NAME                           READY     STATUS              RESTARTS   AGE
demo-web-3789243348-pgudd      0/1       ContainerCreating   0          1m
demo-web-3789243348-tbgkp      1/1       Running             0          1m
demo-web-754923408-1lhg6       1/1       Running             0          35m
demo-worker-1879168280-s7cv4   1/1       Running             0          1m
$ kubectl get pods --namespace=demo                                   (master)
NAME                           READY     STATUS              RESTARTS   AGE
demo-web-3789243348-pgudd      0/1       Running             0          4m
demo-web-3789243348-tbgkp      1/1       Running             0          4m
demo-web-754923408-wsgrp       0/1       RunContainerError   0          1m
demo-worker-1879168280-s7cv4   1/1       Running             0          4m

A describe on the problem pod shows that it's the previous version (v27, instead of v28):

$ kubectl describe --namespace=demo pod demo-web-754923408-wsgrp      (master)
Name:       demo-web-754923408-wsgrp
Namespace:      demo
Node:       ip-172-20-143-164.ec2.internal/172.20.143.164
Start Time:     Mon, 29 Aug 2016 15:42:03 -0700
Labels:     app=demo
            heritage=deis
            pod-template-hash=754923408
            type=web
            version=v27
Status:     Pending
IP:         100.96.1.19
Controllers:    ReplicaSet/demo-web-754923408
Containers:
  demo-web:
    Container ID:
    Image:          localhost:5555/demo:git-2a5a6d89
    Image ID:
    Port:
    Command:
      /bin/bash
      -c
    Args:
      ./script/start_rails_server.sh
    State:          Waiting
      Reason:       RunContainerError
    Ready:          False
    Restart Count:      0
    Readiness:      tcp-socket :3000 delay=5s timeout=5s period=5s #success=1 #failure=1
    Environment Variables:
      AWS_DEFAULT_REGION:       <set to the key 'aws-default-region' in secret 'demo-v27-env'>
      ...
Conditions:
  Type      Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  default-token-2povj:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-2povj
QoS Tier:       BestEffort
Events:
  FirstSeen     LastSeen        Count   From                        SubobjectPath       Type            Reason      Message
  ---------     --------        -----   ----                        -------------       --------    ------      -------
  2m        2m          1       {default-scheduler }                            Normal          Scheduled       Successfully assigned demo-web-754923408-wsgrp to ip-172-20-143-164.ec2.internal
  1m        11s         8       {kubelet ip-172-20-143-164.ec2.internal}        spec.containers{demo-web}       Normal      Pulling     pulling image "localhost:5555/demo:git-2a5a6d89"
  1m        10s         8       {kubelet ip-172-20-143-164.ec2.internal}        spec.containers{demo-web}       Normal      Pulled      Successfully pulled image "localhost:5555/demo:git-2a5a6d89"
  1m        10s         8       {kubelet ip-172-20-143-164.ec2.internal}                    Warning         FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "demo-web" with RunContainerError: "GenerateRunContainerOptions: secrets \"demo-v27-env\" not found"

seanknox commented 8 years ago

@helgi The above release was a new image; I'm seeing the same behavior using deis config:set. Erroneous pods are deployed, fail to start, and then are cleaned up.

seanknox commented 8 years ago

One other tidbit: apps with a single app process end up with the working pod terminated prematurely; the bad pod then fails to start, requiring manually scaling down to zero and back up to recover. It looks like running at least two web pods protects against this scenario.

mboersma commented 8 years ago

Another user has reported what appears to be the same scenario:

Traceback (most recent call last):
  File "/app/scheduler/__init__.py", line 155, in deploy_deployment
    **kwargs)
  File "/app/scheduler/__init__.py", line 1692, in update_deployment
    **kwargs)
  File "/app/scheduler/__init__.py", line 1854, in _build_deployment_manifest
    manifest['spec']['template'] = self._build_pod_manifest(namespace, name, image, **kwargs)
  File "/app/scheduler/__init__.py", line 461, in _build_pod_manifest
    self._set_container(namespace, container_name, container, **kwargs)
  File "/app/scheduler/__init__.py", line 587, in _set_container
    self.set_application_config(namespace, env, kwargs.get('version'))
  File "/app/scheduler/__init__.py", line 562, in set_application_config
    self.create_secret(namespace, secret_name, secrets_env, labels=labels)
  File "/app/scheduler/__init__.py", line 1269, in create_secret
    'failed to create Secret "{}" in Namespace "{}"', name, namespace
scheduler.KubeHTTPException: ('failed to failed to create Secret "app-v8-env" in Namespace "app": 409 Conflict\n{\'kind\': \'Status\', \'reason\': \'AlreadyExists\', \'apiVersion\': \'v1\', \'message\': \'secrets "app-v8-env" already exists\', \'status\': \'Failure\', \'details\': {\'kind\': \'secrets\', \'name\': \'app-v8-env\'}, \'code\': 409, \'metadata\': {}}', 'app-v8-env', 'app')

Followed by:

ERROR [app]: (app::deploy): There was a problem while deploying v8 of app-web. Going back to the previous release 
ERROR:root:(app::deploy): There was a problem while deploying v8 of app-web. Going back to the previous release 
Traceback (most recent call last):
  File "/app/scheduler/__init__.py", line 560, in set_application_config
    self.get_secret(namespace, secret_name)
  File "/app/scheduler/__init__.py", line 1206, in get_secret
    'get Secret "{}" in Namespace "{}"', name, namespace
scheduler.KubeHTTPException: ('failed to get Secret "app-v8-env" in Namespace "app": 404 Not Found\n{\'kind\': \'Status\', \'reason\': \'NotFound\', \'apiVersion\': \'v1\', \'message\': \'secrets "app-v8-env" not found\', \'status\': \'Failure\', \'details\': {\'kind\': \'secrets\', \'name\': \'app-v8-env\'}, \'code\': 404, \'metadata\': {}}', 'app-v8-env', 'app')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/scheduler/__init__.py", line 155, in deploy_deployment
    **kwargs)
  File "/app/scheduler/__init__.py", line 1692, in update_deployment
    **kwargs)
  File "/app/scheduler/__init__.py", line 1854, in _build_deployment_manifest
    manifest['spec']['template'] = self._build_pod_manifest(namespace, name, image, **kwargs)
  File "/app/scheduler/__init__.py", line 461, in _build_pod_manifest
    self._set_container(namespace, container_name, container, **kwargs)
  File "/app/scheduler/__init__.py", line 587, in _set_container
    self.set_application_config(namespace, env, kwargs.get('version'))
  File "/app/scheduler/__init__.py", line 562, in set_application_config
    self.create_secret(namespace, secret_name, secrets_env, labels=labels)
  File "/app/scheduler/__init__.py", line 1269, in create_secret
    'failed to create Secret "{}" in Namespace "{}"', name, namespace
scheduler.KubeHTTPException: ('failed to failed to create Secret "app-v8-env" in Namespace "app": 409 Conflict\n{\'kind\': \'Status\', \'reason\': \'AlreadyExists\', \'apiVersion\': \'v1\', \'message\': \'secrets "app-v8-env" already exists\', \'status\': \'Failure\', \'details\': {\'kind\': \'secrets\', \'name\': \'app-v8-env\'}, \'code\': 409, \'metadata\': {}}', 'app-v8-env', 'app')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/api/models/app.py", line 560, in deploy
    async_run(tasks)
  File "/app/api/utils.py", line 161, in async_run
    raise error
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/app/scheduler/__init__.py", line 114, in deploy
    self.deploy_deployment(namespace, name, image, entrypoint, command, **kwargs)
  File "/app/scheduler/__init__.py", line 169, in deploy_deployment
    ) from e
scheduler.KubeException: There was a problem while deploying v8 of app-web. Going back to the previous release

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/api/models/build.py", line 64, in create
    self.app.deploy(new_release)
  File "/app/api/models/app.py", line 564, in deploy
    raise ServiceUnavailable(err) from e
api.exceptions.ServiceUnavailable: (app::deploy): There was a problem while deploying v8 of app-web. Going back to the previous release

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/rest_framework/views.py", line 463, in dispatch
    response = handler(request, *args, **kwargs)
  File "/app/api/views.py", line 474, in create
    super(BuildHookViewSet, self).create(request, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/rest_framework/mixins.py", line 21, in create
    self.perform_create(serializer)
  File "/app/api/viewsets.py", line 21, in perform_create
    self.post_save(obj)
  File "/app/api/views.py", line 480, in post_save
    build.create(self.user)
  File "/app/api/models/build.py", line 71, in create
    raise DeisException(str(e)) from e
api.exceptions.DeisException: (app::deploy): There was a problem while deploying v8 of app-web. Going back to the previous release
10.92.2.14 "POST /v2/hooks/build HTTP/1.1" 400 113 "deis-builder"

I'm marking this high priority so we focus on getting a fix in by Workflow 2.5.

helgi commented 8 years ago

Did you get their workflow version? Did they tell you what operations they were running?

helgi commented 8 years ago

@seanknox do you set any health checks? Since we use Deployments (and even before that), a pod should not be terminated until a new one is up, but if you have no health checks then none of the systems can determine when it's safe to kill old pods beyond knowing that a new pod is running.
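
For reference, a health check ends up as a probe stanza on the container spec; the sketch below mirrors the tcpSocket readinessProbe visible in the ReplicaSet pasted earlier (the helper name is hypothetical, not the controller's API):

# Roughly what a readiness probe adds to the container spec the controller
# generates; the values mirror the ReplicaSet shown earlier in this issue.
def add_tcp_readiness_probe(container, port):
    container["readinessProbe"] = {
        "tcpSocket": {"port": port},
        "initialDelaySeconds": 5,
        "periodSeconds": 5,
        "timeoutSeconds": 5,
        "successThreshold": 1,
        "failureThreshold": 1,
    }
    return container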

helgi commented 8 years ago

@mboersma btw, looking at the stack trace it seems set_application_config is still being called from within the scheduler, which I changed in 2.4.x to be done in the App model (see https://github.com/deis/controller/commit/855bb59db379dcf5b673c2b2d596662cc2d34fa9). That suggests the user is not on the latest version, which is why I need the version info.
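
The linked commit has the real diff; roughly, the shape of the change is to create the release secret once from the App model before the per-process-type deploys fan out, instead of checking and creating it inside the scheduler for every container. An illustrative sketch (names are assumptions, not the actual controller code):

import functools

def deploy(app, release, scheduler, async_run):
    # Illustrative only: one writer creates/updates the release secret up front,
    # so the concurrent per-process-type deploys (web, cmd, ...) no longer race
    # on a check-then-create.
    secret_name = "{}-v{}-env".format(app.id, release.version)
    scheduler.create_or_update_secret(app.id, secret_name, release.config)  # assumed helper

    tasks = [
        functools.partial(scheduler.deploy, app.id, proc_type, release.image)
        for proc_type in app.process_types
    ]
    async_run(tasks)  # same fan-out that appears in the tracebacks above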

mboersma commented 8 years ago

@helgi you're right: quay.io/deis/controller:v2.3.1 according to the GKE console. I'll see if they are ready to upgrade.

Edit: since there's a workaround I think they are anticipating upgrading to Workflow v2.5, especially if we can close this issue out in that release.

jeroenvisser101 commented 8 years ago

@helgi the stacktrace @mboersma posted is mine, it's from 2.4.1. We'll be upgrading to 2.4.2 soon, and 2.5 after.

helgi commented 8 years ago

@jeroenvisser101 Are you sure? Given the code and what @mboersma said, it points to the 2.3.x series.

jeroenvisser101 commented 8 years ago

@helgi positive. I'll send you the Zendesk ticket on Slack. Edit: negative, we didn't upgrade the controller.

helgi commented 8 years ago

Confirmed with @jeroenvisser101 that the controller image is 2.3.1.

seanknox commented 8 years ago

> @seanknox do you set any health checks? Since we use Deployments (and even before that), a pod should not be terminated until a new one is up, but if you have no health checks then none of the systems can determine when it's safe to kill old pods beyond knowing that a new pod is running.

@helgi I didn't explicitly define healthchecks for this app, but I do see this in Readiness:

$ kubectl describe pod --namespace=perftest perftest-web-2567960716-3z0zq
...
State:          Running
      Started:      Wed, 31 Aug 2016 15:44:12 -0700
    Ready:          True
    Restart Count:      0
    Readiness:      tcp-socket :3000 delay=5s timeout=5s period=5s #success=1 #failure=1

...

Port 3000 is the TCP port the app service is listening on.

seanknox commented 8 years ago

@helgi et al.:

Steps to reproduce:

  1. Scale a web process to 1 pod
  2. Begin any kind of deploy: deis pull, git push, config:set, etc.

Initially, a new web pod is rolled out to replace the existing one:

$ kubectl get pods --namespace=perftest                                                   
NAME                               READY     STATUS    RESTARTS   AGE
perftest-web-1834550257-ltk3h      1/1       Running   0          10m
perftest-web-2689598126-qjz8x      0/1       Running   0          1m
perftest-worker-4260140018-ajmfb   1/1       Running   0          1m

Before the new web pod is in service, another pod tied to the previous release is erroneously launched. The "good" replacement pod is still waiting to become ready, while the pod from the bad release immediately fails to start. The original, ready web pod is terminated prematurely, leaving no pod able to serve traffic:

$ kubectl get pods --namespace=perftest                                                    
NAME                               READY     STATUS              RESTARTS   AGE
perftest-web-1834550257-ltk3h      1/1       Terminating         0          11m
perftest-web-1834550257-xd5lc      0/1       RunContainerError   0          15s
perftest-web-2689598126-qjz8x      0/1       Running             0          2m
perftest-worker-4260140018-ajmfb   1/1       Running             0          2m

FWIW, I set readiness and liveness healthchecks with deis healthchecks:set and this still happens.

$ deis healthchecks:set readiness httpGet 3000 --type web -a perftest
Applying readinessProbe healthcheck... done

=== perftest Healthchecks

web:
--- Liveness
Initial Delay (seconds): 50
Timeout (seconds): 50
Period (seconds): 10
Success Threshold: 1
Failure Threshold: 3
Exec Probe: N/A
HTTP GET Probe: Path="/" Port=3000 HTTPHeaders=[]
TCP Socket Probe: N/A

--- Readiness
Initial Delay (seconds): 50
Timeout (seconds): 50
Period (seconds): 10
Success Threshold: 1
Failure Threshold: 3
Exec Probe: N/A
HTTP GET Probe: Path="/" Port=3000 HTTPHeaders=[]
TCP Socket Probe: N/A

This is a showstopper for us. Let me know if I can provide any additional detail.

helgi commented 8 years ago

Can you describe the pod that is in the RunContainerError state?

seanknox commented 8 years ago

@helgi sure, and here's some other info:

app in question

This happens on every other deploy:

New web pod has been deployed and is coming up

Name | Release | State
perftest-web-237897655-exzie | v18 | Ready (serving traffic)
perftest-web-2719287424-r99ij | v19 | Starting
$ kubectl get pods --namespace=perftest
NAME                               READY     STATUS    RESTARTS   AGE
perftest-web-237897655-exzie       1/1       Running   0          4m
perftest-web-2719287424-r99ij      0/1       Running   0          1m
perftest-worker-3388053956-lsj4l   1/1       Running   0          1m

The new pod isn't ready within 120s, so the remaining good pod is killed and a bad pod is scheduled

Name | Release | State
perftest-web-237897655-exzie | v18 | Terminating
perftest-web-2719287424-r99ij | v19 | Starting
perftest-web-237897655-34by9 | v18 | RunContainerError
$ kubectl get pods --namespace=perftest
NAME                               READY     STATUS              RESTARTS   AGE
perftest-web-237897655-34by9       0/1       RunContainerError   0          11s
perftest-web-237897655-exzie       1/1       Terminating         0          4m
perftest-web-2719287424-r99ij      0/1       Running             0          2m
perftest-worker-3388053956-lsj4l   1/1       Running             0          2m

kubectl describe new-web-pod-valid-release

(This is the new pod that's perpetually stuck in a Running but not Ready state.)

$ kubectl describe --namespace=perftest pod perftest-web-2719287424-r99ij
Name:       perftest-web-2719287424-r99ij
Namespace:      perftest
Node:       ip-172-20-143-164.ec2.internal/172.20.143.164
Start Time:     Wed, 31 Aug 2016 18:12:22 -0700
Labels:     app=perftest
            heritage=deis
            pod-template-hash=2719287424
            type=web
            version=v19
Status:     Running
IP:         100.96.1.8
Controllers:    ReplicaSet/perftest-web-2719287424
Containers:
  perftest-web:
    Container ID:       docker://b0133bedee2a075cdda4fe83d424fd59abe142d4c49dec3e9f7a648476cae91b
    Image:          quay.io/my_org/webapp:97d979f1efe263d7bfb0d1c716c58f4b3f595c55
    Image ID:       docker://sha256:ff55792f9441df165cc21a87c1224575167efff617800dc8a575871eeae5e136
    Port:
    Command:
      /bin/bash
      -c
    Args:
      ./script/start_rails_server.sh
    State:          Running
      Started:      Wed, 31 Aug 2016 18:12:31 -0700
    Ready:          False
    Restart Count:      0
    Readiness:      tcp-socket :3000 delay=5s timeout=5s period=5s #success=1 #failure=1
    Environment Variables:
      AWS_DEFAULT_REGION:           <set to the key 'aws-default-region' in secret 'perftest-v19-env'>
      ...
Conditions:
  Type      Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  default-token-gu23s:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-gu23s
QoS Tier:       BestEffort
Events:
  FirstSeen     LastSeen        Count   From                        SubobjectPath           Type        Reason      Message
  ---------     --------        -----   ----                        -------------           --------        ------      -------
  1m        1m          1       {default-scheduler }                                Normal      Scheduled       Successfully assigned perftest-web-2719287424-r99ij to ip-172-20-143-164.ec2.internal
  1m        1m          1       {kubelet ip-172-20-143-164.ec2.internal}        spec.containers{perftest-web}   Normal      Pulling     pulling image "quay.io/my_org/webapp:97d979f1efe263d7bfb0d1c716c58f4b3f595c55"
  1m        1m          1       {kubelet ip-172-20-143-164.ec2.internal}        spec.containers{perftest-web}   Normal      Pulled      Successfully pulled image "quay.io/my_org/webapp:97d979f1efe263d7bfb0d1c716c58f4b3f595c55"
  1m        1m          1       {kubelet ip-172-20-143-164.ec2.internal}        spec.containers{perftest-web}   Normal      Created     Created container with docker id b0133bedee2a
  1m        1m          1       {kubelet ip-172-20-143-164.ec2.internal}        spec.containers{perftest-web}   Normal      Started     Started container with docker id b0133bedee2a
  55s       5s          11      {kubelet ip-172-20-143-164.ec2.internal}        spec.containers{perftest-web}   Warning     Unhealthy       Readiness probe failed: dial tcp 100.96.1.8:3000: getsockopt: connection refused

kubectl describe new-web-pod-bad-release

(This is the erroneously scheduled pod that hits RunContainerError when it can't find secrets from a previous release)

$ kubectl describe --namespace=perftest pod perftest-web-237897655-34by9
Name:       perftest-web-237897655-34by9
Namespace:      perftest
Node:       ip-172-20-123-232.ec2.internal/172.20.123.232
Start Time:     Wed, 31 Aug 2016 18:14:29 -0700
Labels:     app=perftest
            heritage=deis
            pod-template-hash=237897655
            type=web
            version=v18
Status:     Pending
IP:         100.96.0.7
Controllers:    ReplicaSet/perftest-web-237897655
Containers:
  perftest-web:
    Container ID:
    Image:          quay.io/my_org/webapp:1ec7e18206df943090e411b3c1e4791ce4f22c88
    Image ID:
    Port:
    Command:
      /bin/bash
      -c
    Args:
      ./script/start_rails_server.sh
    State:          Waiting
      Reason:       RunContainerError
    Ready:          False
    Restart Count:      0
    Readiness:      tcp-socket :3000 delay=5s timeout=5s period=5s #success=1 #failure=1
    Environment Variables:
      AWS_DEFAULT_REGION:           <set to the key 'aws-default-region' in secret 'perftest-v18-env'>
     ...
Conditions:
  Type      Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  default-token-gu23s:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-gu23s
QoS Tier:       BestEffort
Events:
  FirstSeen     LastSeen        Count   From                        SubobjectPath           Type        Reason      Message
  ---------     --------        -----   ----                        -------------           --------        ------      -------
  39s       39s         1       {default-scheduler }                                Normal      Scheduled       Successfully assigned perftest-web-237897655-34by9 to ip-172-20-123-232.ec2.internal
  38s       9s          4       {kubelet ip-172-20-123-232.ec2.internal}        spec.containers{perftest-web}   Normal      Pulling     pulling image "quay.io/my_org/webapp:1ec7e18206df943090e411b3c1e4791ce4f22c88"
  37s       8s          4       {kubelet ip-172-20-123-232.ec2.internal}        spec.containers{perftest-web}   Normal      Pulled      Successfully pulled image "quay.io/my_org/webapp:1ec7e18206df943090e411b3c1e4791ce4f22c88"
  37s       8s          4       {kubelet ip-172-20-123-232.ec2.internal}                        Warning     FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "perftest-web" with RunContainerError: "GenerateRunContainerOptions: secrets \"perftest-v18-env\" not found"

15+ minutes later, the app still hasn't recovered and is unreachable

$ kubectl get pods --namespace=perftest
NAME                               READY     STATUS    RESTARTS   AGE
perftest-web-237897655-exzie       1/1       Running   0          4m
perftest-web-2719287424-r99ij      0/1       Running   0          1m
perftest-worker-3388053956-lsj4l   1/1       Running   0          1m

$ kubectl get pods --namespace=perftest                        
NAME                               READY     STATUS              RESTARTS   AGE
perftest-web-237897655-34by9       0/1       RunContainerError   0          11s
perftest-web-237897655-exzie       1/1       Terminating         0          4m
perftest-web-2719287424-r99ij      0/1       Running             0          2m
perftest-worker-3388053956-lsj4l   1/1       Running             0          2m

$ kubectl get pods --namespace=perftest                         
NAME                               READY     STATUS              RESTARTS   AGE
perftest-web-237897655-34by9       0/1       RunContainerError   0          5m
perftest-web-2719287424-r99ij      0/1       Running             0          7m
perftest-worker-3388053956-lsj4l   1/1       Running

$ kubectl get pods --namespace=perftest                       
NAME                               READY     STATUS              RESTARTS   AGE
perftest-web-237897655-34by9       0/1       RunContainerError   0          17m
perftest-web-2719287424-r99ij      0/1       Running             0          19m
perftest-worker-3388053956-lsj4l   1/1       Running             0          19m

felixbuenemann commented 8 years ago

I don't quite see how the change by @kmala relates to this issue.

If the API fails unexpectedly after the Kubernetes deployment has finished but before the new release has been recorded in the database, it will still reuse the same release version on the next deploy, even though it's already in use. Or am I missing something?

kmala commented 8 years ago

There are two issues being discussed here. My PR fixes the one described in https://github.com/deis/controller/issues/989#issuecomment-243952457. I think your issue is totally different: you are trying to deploy after a failed deploy, but before the failed deploy has completed fully on the server side; the CLI may return after a fixed time, which is why the new deploy starts prematurely.

felixbuenemann commented 8 years ago

I was hitting this problem in v2.3.0 with the secrets race condition; I haven't run into it since.

I think the best idea would be to never re-use a deployment version if the deployment fails and instead continue with the next number on the next deploy. Deis could record the status of the failed deploy in the database and list it as failed when calling deis releases.
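
A sketch of what that could look like on the controller's data model, in illustrative Django terms (not the actual Release model):

# Illustrative only -- not the actual deis/controller Release model.
from django.db import models

class Release(models.Model):
    app = models.ForeignKey('App', on_delete=models.CASCADE)
    version = models.PositiveIntegerField()
    failed = models.BooleanField(default=False)   # recorded even when the deploy fails
    summary = models.TextField(blank=True)

    class Meta:
        unique_together = ('app', 'version')      # a version number is never reused

    @classmethod
    def next_version(cls, app):
        # Failed releases keep their number, so the next deploy always moves on,
        # and `deis releases` can list them with a failed status.
        latest = cls.objects.filter(app=app).order_by('-version').first()
        return latest.version + 1 if latest else 1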

kmala commented 8 years ago

Yes, we are evaluating using different release numbers, marking the release status as a failed deploy, and making sure the user can't roll back to that specific release. The only time this issue occurs is when the rollback didn't complete properly and a new deploy happens in the meantime.