Closed: felixbuenemann closed this issue 8 years ago.
I can confirm this on my v2.4.1 cluster running a v2.4.2 controller image. Ultimately the wrong pods are deployed, and they fail to start because they reference secrets that presumably no longer exist:
{kubelet ip-172-20-123-232.ec2.internal} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "acceptance-web" with RunContainerError: "GenerateRunContainerOptions: secrets \"acceptance-v54-env\" not found"
In v2.4.1, processes had to be manually scaled down and back up to be corrected, but v2.4.2 seems to do the right thing on subsequent (successful) deploys.
Was the app already in a broken state due to missing secret envs, or were you hitting 409s prior to doing the upgrade?
@helgi I was seeing 409 errors, but I'm no longer seeing them on v2.4.2; instead I saw the wrong pod release deployed. Here I had just deployed v28, but a v27 pod was spun up:
$ kubectl get pods --namespace=demo (master)
NAME READY STATUS RESTARTS AGE
demo-web-3789243348-pgudd 0/1 ContainerCreating 0 1m
demo-web-3789243348-tbgkp 1/1 Running 0 1m
demo-web-754923408-1lhg6 1/1 Running 0 35m
demo-worker-1879168280-s7cv4 1/1 Running 0 1m
$ kubectl get pods --namespace=demo (master)
NAME READY STATUS RESTARTS AGE
demo-web-3789243348-pgudd 0/1 Running 0 4m
demo-web-3789243348-tbgkp 1/1 Running 0 4m
demo-web-754923408-wsgrp 0/1 RunContainerError 0 1m
demo-worker-1879168280-s7cv4 1/1 Running 0 4m
A describe on the problem pod shows that it's the previous version (v27, instead of v28):
$ kubectl describe --namespace=demo pod demo-web-754923408-wsgrp (master)
Name: demo-web-754923408-wsgrp
Namespace: demo
Node: ip-172-20-143-164.ec2.internal/172.20.143.164
Start Time: Mon, 29 Aug 2016 15:42:03 -0700
Labels: app=demo
heritage=deis
pod-template-hash=754923408
type=web
version=v27
Status: Pending
IP: 100.96.1.19
Controllers: ReplicaSet/demo-web-754923408
Containers:
demo-web:
Container ID:
Image: localhost:5555/demo:git-2a5a6d89
Image ID:
Port:
Command:
/bin/bash
-c
Args:
./script/start_rails_server.sh
State: Waiting
Reason: RunContainerError
Ready: False
Restart Count: 0
Readiness: tcp-socket :3000 delay=5s timeout=5s period=5s #success=1 #failure=1
Environment Variables:
AWS_DEFAULT_REGION: <set to the key 'aws-default-region' in secret 'demo-v27-env'>
...
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-2povj:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-2povj
QoS Tier: BestEffort
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
2m 2m 1 {default-scheduler } Normal Scheduled Successfully assigned demo-web-754923408-wsgrp to ip-172-20-143-164.ec2.internal
1m 11s 8 {kubelet ip-172-20-143-164.ec2.internal} spec.containers{demo-web} Normal Pulling pulling image "localhost:5555/demo:git-2a5a6d89"
1m 10s 8 {kubelet ip-172-20-143-164.ec2.internal} spec.containers{demo-web} Normal Pulled Successfully pulled image "localhost:5555/demo:git-2a5a6d89"
1m 10s 8 {kubelet ip-172-20-143-164.ec2.internal} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "demo-web" with RunContainerError: "GenerateRunContainerOptions: secrets \"demo-v27-env\" not found"
@helgi The above release was a new image; I'm seeing the same behavior using deis config:set. Erroneous pods are deployed, fail to start, and then are cleaned up.
One other tidbit: apps running a single web process end up having the working pod terminated prematurely; the bad pod then fails to start, requiring a manual scale down to zero and back up to recover. It looks like running at least two web pods protects against this scenario.
Another user has reported what appears to be the same scenario:
Traceback (most recent call last):
File "/app/scheduler/__init__.py", line 155, in deploy_deployment
**kwargs)
File "/app/scheduler/__init__.py", line 1692, in update_deployment
**kwargs)
File "/app/scheduler/__init__.py", line 1854, in _build_deployment_manifest
manifest['spec']['template'] = self._build_pod_manifest(namespace, name, image, **kwargs)
File "/app/scheduler/__init__.py", line 461, in _build_pod_manifest
self._set_container(namespace, container_name, container, **kwargs)
File "/app/scheduler/__init__.py", line 587, in _set_container
self.set_application_config(namespace, env, kwargs.get('version'))
File "/app/scheduler/__init__.py", line 562, in set_application_config
self.create_secret(namespace, secret_name, secrets_env, labels=labels)
File "/app/scheduler/__init__.py", line 1269, in create_secret
'failed to create Secret "{}" in Namespace "{}"', name, namespace
scheduler.KubeHTTPException: ('failed to failed to create Secret "app-v8-env" in Namespace "app": 409 Conflict\n{\'kind\': \'Status\', \'reason\': \'AlreadyExists\', \'apiVersion\': \'v1\', \'message\': \'secrets "app-v8-env" already exists\', \'status\': \'Failure\', \'details\': {\'kind\': \'secrets\', \'name\': \'app-v8-env\'}, \'code\': 409, \'metadata\': {}}', 'app-v8-env', 'app')
Followed by:
ERROR [app]: (app::deploy): There was a problem while deploying v8 of app-web. Going back to the previous release
ERROR:root:(app::deploy): There was a problem while deploying v8 of app-web. Going back to the previous release
Traceback (most recent call last):
File "/app/scheduler/__init__.py", line 560, in set_application_config
self.get_secret(namespace, secret_name)
File "/app/scheduler/__init__.py", line 1206, in get_secret
'get Secret "{}" in Namespace "{}"', name, namespace
scheduler.KubeHTTPException: ('failed to get Secret "app-v8-env" in Namespace "app": 404 Not Found\n{\'kind\': \'Status\', \'reason\': \'NotFound\', \'apiVersion\': \'v1\', \'message\': \'secrets "app-v8-env" not found\', \'status\': \'Failure\', \'details\': {\'kind\': \'secrets\', \'name\': \'app-v8-env\'}, \'code\': 404, \'metadata\': {}}', 'app-v8-env', 'app')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/scheduler/__init__.py", line 155, in deploy_deployment
**kwargs)
File "/app/scheduler/__init__.py", line 1692, in update_deployment
**kwargs)
File "/app/scheduler/__init__.py", line 1854, in _build_deployment_manifest
manifest['spec']['template'] = self._build_pod_manifest(namespace, name, image, **kwargs)
File "/app/scheduler/__init__.py", line 461, in _build_pod_manifest
self._set_container(namespace, container_name, container, **kwargs)
File "/app/scheduler/__init__.py", line 587, in _set_container
self.set_application_config(namespace, env, kwargs.get('version'))
File "/app/scheduler/__init__.py", line 562, in set_application_config
self.create_secret(namespace, secret_name, secrets_env, labels=labels)
File "/app/scheduler/__init__.py", line 1269, in create_secret
'failed to create Secret "{}" in Namespace "{}"', name, namespace
scheduler.KubeHTTPException: ('failed to failed to create Secret "app-v8-env" in Namespace "app": 409 Conflict\n{\'kind\': \'Status\', \'reason\': \'AlreadyExists\', \'apiVersion\': \'v1\', \'message\': \'secrets "app-v8-env" already exists\', \'status\': \'Failure\', \'details\': {\'kind\': \'secrets\', \'name\': \'app-v8-env\'}, \'code\': 409, \'metadata\': {}}', 'app-v8-env', 'app')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/api/models/app.py", line 560, in deploy
async_run(tasks)
File "/app/api/utils.py", line 161, in async_run
raise error
File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
result = self.fn(*self.args, **self.kwargs)
File "/app/scheduler/__init__.py", line 114, in deploy
self.deploy_deployment(namespace, name, image, entrypoint, command, **kwargs)
File "/app/scheduler/__init__.py", line 169, in deploy_deployment
) from e
scheduler.KubeException: There was a problem while deploying v8 of app-web. Going back to the previous release
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/api/models/build.py", line 64, in create
self.app.deploy(new_release)
File "/app/api/models/app.py", line 564, in deploy
raise ServiceUnavailable(err) from e
api.exceptions.ServiceUnavailable: (app::deploy): There was a problem while deploying v8 of app-web. Going back to the previous release
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/rest_framework/views.py", line 463, in dispatch
response = handler(request, *args, **kwargs)
File "/app/api/views.py", line 474, in create
super(BuildHookViewSet, self).create(request, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/rest_framework/mixins.py", line 21, in create
self.perform_create(serializer)
File "/app/api/viewsets.py", line 21, in perform_create
self.post_save(obj)
File "/app/api/views.py", line 480, in post_save
build.create(self.user)
File "/app/api/models/build.py", line 71, in create
raise DeisException(str(e)) from e
api.exceptions.DeisException: (app::deploy): There was a problem while deploying v8 of app-web. Going back to the previous release
10.92.2.14 "POST /v2/hooks/build HTTP/1.1" 400 113 "deis-builder"
I'm marking this high priority so we focus on getting a fix in by Workflow 2.5.
Did you get their workflow version? Did they tell you what operations they were running?
@seanknox do you set any health checks? Since we use deployments (and even before that) a pod should not be terminated until a new one is up but if you have no health checks then none of the systems can do checks to determine when it's safe to kill old pods beyond knowing a new pod is running
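To make that concrete: without a readiness probe, Kubernetes only knows that a container started, not that it can serve traffic. A minimal sketch of how a readiness probe appears in a container spec, written as the kind of plain dict the controller's manifest builder produces; the values are assumptions mirroring the tcp-socket :3000 probe visible in the describe output further down:

# Illustrative container spec fragment with a readiness probe, shaped like the
# dicts the controller's manifest builder emits; values are assumptions and
# only the structure matters here.
container = {
    "name": "perftest-web",
    "image": "quay.io/my_org/webapp:example-tag",
    "readinessProbe": {
        # Matches the "tcp-socket :3000 delay=5s timeout=5s period=5s" line in
        # kubectl describe: the pod is only marked Ready (and old pods only
        # become safe to retire) once this check passes.
        "tcpSocket": {"port": 3000},
        "initialDelaySeconds": 5,
        "timeoutSeconds": 5,
        "periodSeconds": 5,
        "successThreshold": 1,
        "failureThreshold": 1,
    },
}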
@mboersma btw, looking at the stacktrace it seems it is still using set_application_config called from within the scheduler, which I changed in 2.4.x to be done in the App model (see https://github.com/deis/controller/commit/855bb59db379dcf5b673c2b2d596662cc2d34fa9). That suggests the user is not on the latest version (which is why I need the version info).
@helgi you're right: quay.io/deis/controller:v2.3.1, according to the GKE console. I'll see if they are ready to upgrade.
Edit: since there's a workaround, I think they are anticipating upgrading to Workflow v2.5, especially if we can close this issue out in that release.
@helgi the stacktrace @mboersma posted is mine; it's from 2.4.1. We'll be upgrading to 2.4.2 soon, and 2.5 after.
@jeroenvisser101 Are you sure? Given the code and what @mboersma said it points to 2.3.x series
@helgi positive. I'll send you the Zendesk ticket on Slack. Correction: negative, we didn't upgrade the controller.
Confirmed with @jeroenvisser101 that controller image is 2.3.1
@seanknox do you set any health checks? Since we use deployments (and even before that) a pod should not be terminated until a new one is up but if you have no health checks then none of the systems can do checks to determine when it's safe to kill old pods beyond knowing a new pod is running
@helgi I didn't explicitly define healthchecks for this app, but I do see this in Readiness:
$ kubectl describe pod --namespace=perftest perftest-web-2567960716-3z0zq
...
State: Running
Started: Wed, 31 Aug 2016 15:44:12 -0700
Ready: True
Restart Count: 0
Readiness: tcp-socket :3000 delay=5s timeout=5s period=5s #success=1 #failure=1
...
Port 3000 is the TCP port the app service is listening on.
@helgi et al.:
Steps to reproduce: trigger a deploy via deis pull, git push, config:set, etc.
Initially, a new web pod is rolled out to replace the existing one:
$ kubectl get pods --namespace=perftest
NAME READY STATUS RESTARTS AGE
perftest-web-1834550257-ltk3h 1/1 Running 0 10m
perftest-web-2689598126-qjz8x 0/1 Running 0 1m
perftest-worker-4260140018-ajmfb 1/1 Running 0 1m
Before the new web pod is in service, another pod tied to the previous release is erroneously launched. The "good" replacement pod is still waiting to become ready, while the bad release pod immediately fails to start. The original, ready web pod is terminated prematurely, leaving no pod able to serve traffic:
$ kubectl get pods --namespace=perftest
NAME READY STATUS RESTARTS AGE
perftest-web-1834550257-ltk3h 1/1 Terminating 0 11m
perftest-web-1834550257-xd5lc 0/1 RunContainerError 0 15s
perftest-web-2689598126-qjz8x 0/1 Running 0 2m
perftest-worker-4260140018-ajmfb 1/1 Running 0 2m
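The window with no ready web pod is what a Deployment's rolling-update budget is meant to prevent. As a general Kubernetes illustration only (not the controller's actual manifest), a strategy fragment like the following keeps at least one ready pod during a rollout; it would not fix the stale-secret pod, but it shows which knobs govern when old pods may be killed:

# Illustrative Deployment strategy fragment (plain dict, assumed values):
# with maxUnavailable=0 the old ready pod may not be terminated until a
# replacement pod has passed its readiness probe.
strategy = {
    "type": "RollingUpdate",
    "rollingUpdate": {
        "maxUnavailable": 0,  # never drop below the desired ready count
        "maxSurge": 1,        # allow one extra pod while rolling over
    },
}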
FWIW, I set readiness and liveness healthchecks with deis healthchecks:set and this still happens.
$ deis healthchecks:set readiness httpGet 3000 --type web -a perftest
Applying readinessProbe healthcheck... done
=== perftest Healthchecks
web:
--- Liveness
Initial Delay (seconds): 50
Timeout (seconds): 50
Period (seconds): 10
Success Threshold: 1
Failure Threshold: 3
Exec Probe: N/A
HTTP GET Probe: Path="/" Port=3000 HTTPHeaders=[]
TCP Socket Probe: N/A
--- Readiness
Initial Delay (seconds): 50
Timeout (seconds): 50
Period (seconds): 10
Success Threshold: 1
Failure Threshold: 3
Exec Probe: N/A
HTTP GET Probe: Path="/" Port=3000 HTTPHeaders=[]
TCP Socket Probe: N/A
This is a showstopper for us. Let me know if I can provide any additional detail.
Can you describe the pod that is in the RunContainerError state?
@helgi sure, and here's some other info:
This happens on every other deploy; the stray pod ends up in RunContainerError.

Name | Release | State |
---|---|---|
perftest-web-237897655-exzie | v18 | Ready (serving traffic) |
perftest-web-2719287424-r99ij | v19 | Starting |
$ kubectl get pods --namespace=perftest
NAME READY STATUS RESTARTS AGE
perftest-web-237897655-exzie 1/1 Running 0 4m
perftest-web-2719287424-r99ij 0/1 Running 0 1m
perftest-worker-3388053956-lsj4l 1/1 Running 0 1m
Name | Release | State |
---|---|---|
perftest-web-237897655-exzie | v18 | Terminating |
perftest-web-2719287424-r99ij | v19 | Starting |
perftest-web-237897655-34by9 | v18 | RunContainerError |
$ kubectl get pods --namespace=perftest
NAME READY STATUS RESTARTS AGE
perftest-web-237897655-34by9 0/1 RunContainerError 0 11s
perftest-web-237897655-exzie 1/1 Terminating 0 4m
perftest-web-2719287424-r99ij 0/1 Running 0 2m
perftest-worker-3388053956-lsj4l 1/1 Running 0 2m
(This is the new pod that's perpetually stuck in a Running but not Ready state.)
kubectl describe --namespace=perftest pod perftest-web-2719287424-r99ij
Name: perftest-web-2719287424-r99ij
Namespace: perftest
Node: ip-172-20-143-164.ec2.internal/172.20.143.164
Start Time: Wed, 31 Aug 2016 18:12:22 -0700
Labels: app=perftest
heritage=deis
pod-template-hash=2719287424
type=web
version=v19
Status: Running
IP: 100.96.1.8
Controllers: ReplicaSet/perftest-web-2719287424
Containers:
perftest-web:
Container ID: docker://b0133bedee2a075cdda4fe83d424fd59abe142d4c49dec3e9f7a648476cae91b
Image: quay.io/my_org/webapp:97d979f1efe263d7bfb0d1c716c58f4b3f595c55
Image ID: docker://sha256:ff55792f9441df165cc21a87c1224575167efff617800dc8a575871eeae5e136
Port:
Command:
/bin/bash
-c
Args:
./script/start_rails_server.sh
State: Running
Started: Wed, 31 Aug 2016 18:12:31 -0700
Ready: False
Restart Count: 0
Readiness: tcp-socket :3000 delay=5s timeout=5s period=5s #success=1 #failure=1
Environment Variables:
AWS_DEFAULT_REGION: <set to the key 'aws-default-region' in secret 'perftest-v19-env'>
...
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-gu23s:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-gu23s
QoS Tier: BestEffort
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1m 1m 1 {default-scheduler } Normal Scheduled Successfully assigned perftest-web-2719287424-r99ij to ip-172-20-143-164.ec2.internal
1m 1m 1 {kubelet ip-172-20-143-164.ec2.internal} spec.containers{perftest-web} Normal Pulling pulling image "quay.io/my_org/webapp:97d979f1efe263d7bfb0d1c716c58f4b3f595c55"
1m 1m 1 {kubelet ip-172-20-143-164.ec2.internal} spec.containers{perftest-web} Normal Pulled Successfully pulled image "quay.io/my_org/webapp:97d979f1efe263d7bfb0d1c716c58f4b3f595c55"
1m 1m 1 {kubelet ip-172-20-143-164.ec2.internal} spec.containers{perftest-web} Normal Created Created container with docker id b0133bedee2a
1m 1m 1 {kubelet ip-172-20-143-164.ec2.internal} spec.containers{perftest-web} Normal Started Started container with docker id b0133bedee2a
55s 5s 11 {kubelet ip-172-20-143-164.ec2.internal} spec.containers{perftest-web} Warning Unhealthy Readiness probe failed: dial tcp 100.96.1.8:3000: getsockopt: connection refused
(This is the erroneously scheduled pod that hits RunContainerError when it can't find secrets from a previous release.)
kubectl describe --namespace=perftest pod perftest-web-237897655-34by9
Name: perftest-web-237897655-34by9
Namespace: perftest
Node: ip-172-20-123-232.ec2.internal/172.20.123.232
Start Time: Wed, 31 Aug 2016 18:14:29 -0700
Labels: app=perftest
heritage=deis
pod-template-hash=237897655
type=web
version=v18
Status: Pending
IP: 100.96.0.7
Controllers: ReplicaSet/perftest-web-237897655
Containers:
perftest-web:
Container ID:
Image: quay.io/my_org/webapp:1ec7e18206df943090e411b3c1e4791ce4f22c88
Image ID:
Port:
Command:
/bin/bash
-c
Args:
./script/start_rails_server.sh
State: Waiting
Reason: RunContainerError
Ready: False
Restart Count: 0
Readiness: tcp-socket :3000 delay=5s timeout=5s period=5s #success=1 #failure=1
Environment Variables:
AWS_DEFAULT_REGION: <set to the key 'aws-default-region' in secret 'perftest-v18-env'>
...
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-gu23s:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-gu23s
QoS Tier: BestEffort
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
39s 39s 1 {default-scheduler } Normal Scheduled Successfully assigned perftest-web-237897655-34by9 to ip-172-20-123-232.ec2.internal
38s 9s 4 {kubelet ip-172-20-123-232.ec2.internal} spec.containers{perftest-web} Normal Pulling pulling image "quay.io/my_org/webapp:1ec7e18206df943090e411b3c1e4791ce4f22c88"
37s 8s 4 {kubelet ip-172-20-123-232.ec2.internal} spec.containers{perftest-web} Normal Pulled Successfully pulled image "quay.io/my_org/webapp:1ec7e18206df943090e411b3c1e4791ce4f22c88"
37s 8s 4 {kubelet ip-172-20-123-232.ec2.internal} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "perftest-web" with RunContainerError: "GenerateRunContainerOptions: secrets \"perftest-v18-env\" not found"
$ kubectl get pods --namespace=perftest
NAME READY STATUS RESTARTS AGE
perftest-web-237897655-exzie 1/1 Running 0 4m
perftest-web-2719287424-r99ij 0/1 Running 0 1m
perftest-worker-3388053956-lsj4l 1/1 Running 0 1m
$ kubectl get pods --namespace=perftest
NAME READY STATUS RESTARTS AGE
perftest-web-237897655-34by9 0/1 RunContainerError 0 11s
perftest-web-237897655-exzie 1/1 Terminating 0 4m
perftest-web-2719287424-r99ij 0/1 Running 0 2m
perftest-worker-3388053956-lsj4l 1/1 Running 0 2m
$ kubectl get pods --namespace=perftest
NAME READY STATUS RESTARTS AGE
perftest-web-237897655-34by9 0/1 RunContainerError 0 5m
perftest-web-2719287424-r99ij 0/1 Running 0 7m
perftest-worker-3388053956-lsj4l 1/1 Running
$ kubectl get pods --namespace=perftest
NAME READY STATUS RESTARTS AGE
perftest-web-237897655-34by9 0/1 RunContainerError 0 17m
perftest-web-2719287424-r99ij 0/1 Running 0 19m
perftest-worker-3388053956-lsj4l 1/1 Running 0 19m
I don't quite see how the change by @kamals relates to this issue.
If the API fails unexpectedly after the Kubernetes deployment has finished, but before the new release has been recorded in the database, it will still reuse the same release version on the next deploy, even though it's already in use. Or am I missing something?
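A minimal sketch of that race, assuming the next release number is derived from the latest release recorded in the database (the names and data model here are illustrative, not the controller's actual code):

# Hypothetical illustration of the version-reuse race; not controller code.
recorded_releases = [6, 7]            # release numbers the database knows about

def next_version():
    return max(recorded_releases, default=0) + 1

v = next_version()                    # -> 8; Kubernetes objects and the
                                      #    "app-v8-env" secret get created
# ... the API then fails (or rolls back) before recording 8 in the database ...

v_retry = next_version()              # -> 8 again on the next deploy
assert v_retry == v                   # same number, so "app-v8-env" already
                                      # exists and the create returns 409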
There are two issues in this PR. My PR fixes the issue in https://github.com/deis/controller/issues/989#issuecomment-243952457. I think your issue is totally different: you are trying to deploy after a failed deploy, but before the failed deploy has fully completed on the server side; the CLI may return after a specific time, which is why the new deploy returns prematurely.
I was hitting this problem in v2.3.0 with the secrets race condition; I haven't run into it since.
I think the best idea would be to never re-use a deployment version if the deployment fails, and instead continue with the next number on the next deploy. Deis could record the status of the failed deploy in the database and list it as failed when calling deis releases.
Yes, we are evaluating using different release numbers, marking the status as a failed deploy, and also making sure that the user can't roll back to that specific release. The only time this issue occurs is when the rollback didn't complete properly and a new deploy happens in the meantime.
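A rough sketch of that proposal under the same illustrative model: record every attempt, including failures, so a failed number is never reissued and cannot be rolled back to (names are assumptions, not the actual schema):

# Hypothetical sketch of "never re-use a failed release number"; not the
# actual Deis schema or API.
releases = [(7, "succeeded")]                    # [(version, status)]

def start_release():
    version = (releases[-1][0] + 1) if releases else 1
    releases.append((version, "pending"))        # recorded before deploying
    return version

def finish_release(version, ok):
    idx = next(i for i, (v, _) in enumerate(releases) if v == version)
    releases[idx] = (version, "succeeded" if ok else "failed")

def can_rollback_to(version):
    return any(v == version and s == "succeeded" for v, s in releases)

v8 = start_release()
finish_release(v8, ok=False)                     # deploy failed
v9 = start_release()                             # next deploy gets a new number
assert v9 == 9 and not can_rollback_to(v8)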
Due to the problem fixed by #946, deployments on our Deis v2.3.0 cluster were failing most of the time, but not always.
Because version numbers are incremented sequentially and a failed deploy does not skip that version number but re-uses it for the next deploy, it happened that a new deployment referenced pods from a previous failed deployment with the same version number.
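The mechanism is ordinary Kubernetes label selection: the deployment selects pods by labels that include the release version, so leftover pods from a failed release with the same number satisfy the new selector. A small illustration using the label shape from the describe output above (the exact selector keys are an assumption about what the controller sets):

# Illustrative label matching; the selector keys are an assumption about what
# the controller puts on its deployments, based on the pod labels shown above.
def matches(selector, labels):
    """A selector matches a pod when every selector key/value is present."""
    return all(labels.get(k) == v for k, v in selector.items())

new_deployment_selector = {"app": "cms", "type": "web", "version": "v50"}

leftover_pod_labels = {           # pod from the earlier *failed* v50 deploy
    "app": "cms", "heritage": "deis", "type": "web", "version": "v50",
}

# Because the version number was re-used, the stale pod satisfies the new
# deployment's selector and gets counted toward (or adopted into) the rollout.
assert matches(new_deployment_selector, leftover_pod_labels)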
Example (deis releases output and git history at the bottom):
The new deployment used the image localhost:5555/cms:git-7c1d5e02 from the previous failed release, which was also tagged with version=v50, so it fit the selectors of the new deployment. This ended up with the following replica set on v50:
Controller Log on deis-controller v2.3.0:
Failed v50 release (gitlab deployed 7c1d5e0):
Successful v50 release (gitlab deployed 4e0a908):
The above log shows that it re-used the previous failed v50 release.
Output from deis releases:
Git history: