Log files above.
root@ip-10-11-213-129:/opt/kube-control-files# kubectl describe pod ambassador-fc89df746-xp4xp
Name: ambassador-fc89df746-xp4xp
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: ip-10-11-181-23.ec2.internal/10.11.181.23
Start Time: Thu, 02 Aug 2018 07:36:18 -0700
Labels: pod-template-hash=974589302
service=ambassador
Annotations: sumologic.com/sourceCategory=demo/kubeadm/webservice
sumologic.com/sourceName=ambassador
Status: Running
IP: 10.244.10.84
Controlled By: ReplicaSet/ambassador-fc89df746
Containers:
ambassador:
Container ID: docker://f2eeea0ac931ad51e9af993899efc64db649db63e6df9475415ae786231f6e41
Image: quay.io/datawire/ambassador:0.37.0
Image ID: docker-pullable://quay.io/datawire/ambassador@sha256:54b8cf9d1294b91e255f275efe9ddd33544edd29325e02e606c00cf4e5304154
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 02 Aug 2018 07:39:22 -0700
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 02 Aug 2018 07:36:25 -0700
Finished: Thu, 02 Aug 2018 07:39:21 -0700
Ready: False
Restart Count: 1
Limits:
cpu: 1
memory: 400Mi
Requests:
cpu: 200m
memory: 100Mi
Liveness: http-get http://:8877/ambassador/v0/check_alive delay=120s timeout=1s period=3s #success=1 #failure=3
Readiness: http-get http://:8877/ambassador/v0/check_ready delay=120s timeout=1s period=3s #success=1 #failure=3
Environment:
AMBASSADOR_NAMESPACE: default (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from ambassador-token-fjdvg (ro)
statsd:
Container ID: docker://0421b9fea9006d23e2245284a5506f12a76ec2735469da6962c86b8079097f01
Image: quay.io/datawire/statsd:0.37.0
Image ID: docker-pullable://quay.io/datawire/statsd@sha256:3fa72d34732362dc2fa9f88c52f816d51e70ef790f85261a841c9c64c6fefc24
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 02 Aug 2018 07:36:27 -0700
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from ambassador-token-fjdvg (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
ambassador-token-fjdvg:
Type: Secret (a volume populated by a Secret)
SecretName: ambassador-token-fjdvg
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m default-scheduler Successfully assigned default/ambassador-fc89df746-xp4xp to ip-10-11-181-23.ec2.internal
Normal Pulled 3m kubelet, ip-10-11-181-23.ec2.internal Successfully pulled image "quay.io/datawire/ambassador:0.37.0"
Normal Created 3m kubelet, ip-10-11-181-23.ec2.internal Created container
Normal Started 3m kubelet, ip-10-11-181-23.ec2.internal Started container
Normal Pulling 3m kubelet, ip-10-11-181-23.ec2.internal pulling image "quay.io/datawire/statsd:0.37.0"
Normal Pulled 3m kubelet, ip-10-11-181-23.ec2.internal Successfully pulled image "quay.io/datawire/statsd:0.37.0"
Normal Created 3m kubelet, ip-10-11-181-23.ec2.internal Created container
Normal Started 3m kubelet, ip-10-11-181-23.ec2.internal Started container
Warning Unhealthy 1m (x3 over 1m) kubelet, ip-10-11-181-23.ec2.internal Liveness probe failed: Get http://10.244.10.84:8877/ambassador/v0/check_alive: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 36s (x12 over 1m) kubelet, ip-10-11-181-23.ec2.internal Readiness probe failed: Get http://10.244.10.84:8877/ambassador/v0/check_ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Normal Pulling 35s (x2 over 3m) kubelet, ip-10-11-181-23.ec2.internal pulling image "quay.io/datawire/ambassador:0.37.0"
Normal Killing 35s kubelet, ip-10-11-181-23.ec2.internal Killing container with id docker://ambassador:Container failed liveness probe.. Container will be killed and recreated.
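Note the probe settings in this output: timeout=1s and period=3s against diagd on port 8877, with a 1-CPU limit on the container. When diagd is busy or the container is being CPU-throttled, a 1-second timeout is easy to miss. A minimal sketch of how one might loosen those settings while debugging (illustrative values only, not a maintainer recommendation; it assumes the ambassador container is the first container in the pod spec, as it is in the describe output above):
# Relax probe timing on the ambassador container so a briefly busy/throttled pod
# is not killed immediately. Values below are illustrative assumptions.
kubectl patch deployment ambassador --type=json -p '[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds",   "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds",  "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/failureThreshold","value": 5}
]'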
On Slack, the person who ran into this is "Deep"
I ran into the same issue it seems:
Versions:
Ambassador: 0.35.2
Kubernetes environment: kubeadm
installed cluster on Digital Ocean
Version: v1.11.0
Node info:
"NodeInfo": {
"KernelVersion": "4.4.0-130-generic",
"OSImage": "Ubuntu 16.04.4 LTS",
"ContainerRuntimeVersion": "docker://1.13.1",
"KubeletVersion": "v1.11.0",
"KubeProxyVersion": "v1.11.0",
"OperatingSystem": "linux",
"Architecture": "amd64"
}
kubectl describe pod ambassador-7c7d9b85c9-clxv5:
Name: ambassador-7c7d9b85c9-clxv5
Namespace: default
Node: service-worker-01
Start Time: Mon, 09 Jul 2018 16:09:42 +0200
Labels: pod-template-hash=3738564175
service=ambassador
Annotations: sidecar.istio.io/inject=false
Status: Running
IP: 10.244.1.148
Controlled By: ReplicaSet/ambassador-7c7d9b85c9
Containers:
ambassador:
Container ID: docker://0bee51e2f187d76bb5e6a4a445ab150a71cbede534c8f71c67f8be357da6aa3e
Image: quay.io/datawire/ambassador:0.35.2
Image ID: docker-pullable://quay.io/datawire/ambassador@sha256:90e930f427733664e7d6a1e03efdc75a324754b22b365aa677f257b496d204a1
Port: <none>
Host Port: <none>
State: Running
Started: Tue, 07 Aug 2018 14:14:22 +0200
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Tue, 07 Aug 2018 14:04:47 +0200
Finished: Tue, 07 Aug 2018 14:14:22 +0200
Ready: True
Restart Count: 26
Limits:
cpu: 1
memory: 400Mi
Requests:
cpu: 200m
memory: 100Mi
Liveness: http-get http://:8877/ambassador/v0/check_alive delay=30s timeout=1s period=3s #success=1 #failure=3
Readiness: http-get http://:8877/ambassador/v0/check_ready delay=30s timeout=1s period=3s #success=1 #failure=3
Environment:
AMBASSADOR_NAMESPACE: default (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from ambassador-token-qjh69 (ro)
statsd:
Container ID: docker://568c947734414371a3d8cfa456d9d573947c5c64d017451a6301700191177007
Image: quay.io/datawire/statsd:0.35.2
Image ID: docker-pullable://quay.io/datawire/statsd@sha256:222e856cc729b1988c159d454b440d3cc97e65d2e48ef289b982ce44d8f00142
Port: <none>
Host Port: <none>
State: Running
Started: Wed, 11 Jul 2018 11:44:22 +0200
Last State: Terminated
Reason: Error
Exit Code: 143
Started: Mon, 09 Jul 2018 16:10:05 +0200
Finished: Wed, 11 Jul 2018 11:40:42 +0200
Ready: True
Restart Count: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from ambassador-token-qjh69 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
ambassador-token-qjh69:
Type: Secret (a volume populated by a Secret)
SecretName: ambassador-token-qjh69
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 57m (x55 over 5h) kubelet, service-worker-01 Liveness probe failed: Get http://10.244.1.148:8877/ambassador/v0/check_alive: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 56m (x69 over 26d) kubelet, service-worker-01 Liveness probe failed: Get http://10.244.1.148:8877/ambassador/v0/check_alive: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 46m (x150 over 5h) kubelet, service-worker-01 Readiness probe failed: Get http://10.244.1.148:8877/ambassador/v0/check_ready: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Link to log: https://datawire-oss.slack.com/files/UBK0N1GNS/FC48SSM0V/-.txt
Unable to reproduce on 1.11.0 or 1.11.2 on minikube with Ambassador 0.38.0, 0.37.0, or 0.35.2 :man_facepalming: @Octopixell, any specific instructions on how to reproduce this?
@containscafeine - what were your exact steps to reproduce?
My steps were the following:
If you need the specific kube configs for the various services used, I can share those as well.
Okay, just tried on DigitalOcean + kubeadm + flannel, everything works fine for me :thinking: -
# kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:53:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:08:19Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Update - Nothing breaks on AWS + kubeadm + flannel either. Trying playbooks by @Octopixell now.
Bummer!! Not sure how to take this further - are there any logs which I could collect to help narrow this issue?
@Octopixell were you able to access the readiness/liveness probes via your browser like @ddeep2007?
Given that the probes are accessible via browser, it does seem to suggest some sort of routing issue.
@richarddli sporadically yes but not all the time.
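One way to narrow down where the timeout actually happens (a sketch using the pod name and IP from the first describe output above; it assumes curl is present in the ambassador image, otherwise wget -qO- works the same way):
# From inside the pod itself; if this is also slow, diagd is busy rather than the network:
kubectl exec -it ambassador-fc89df746-xp4xp -c ambassador -- \
  curl -sv --max-time 1 http://localhost:8877/ambassador/v0/check_alive
# From the node running the kubelet, against the pod IP; if only this one times out,
# it points at CNI/routing between node and pod rather than at Ambassador itself:
curl -sv --max-time 1 http://10.244.10.84:8877/ambassador/v0/check_alive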
So, @Octopixell helped with debugging this on Slack. Here's what he found out -
NAME READY STATUS RESTARTS AGE
ambassador-778f8764fc-6bkr8 2/2 Running 0 56m
ambassador-778f8764fc-htmsv 1/2 Running 0 5m
ambassador-778f8764fc-xp5xj 1/2 Running 0 5m
I, however, have not been able to reproduce this exactly. I do not get crashing ambassador pods, but I do see increased coredns CPU usage as reported in #568.
I see the same messages in the statsd log referenced in #568 and #465.
2018/08/02 18:05:36 socat[4919] E getaddrinfo("statsd-sink", "NULL", {1,0,2,17}, {}): Name does not resolve
@ddeep2007 if you delete the statsd container from your deployment, does the issue go away?
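For the getaddrinfo("statsd-sink") error above, a quick check of whether this is a missing Service or a cluster-DNS problem might look like this (a sketch; pod name taken from the listing above, and it assumes nslookup exists in the statsd image):
kubectl get svc --all-namespaces | grep statsd-sink   # is there a statsd-sink Service at all?
kubectl exec -it ambassador-778f8764fc-6bkr8 -c statsd -- nslookup statsd-sink
kubectl exec -it ambassador-778f8764fc-6bkr8 -c statsd -- nslookup kubernetes.default
# If even kubernetes.default fails to resolve, the problem is cluster DNS (kube-dns/coredns),
# not just the missing statsd-sink Service.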
After removing the statsd container from the deployment, the liveness and readiness probes were still failing with a single replica running. When I upped the replicas to 3, all 3 pods eventually got into the Ready state (it took a long time to become READY), but the liveness and readiness probes still fail on one of the pods.
Actually I think the problem isn't gone at all. I am still observing restarts -
ambassador-f6846547-cpqph 0/1 Running 4 19m
ambassador-f6846547-jv4w4 0/1 Running 2 15m
ambassador-f6846547-prdnv 0/1 Running 2 15m
root@ip-:/opt/kube-control-files# kubectl describe pod ambassador-f6846547-cpqph
Name: ambassador-f6846547-cpqph
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: .ec2.internal/
Start Time: Thu, 09 Aug 2018 11:31:34 -0700
Labels: pod-template-hash=92402103
service=ambassador
Status: Running
IP:
Controlled By: ReplicaSet/ambassador-f6846547
Containers:
ambassador:
Container ID: docker://25b00e27bbfbf8658487307d90fe37268e0db408c03e3935eefc87a69f2aca7d
Image: quay.io/datawire/ambassador:0.35.3
Image ID: docker-pullable://quay.io/datawire/ambassador@sha256:202669fc3043c5ae577b5255cd3b49f3a8038ca4f9fc02a5724bd2baffb7e389
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 09 Aug 2018 11:52:34 -0700
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 09 Aug 2018 11:49:43 -0700
Finished: Thu, 09 Aug 2018 11:52:33 -0700
Ready: False
Restart Count: 5
Limits:
cpu: 1
memory: 400Mi
Requests:
cpu: 200m
memory: 100Mi
Liveness: http-get http://:8877/ambassador/v0/check_alive delay=120s timeout=1s period=3s #success=1 #failure=3
Readiness: http-get http://:8877/ambassador/v0/check_ready delay=120s timeout=1s period=3s #success=1 #failure=3
Environment:
AMBASSADOR_NAMESPACE: default (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from ambassador-token-wfgkn (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
ambassador-token-wfgkn:
Type: Secret (a volume populated by a Secret)
SecretName: ambassador-token-wfgkn
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 21m default-scheduler Successfully assigned default/ambassador-f6846547-cpqph to .ec2.internal
Normal Pulling 18m (x2 over 21m) kubelet, .ec2.internal pulling image "quay.io/datawire/ambassador:0.35.3"
Normal Killing 18m kubelet, .ec2.internal Killing container with id docker://ambassador:Container failed liveness probe.. Container will be killed and recreated.
Normal Pulled 18m (x2 over 21m) kubelet, .ec2.internal Successfully pulled image "quay.io/datawire/ambassador:0.35.3"
Normal Created 18m (x2 over 21m) kubelet,.ec2.internal Created container
Normal Started 18m (x2 over 21m) kubelet, .ec2.internal Started container
Warning Unhealthy 6m (x29 over 18m) kubelet, .ec2.internal Readiness probe failed: Get http://:8877/ambassador/v0/check_ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 43s (x13 over 18m) kubelet, .ec2.internal Liveness probe failed: Get http://:8877/ambassador/v0/check_alive: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
@ddeep2007 @Octopixell could you try out the latest release 0.39.0 and see if it fixed this issue? - https://github.com/datawire/ambassador/releases/latest
I'm facing this issue with 0.50.1 installed via helm:
Name: ambassador-8cb7878cc-bbk6d
Namespace: ambassador
Priority: 0
PriorityClassName: <none>
Node: aks-agentpool-12567806-5/10.240.0.15
Start Time: Mon, 18 Feb 2019 18:00:14 -0500
Labels: app=ambassador
pod-template-hash=476343477
release=ambassador
service=ambassador
Annotations: prometheus.io/port: 9102
prometheus.io/scrape: true
sidecar.istio.io/inject: false
Status: Running
IP: 10.244.1.111
Controlled By: ReplicaSet/ambassador-8cb7878cc
Containers:
statsd-sink:
Container ID: docker://64afb211f9460bc12abf7569467764c91132f05f7d543e2011a195327e8998a8
Image: prom/statsd-exporter:v0.6.0
Image ID: docker-pullable://prom/statsd-exporter@sha256:dc46327ef5577e13955a08ae42a15b83803f0cf50abdbe035edc67842636a997
Ports: 9102/TCP, 8125/TCP
Host Ports: 0/TCP, 0/TCP
Args:
-statsd.listen-address=:8125
-statsd.mapping-config=/statsd-exporter/mapping-config.yaml
State: Running
Started: Mon, 18 Feb 2019 18:00:17 -0500
Ready: True
Restart Count: 0
Environment:
KUBERNETES_PORT_443_TCP_ADDR: REDACTED.azmk8s.io
KUBERNETES_PORT: tcp://REDACTED.azmk8s.io:443
KUBERNETES_PORT_443_TCP: tcp://REDACTED.azmk8s.io:443
KUBERNETES_SERVICE_HOST: REDACTED.azmk8s.io
Mounts:
/statsd-exporter/ from stats-exporter-mapping-config (ro)
/var/run/secrets/kubernetes.io/serviceaccount from ambassador-token-24c4t (ro)
ambassador:
Container ID: docker://7613cc0558a929c561562e6aaa0d3fedfe5c9753292088702b0b04bc8db2b1c0
Image: quay.io/datawire/ambassador:0.50.1
Image ID: docker-pullable://quay.io/datawire/ambassador@sha256:eb7d8a22016e3442062b4510890182d48ac07035644e0e374659bee260434c4b
Ports: 80/TCP, 443/TCP, 8877/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
State: Running
Started: Mon, 18 Feb 2019 18:03:43 -0500
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Mon, 18 Feb 2019 18:02:33 -0500
Finished: Mon, 18 Feb 2019 18:03:42 -0500
Ready: False
Restart Count: 3
Liveness: http-get http://:admin/ambassador/v0/check_alive delay=30s timeout=1s period=3s #success=1 #failure=3
Readiness: http-get http://:admin/ambassador/v0/check_ready delay=30s timeout=1s period=3s #success=1 #failure=3
Environment:
AMBASSADOR_ID: default
AMBASSADOR_NAMESPACE: ambassador (v1:metadata.namespace)
KUBERNETES_PORT_443_TCP_ADDR: REDACTED.azmk8s.io
KUBERNETES_PORT: tcp://REDACTED.azmk8s.io:443
KUBERNETES_PORT_443_TCP: tcp://REDACTED.azmk8s.io:443
KUBERNETES_SERVICE_HOST: REDACTED.azmk8s.io
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from ambassador-token-24c4t (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
stats-exporter-mapping-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: ambassador-config
Optional: false
ambassador-token-24c4t:
Type: Secret (a volume populated by a Secret)
SecretName: ambassador-token-24c4t
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m4s default-scheduler Successfully assigned ambassador/ambassador-8cb7878cc-bbk6d to aks-agentpool-12567806-5
Normal Pulled 4m2s kubelet, aks-agentpool-12567806-5 Container image "prom/statsd-exporter:v0.6.0" already present on machine
Normal Created 4m2s kubelet, aks-agentpool-12567806-5 Created container
Normal Started 4m1s kubelet, aks-agentpool-12567806-5 Started container
Normal Started 4m1s kubelet, aks-agentpool-12567806-5 Started container
Warning Unhealthy 3m25s (x3 over 3m31s) kubelet, aks-agentpool-12567806-5 Liveness probe failed: HTTP probe failed with statuscode: 503
Warning Unhealthy 2m55s (x13 over 3m31s) kubelet, aks-agentpool-12567806-5 Readiness probe failed: HTTP probe failed with statuscode: 503
Normal Pulled 2m54s (x2 over 4m1s) kubelet, aks-agentpool-12567806-5 Container image "quay.io/datawire/ambassador:0.50.1" already present on machine
Normal Created 2m54s (x2 over 4m1s) kubelet, aks-agentpool-12567806-5 Created container
Normal Killing 2m54s kubelet, aks-agentpool-12567806-5 Killing container with id docker://ambassador:Container failed liveness probe.. Container will be killed and recreated.
Hmm, I was installing via the datawire/ambassador chart. Now I tried again with stable/ambassador and it works.
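For reference, the two install paths being compared look roughly like this (Helm 2 syntax from that time; the datawire chart repo URL is an assumption, so double-check it against the Ambassador docs):
helm repo add datawire https://getambassador.io      # repo URL is an assumption
helm install --name ambassador --namespace ambassador datawire/ambassador   # chart that failed for me
helm install --name ambassador --namespace ambassador stable/ambassador     # chart that works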
Have the maintainers stopped trying to solve this problem? We are experiencing the same readiness/liveness probe failures. We are using the stable/ambassador chart and no luck.
I ran into a similar issue and resolved it by moving to the latest version (0.53.1) and ensuring I was using the latest deployment provided here:
https://getambassador.io/yaml/ambassador/ambassador-rbac.yaml
I believe it may be related to the addition of the podAntiAffinity rule that has been added.
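For anyone comparing against an older manifest, a preferred pod anti-affinity rule of the kind referred to here looks roughly like the following when applied to the Deployment (a sketch only; the exact rule in ambassador-rbac.yaml may differ):
kubectl patch deployment ambassador --patch "$(cat <<'EOF'
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  service: ambassador
              topologyKey: kubernetes.io/hostname
EOF
)"
# This only spreads replicas across nodes; it will not by itself fix probe timeouts,
# but it avoids several replicas competing for CPU on the same node.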
There is no single root cause of this issue. We have investigated and put in place many different mitigations for various common cases, and at this point it seems we work reliably for the "common case". I'm going to close this issue and if there are specific reproducible issues, we can open up separate issues for them.
It happened to my cluster also. How can I fix that?
It is happening with my cluster also. Basically, it happens when the ambassador pod has been running for a long time; suddenly it starts showing readiness probe errors.
Ambassador is running with 2 pods, and the Prometheus health check on /metrics is returning 200.
Ambassador pod logs:
I0401 15:28:59.184501 153 leaderelection.go:246] failed to acquire lease ambassador/acmeclient
2020-04-01 15:29:08 diagd 1.2.2 [P141TThreadPoolExecutor-0_4] INFO: XXXXXX-2860-4257-85C3-XXXXXX: 100.102.0.7 "GET /metrics" START
2020-04-01 15:29:08 diagd 1.2.2 [P141TThreadPoolExecutor-0_4] INFO: XXXXXX-2860-4257-85C3-XXXXXX: 100.102.0.7: 100.102.0.7 "GET /metrics" 10ms 200 success
I0401 15:29:13.101967 153 leaderelection.go:350] lock is held by ambassador-xxxxx-qdlgk and has not yet expired
I0401 15:29:13.101983 153 leaderelection.go:246] failed to acquire lease ambassador/acmeclient
2020-04-01 15:29:18 diagd 1.2.2 [P141TThreadPoolExecutor-0_3] INFO: XXXXXX-2860-4257-85C3-XXXXXX: 100.102.0.7: 100.102.0.7 "GET /metrics" START
2020-04-01 15:29:18 diagd 1.2.2 [P141TThreadPoolExecutor-0_3] INFO: XXXXXX-2860-4257-85C3-XXXXXX: 100.102.0.7: 100.102.0.7 "GET /metrics" 11ms 200 success
2020-04-01 15:29:28 diagd 1.2.2 [P141TThreadPoolExecutor-0_5] INFO: XXXXXX-2860-4257-85C3-XXXXXX: 100.102.0.7: 100.102.0.7 "GET /metrics" START
2020-04-01 15:29:28 diagd 1.2.2 [P141TThreadPoolExecutor-0_5] INFO: XXXXXXX-301F-XXXX-8A93-XXXX: 100.102.0.7 "GET /metrics" 11ms 200 success
It's the same issue with our cluster as well, with image tag 1.5.3.
Describe the bug
Readiness and liveness probes are failing on a Kubernetes 1.11 cluster.
To Reproduce
I can hit the liveness and readiness URLs from the browser and I get the correct liveness and readiness messages, so it seems like they are accessible.
Versions (please complete the following information):
log.txt