emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

Ambassador sends unexpected 503 on check_alive endpoint at startup #1784

Closed knlambert closed 4 years ago

knlambert commented 5 years ago

Describe the bug

The Ambassador liveness probe endpoint returns "ambassador seems to have died" during k8s watch initialization.

I ran a simple bash script during initialization, together with a kubectl port-forward, to measure how long the outage lasts:

# Forward the admin port in the background, then poll the probe endpoints once per second
kubectl port-forward <pod> 8877 &
while true
do
    date
    curl 127.0.0.1:8877/ambassador/v0/check_alive
    curl 127.0.0.1:8877/ambassador/v0/check_ready
    sleep 1
done

Here is the post-mortem:

Thu 29 Aug 2019 22:59:19 EDT
ambassador liveness check OK (18 seconds)
ambassador waiting for config
Thu 29 Aug 2019 22:59:21 EDT
ambassador seems to have died (20 seconds)
ambassador waiting for config
Thu 29 Aug 2019 22:59:22 EDT
ambassador seems to have died (21 seconds)
ambassador waiting for config
[...]
Thu 29 Aug 2019 22:59:40 EDT
ambassador seems to have died (39 seconds)
ambassador not ready (Never updated)
Thu 29 Aug 2019 22:59:42 EDT
ambassador seems to have died (41 seconds)
ambassador not ready (Never updated)
Thu 29 Aug 2019 22:59:43 EDT
ambassador liveness check OK (42 seconds)

Ambassador is down for 24 s, which is a problem because in the official Helm chart the liveness probe's initial delay is only 30 s: the outage starts when Ambassador has already been up for 18 s, so the probe's first checks land inside the failure window and we end up in an infinite crash loop.
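For reference, here is a minimal sketch of the probe configuration involved, as it would appear on the Ambassador container; the paths, the port, and the 30 s initial delay come from the chart and the script above, while periodSeconds and failureThreshold are only illustrative assumptions:

# Sketch of the probes on the Ambassador container.
# Paths, port 8877, and the 30 s initial delay are from the chart / this report;
# periodSeconds and failureThreshold are illustrative assumptions only.
livenessProbe:
  httpGet:
    path: /ambassador/v0/check_alive
    port: 8877
  initialDelaySeconds: 30
  periodSeconds: 3
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ambassador/v0/check_ready
    port: 8877
  initialDelaySeconds: 30
  periodSeconds: 3
  failureThreshold: 3

With numbers like these, the first liveness checks fire while the outage above (roughly seconds 20 to 41) is still in progress, so Kubernetes restarts the pod before the watches finish initializing.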

The liveness probe failure seems to coincide with the scan of the k8s watches, while Ambassador waits for them to start: it simply stops answering in the middle of the procedure. Here are some logs from 59:23:

2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate

Then it comes back up at 59:43, once all the watches are up to date:

[2019-08-30 02:59:39.559][240][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'ambassador-listener-8443'
2019-08-30 02:59:39 diagd 0.73.0 [P155TAmbassadorEventWatcher] WARNING: Scout: could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6a98d51630>, 'Connection to kubernaut.io timed out. (connect timeout=1)'))
2019-08-30 02:59:39 diagd 0.73.0 [P155TAmbassadorEventWatcher] INFO: Scout reports {"latest_version": "0.73.0", "exception": "could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6a98d51630>, 'Connection to kubernaut.io timed out. (connect timeout=1)'))", "cached": false, "timestamp": 1567133979.723536}
2019-08-30 02:59:39 diagd 0.73.0 [P155TAmbassadorEventWatcher] INFO: Scout notices: []

To Reproduce

Steps to reproduce the behavior:

  1. Create a service like this on your cluster.
apiVersion: v1
kind: Service
metadata:
  annotations:
    getambassador.io/config: |
      apiVersion: ambassador/v1
      kind:  Mapping
      ambassador_id: production
      name:  my_service_mapping_number_n
      prefix: "/"
      service: service_n.namespace_n.svc.cluster.local
      host: service_n.domain.com
      ---
      apiVersion: ambassador/v1
      kind: TLSContext
      name: my_service_context_number_n
      ambassador_id: production
      hosts:
      - service_n.domain.com
      secret: my-service-certificate.namespace_n
  2. Create a lot of them, at least 15 different services.

Expected behavior

Ambassador should not claim that it has died during initialization when that is obviously not the case; it should simply report that it is not ready.

This is painful because in the Helm chart the initial delay is set to 30 s.

Our Ambassador is a shared service with a lot of apps on it, so it takes longer than that to start, and the liveness probe ends up catching the 503 errors described above.

Then Kubernetes kills the pod and we are stuck in an infinite crash loop.

Versions

Ambassador 0.73.0 (diagd 0.73.0 in the logs above).

Additional context

We have a lot of services attached to this Ambassador, and unfortunately that makes the impact of this bug bigger than it would otherwise be. The fact that Ambassador has to load a huge number of certificates doesn't help, but the company's security team doesn't want to issue wildcard certificates, and we can't afford to manage one Ambassador per team.

hugocarrasco commented 5 years ago

Probably also related to #1661. I know it can be worked around by increasing the initial delay, but that shouldn't be something that has to be done manually (in my case I'm using the chart).
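For anyone hitting this in the meantime, here is a rough sketch of the workaround on the chart side, assuming your chart version exposes the probe delays as values (the key names below are an assumption, so check your chart's values.yaml):

# values override, a sketch only; key names are assumed and may differ per chart version
livenessProbe:
  initialDelaySeconds: 120
readinessProbe:
  initialDelaySeconds: 120

Applied with helm upgrade -f on the release. It only papers over the problem, though, since the right delay depends on how many services and certificates Ambassador has to load.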

klambertbnc commented 5 years ago

Hello,

Tested today with 0.81.0, same problem. I tested with some httpbin services, using a different TLS certificate for each endpoint:

apiVersion: v1
kind: Service
metadata:
  annotations:
    getambassador.io/config: |
      apiVersion: ambassador/v1
      kind:  Mapping
      ambassador_id: staging
      name:  alpha-httpbin_mapping
      prefix: /
      rewrite: /
      service: alpha-httpbin.1111-nonproduction-staging.svc.cluster.local
      host: alpha-httpbin-test.mgmt.staging.1111.domain.com
      ---
      apiVersion: ambassador/v1
      kind: TLSContext
      ambassador_id: staging
      name: alpha-httpbin_mapping
      hosts:
      - alpha-httpbin-test.mgmt.staging.1111.domain.com
      secret: alpha-httpbin-certificate.1111-nonproduction-staging
  creationTimestamp: "2019-09-30T16:15:52Z"
  name: alpha-httpbin
  namespace: 1111-nonproduction-staging

Here is what the Envoy /stats endpoint returned when it happened:

Ambassador check_alive: ambassador seems to have died (23 seconds)

Envoy stats at the same time:

envoy_stats.txt

Thanks

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

klambertbnc commented 4 years ago

Still waiting, don't close.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.