emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

Can't achieve graceful shutdown of pods using Ambassador+ConsulConnect #3264

Closed: etotten closed this issue 3 years ago

etotten commented 3 years ago

Describe the bug
I have tried various ways of achieving a fully graceful shutdown when either:

a) doing a rolling deploy (pods come down one at a time), or
b) doing a scale-down (several pods come down simultaneously)

...but I have not found a way to ensure that zero requests get dropped when pods terminate. On the flip side, pod launch (e.g. scale-up) behaves better: some combination of the health checks and startup delays makes that operation fairly graceful in my testing. Termination is the problem.

I am looking for any suggestions; any help would be hugely appreciated.

To Reproduce
This is kind of involved, but here's the setup:

Results:

Expected behavior
No errors (HTTP 503s) for requests made during the termination of a pod running in Consul Connect.

Versions (please complete the following information):

Additional context
What Kinds of Things Have I Tried:

I suspect that AES is not keeping up with the mesh changes. In looking through the AES debug logs, I find lines like this:

[2021-03-09 01:43:31.212][225][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1069] membership update for TLS cluster cluster_bedrock_brts_sidecar_proxy_otls_-0 added 1 removed 0

...and if I tail | grep for those lines while the deploy or scale-down is happening, I see that they can arrive many seconds later. I can't say that the 503s stop right when these log lines appear (which would be real evidence), but I am admittedly grasping at straws a bit.

WHAT ABOUT AMBASSADOR_FAST_RECONFIGURE?

Logs of non-functional AES 1.12.0 with AMBASSADOR_FAST_RECONFIGURE: true:

time="2021-03-09 19:32:04" level=info msg="Started Ambassador" func=github.com/datawire/ambassador/cmd/entrypoint.Main file="github.com/datawire/ambassador/cmd/entrypoint/entrypoint.go:91" CMD=entrypoint PID=1
time="2021-03-09 19:32:04" level=info msg="AMBASSADOR_CLUSTER_ID=594b244d-c537-5190-9bcf-2a82b00c0b78" func=github.com/datawire/ambassador/cmd/entrypoint.Main file="github.com/datawire/ambassador/cmd/entrypoint/entrypoint.go:105" CMD=entrypoint PID=1
time="2021-03-09 19:32:04" level=info msg="[pid:13] started command []string{\"/ambassador/sidecars/amb-sidecar\"}" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:143" CMD=entrypoint PID=1 THREAD=/amb-sidecar
time="2021-03-09 19:32:04" level=info msg="[pid:13] stdin  < not logging input read from file /dev/stdin" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:145" CMD=entrypoint PID=1 THREAD=/amb-sidecar
time="2021-03-09 19:32:04" level=info msg="[pid:13] stdout > not logging output written to file /dev/stdout" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:148" CMD=entrypoint PID=1 THREAD=/amb-sidecar
time="2021-03-09 19:32:04" level=info msg="[pid:13] stderr > not logging output written to file /dev/stderr" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:151" CMD=entrypoint PID=1 THREAD=/amb-sidecar
time="2021-03-09 19:32:04" level=info msg="[pid:14] started command []string{\"diagd\", \"/ambassador/snapshots\", \"/ambassador/bootstrap-ads.json\", \"/ambassador/envoy/envoy.json\", \"--notices\", \"/ambassador/notices.json\", \"--port\", \"8004\", \"--kick\", \"kill -HUP 1\"}" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:143" CMD=entrypoint PID=1 THREAD=/diagd
time="2021-03-09 19:32:04" level=info msg="[pid:14] stdin  < not logging input read from file /dev/stdin" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:145" CMD=entrypoint PID=1 THREAD=/diagd
time="2021-03-09 19:32:04" level=info msg="[pid:14] stdout > not logging output written to file /dev/stdout" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:148" CMD=entrypoint PID=1 THREAD=/diagd
time="2021-03-09 19:32:04" level=info msg="[pid:14] stderr > not logging output written to file /dev/stderr" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:151" CMD=entrypoint PID=1 THREAD=/diagd
time="2021-03-09 19:32:05" level=info msg="Ambassador Edge Stack configuration loaded" func=github.com/datawire/apro/cmd/amb-sidecar.runE file="github.com/datawire/apro/cmd/amb-sidecar/main.go:123" CMD=amb-sidecar PID=13
time="2021-03-09T19:32:05Z" level=warning msg="statsd is not in use"
time="2021-03-09 19:32:05" level=error msg="Failed to create watch on /tmp/ambassador/.config/ambassador/: Changes might require a restart: no such file or directory" func=github.com/datawire/apro/cmd/amb-sidecar.triggerOnChange file="github.com/datawire/apro/cmd/amb-sidecar/files.go:56" CMD=amb-sidecar PID=13 THREAD=/license_refresh
time="2021-03-09 19:32:05" level=error msg="0 filters configured" func="github.com/datawire/apro/cmd/amb-sidecar/filters/controller.(*Controller).Watch.func1" file="github.com/datawire/apro/cmd/amb-sidecar/filters/controller/controller.go:150" CMD=amb-sidecar PID=13 THREAD=/auth_controller
2021/03/09 19:32:06 http: proxy error: dial tcp 127.0.0.1:8004: connect: connection refused
time="2021-03-09 19:32:06" level=error msg="Bad HTTP request: status_code=502" func="github.com/datawire/apro/cmd/amb-sidecar/devportal/server.(*DevPortalHTTPClient).Get.func1" file="github.com/datawire/apro/cmd/amb-sidecar/devportal/server/http.go:70" CMD=amb-sidecar PID=13 component=devportal mhost= url="http://127.0.0.1:8877/ambassador/v0/diag/?json=true"
time="2021-03-09 19:32:06" level=error msg="HTTP error 502 from http://127.0.0.1:8877/ambassador/v0/diag/?json=true" func="github.com/datawire/apro/cmd/amb-sidecar/devportal/server.(*DevPortalHTTPClient).Get" file="github.com/datawire/apro/cmd/amb-sidecar/devportal/server/http.go:77" CMD=amb-sidecar PID=13 component=devportal mhost= url="http://127.0.0.1:8877/ambassador/v0/diag/?json=true"
I0309 19:32:06.237957       1 request.go:621] Throttling request took 1.012909724s, request: GET:https://172.20.0.1:443/apis/getambassador.io/v2/devportals?limit=500&resourceVersion=0
2021-03-09 19:32:07 diagd 1.12.0 [P14TMainThread] INFO: AMBASSADOR_FAST_RECONFIGURE enabled, initializing cache
2021-03-09 19:32:07 diagd 1.12.0 [P14TMainThread] INFO: WILL NOT update Mapping status
2021-03-09 19:32:07 diagd 1.12.0 [P14TMainThread] INFO: thread count 5, listening on 127.0.0.1:8004
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_brts-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_flask-starship-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_flask-voyager-tottene-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/sis_flask-starship-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/brss-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
2021-03-09 19:32:07 diagd 1.12.0 [P14TMainThread] INFO: Ambassador 1.12.0 booted
[2021-03-09 19:32:07 +0000] [14] [INFO] Starting gunicorn 20.0.4
[2021-03-09 19:32:07 +0000] [14] [INFO] Listening at: http://127.0.0.1:8004 (14)
[2021-03-09 19:32:07 +0000] [14] [INFO] Using worker: threads
[2021-03-09 19:32:07 +0000] [28] [INFO] Booting worker with pid: 28
2021-03-09 19:32:07 diagd 1.12.0 [P28TAEW] INFO: starting Scout checker and timer logger
2021-03-09 19:32:07 diagd 1.12.0 [P28TAEW] INFO: starting event watcher
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_flask-starship-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_flask-voyager-tottene-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/sis_flask-starship-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_brts-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/brss-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
etotten commented 3 years ago

Update on this:

We wanted to see whether the AMBASSADOR_FAST_RECONFIGURE variable was likely to help us, if we could get it to run. So, for testing purposes only, we managed to hack around the error mentioned above:

Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_brts-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true

...by setting a specific IP address in the Consul Resolver k8s resource instead of ${HOST_IP}. This worked, and we were able to disable AMBASSADOR_LEGACY_MODE and set AMBASSADOR_FAST_RECONFIGURE: true. This is of course not a viable work-around, but at least it allows us to try it out in a test environment with a single replica of Ambassador.
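For reference, a ConsulResolver with a hard-coded address looks roughly like this (the IP, resolver name, and datacenter below are placeholders, not our actual values):

    apiVersion: getambassador.io/v2
    kind: ConsulResolver
    metadata:
      name: consul-dc1                 # placeholder resolver name
    spec:
      # hard-coded Consul agent address used only for this experiment;
      # normally this would come from the node's ${HOST_IP}
      address: "10.0.0.12:8500"
      datacenter: dc1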

With that test environment, we combined enabling AMBASSADOR_FAST_RECONFIGURE with a preStop handler on the deployment of our service-under-test:

        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep","5"]

...and saw that AMBASSADOR_FAST_RECONFIGURE does indeed decrease the number of 503s we see when pods terminate. In many cases there are zero 503s, but sometimes 1 or 2 occur when pods go down.
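For context, here is roughly where that handler sits in the Deployment spec, alongside the termination grace period that has to cover the sleep (names and values below are illustrative, not our actual manifest):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-service            # placeholder for the service-under-test
    spec:
      selector:
        matchLabels:
          app: example-service
      template:
        metadata:
          labels:
            app: example-service
        spec:
          # must exceed the preStop sleep plus the app's normal shutdown time
          terminationGracePeriodSeconds: 30
          containers:
            - name: app
              image: example/app:latest    # placeholder image
              lifecycle:
                preStop:
                  exec:
                    # keep the pod alive briefly so Ambassador/Consul can drop
                    # the endpoint before the app receives SIGTERM
                    command: ["/bin/sleep", "5"]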

So that's a significant improvement, at least. I experimented with a retry_policy for "5xx" on Mappings that use the Consul resolver, but didn't see that push the failures to zero. I need to do more testing in that area.
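For reference, the Mapping-level retry_policy I was experimenting with looks roughly like this (the mapping name, prefix, service, and resolver below are placeholders, not our actual config):

    apiVersion: getambassador.io/v2
    kind: Mapping
    metadata:
      name: example-consul-mapping     # placeholder name
    spec:
      prefix: /example/
      # Consul Connect sidecar service registered in the mesh (placeholder name)
      service: example-service-sidecar-proxy
      resolver: consul-dc1             # the ConsulResolver sketched earlier
      tls: ambassador-consul           # TLSContext from the Consul Connect integration
      retry_policy:
        retry_on: "5xx"
        num_retries: 3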

I see that #3182 is merged and slotted for 1.13, so I will close this once 1.13 is out and confirmed to work with the Consul Resolver.

etotten commented 3 years ago

Update:

etotten commented 3 years ago

With the learnings posted above, the problems mentioned directly in this issue are mitigated. This can now be closed.

There are further issues that we see creating instability in our upstreams, but those are handled more specifically in other issues: