emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

Can't achieve graceful shutdown of pods using Ambassador+ConsulConnect #3264

Closed: etotten closed this issue 3 years ago

etotten commented 3 years ago

Describe the bug
I have tried various ways of achieving a fully graceful shutdown when either:

a) doing a rolling deploy (pods come down one at a time), or
b) doing a scale-down (several pods come down simultaneously)

...but I have not found a way to ensure that zero requests get dropped when pods terminate. On the flip side, pod launch (e.g. scale-up) behaves better: some combination of the health checks and startup delays makes that operation fairly graceful in my testing. Termination is the problem.

I am looking for any suggestions; any help would be hugely appreciated.

To Reproduce
This is kind of involved, but here's the setup:

Results:

Expected behavior
No errors (HTTP 503s) for requests made during the termination of a pod running in Consul Connect.

Versions (please complete the following information):

Additional context
What Kinds of Things Have I Tried:

I suspect that AES is not keeping up with the mesh changes. In looking through the AES debug logs, I find lines like this:

[2021-03-09 01:43:31.212][225][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1069] membership update for TLS cluster cluster_bedrock_brts_sidecar_proxy_otls_-0 added 1 removed 0

...and if I tail | grep for those lines while the deploy or scale-down is happening, I see that they can arrive many seconds later. I can't say that the 503s stop right when these log lines appear (which would be real evidence), but I am admittedly grasping at straws a bit.

WHAT ABOUT AMBASSADOR_FAST_RECONFIGURE?

Logs of non-functional AES 1.12.0 with AMBASSADOR_FAST_RECONFIGURE: true:

time="2021-03-09 19:32:04" level=info msg="Started Ambassador" func=github.com/datawire/ambassador/cmd/entrypoint.Main file="github.com/datawire/ambassador/cmd/entrypoint/entrypoint.go:91" CMD=entrypoint PID=1
time="2021-03-09 19:32:04" level=info msg="AMBASSADOR_CLUSTER_ID=594b244d-c537-5190-9bcf-2a82b00c0b78" func=github.com/datawire/ambassador/cmd/entrypoint.Main file="github.com/datawire/ambassador/cmd/entrypoint/entrypoint.go:105" CMD=entrypoint PID=1
time="2021-03-09 19:32:04" level=info msg="[pid:13] started command []string{\"/ambassador/sidecars/amb-sidecar\"}" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:143" CMD=entrypoint PID=1 THREAD=/amb-sidecar
time="2021-03-09 19:32:04" level=info msg="[pid:13] stdin  < not logging input read from file /dev/stdin" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:145" CMD=entrypoint PID=1 THREAD=/amb-sidecar
time="2021-03-09 19:32:04" level=info msg="[pid:13] stdout > not logging output written to file /dev/stdout" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:148" CMD=entrypoint PID=1 THREAD=/amb-sidecar
time="2021-03-09 19:32:04" level=info msg="[pid:13] stderr > not logging output written to file /dev/stderr" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:151" CMD=entrypoint PID=1 THREAD=/amb-sidecar
time="2021-03-09 19:32:04" level=info msg="[pid:14] started command []string{\"diagd\", \"/ambassador/snapshots\", \"/ambassador/bootstrap-ads.json\", \"/ambassador/envoy/envoy.json\", \"--notices\", \"/ambassador/notices.json\", \"--port\", \"8004\", \"--kick\", \"kill -HUP 1\"}" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:143" CMD=entrypoint PID=1 THREAD=/diagd
time="2021-03-09 19:32:04" level=info msg="[pid:14] stdin  < not logging input read from file /dev/stdin" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:145" CMD=entrypoint PID=1 THREAD=/diagd
time="2021-03-09 19:32:04" level=info msg="[pid:14] stdout > not logging output written to file /dev/stdout" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:148" CMD=entrypoint PID=1 THREAD=/diagd
time="2021-03-09 19:32:04" level=info msg="[pid:14] stderr > not logging output written to file /dev/stderr" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="github.com/datawire/dlib@v1.2.0/dexec/cmd.go:151" CMD=entrypoint PID=1 THREAD=/diagd
time="2021-03-09 19:32:05" level=info msg="Ambassador Edge Stack configuration loaded" func=github.com/datawire/apro/cmd/amb-sidecar.runE file="github.com/datawire/apro/cmd/amb-sidecar/main.go:123" CMD=amb-sidecar PID=13
time="2021-03-09T19:32:05Z" level=warning msg="statsd is not in use"
time="2021-03-09 19:32:05" level=error msg="Failed to create watch on /tmp/ambassador/.config/ambassador/: Changes might require a restart: no such file or directory" func=github.com/datawire/apro/cmd/amb-sidecar.triggerOnChange file="github.com/datawire/apro/cmd/amb-sidecar/files.go:56" CMD=amb-sidecar PID=13 THREAD=/license_refresh
time="2021-03-09 19:32:05" level=error msg="0 filters configured" func="github.com/datawire/apro/cmd/amb-sidecar/filters/controller.(*Controller).Watch.func1" file="github.com/datawire/apro/cmd/amb-sidecar/filters/controller/controller.go:150" CMD=amb-sidecar PID=13 THREAD=/auth_controller
2021/03/09 19:32:06 http: proxy error: dial tcp 127.0.0.1:8004: connect: connection refused
time="2021-03-09 19:32:06" level=error msg="Bad HTTP request: status_code=502" func="github.com/datawire/apro/cmd/amb-sidecar/devportal/server.(*DevPortalHTTPClient).Get.func1" file="github.com/datawire/apro/cmd/amb-sidecar/devportal/server/http.go:70" CMD=amb-sidecar PID=13 component=devportal mhost= url="http://127.0.0.1:8877/ambassador/v0/diag/?json=true"
time="2021-03-09 19:32:06" level=error msg="HTTP error 502 from http://127.0.0.1:8877/ambassador/v0/diag/?json=true" func="github.com/datawire/apro/cmd/amb-sidecar/devportal/server.(*DevPortalHTTPClient).Get" file="github.com/datawire/apro/cmd/amb-sidecar/devportal/server/http.go:77" CMD=amb-sidecar PID=13 component=devportal mhost= url="http://127.0.0.1:8877/ambassador/v0/diag/?json=true"
I0309 19:32:06.237957       1 request.go:621] Throttling request took 1.012909724s, request: GET:https://172.20.0.1:443/apis/getambassador.io/v2/devportals?limit=500&resourceVersion=0
2021-03-09 19:32:07 diagd 1.12.0 [P14TMainThread] INFO: AMBASSADOR_FAST_RECONFIGURE enabled, initializing cache
2021-03-09 19:32:07 diagd 1.12.0 [P14TMainThread] INFO: WILL NOT update Mapping status
2021-03-09 19:32:07 diagd 1.12.0 [P14TMainThread] INFO: thread count 5, listening on 127.0.0.1:8004
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_brts-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_flask-starship-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_flask-voyager-tottene-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/sis_flask-starship-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:07" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/brss-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
2021-03-09 19:32:07 diagd 1.12.0 [P14TMainThread] INFO: Ambassador 1.12.0 booted
[2021-03-09 19:32:07 +0000] [14] [INFO] Starting gunicorn 20.0.4
[2021-03-09 19:32:07 +0000] [14] [INFO] Listening at: http://127.0.0.1:8004 (14)
[2021-03-09 19:32:07 +0000] [14] [INFO] Using worker: threads
[2021-03-09 19:32:07 +0000] [28] [INFO] Booting worker with pid: 28
2021-03-09 19:32:07 diagd 1.12.0 [P28TAEW] INFO: starting Scout checker and timer logger
2021-03-09 19:32:07 diagd 1.12.0 [P28TAEW] INFO: starting event watcher
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_flask-starship-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_flask-voyager-tottene-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/sis_flask-starship-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_brts-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
time="2021-03-09 19:32:12" level=info msg="[ERR] consul.watch: Watch (type: service) errored: Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/brss-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 20s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
etotten commented 3 years ago

Update on this:

We wanted to see whether the AMBASSADOR_FAST_RECONFIGURE variable was likely to help us, if we could get it to run. So, for testing purposes only, we managed to hack around the error mentioned above:

Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_brts-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true

...by setting a specific IP address in the Consul Resolver k8s resource instead of ${HOST_IP}. This worked, and we were able to disable AMBASSADOR_LEGACY_MODE and set AMBASSADOR_FAST_RECONFIGURE: true. This is of course not a viable work-around, but at least it allows us to try it out in a test environment with a single replica of Ambassador.
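For reference, a ConsulResolver with a hard-coded address looks roughly like this (the IP, resolver name, and datacenter below are placeholders, not our actual values):

    apiVersion: getambassador.io/v2
    kind: ConsulResolver
    metadata:
      name: consul-dc1                 # placeholder resolver name
    spec:
      # hard-coded Consul agent address used only for this experiment;
      # normally this would come from the node's ${HOST_IP}
      address: "10.0.0.12:8500"
      datacenter: dc1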

With that test environment, we combined enabling AMBASSADOR_FAST_RECONFIGURE with a preStop handler on the deployment of our service-under-test:

        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep","5"]

...and saw that AMBASSADOR_FAST_RECONFIGURE does indeed decrease the number of 503s we see when pods terminate. In many cases there are zero 503s, but sometimes 1 or 2 occur when pods go down.
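For context, here is roughly where that handler sits in the Deployment spec, alongside the termination grace period that has to cover the sleep (names and values below are illustrative, not our actual manifest):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-service            # placeholder for the service-under-test
    spec:
      selector:
        matchLabels:
          app: example-service
      template:
        metadata:
          labels:
            app: example-service
        spec:
          # must exceed the preStop sleep plus the app's normal shutdown time
          terminationGracePeriodSeconds: 30
          containers:
            - name: app
              image: example/app:latest    # placeholder image
              lifecycle:
                preStop:
                  exec:
                    # keep the pod alive briefly so Ambassador/Consul can drop
                    # the endpoint before the app receives SIGTERM
                    command: ["/bin/sleep", "5"]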

So that's a significant improvement, at least. I experimented with a retry_policy for "5xx" on Mappings that use the Consul resolver, but didn't see that push the failures to zero. I need to do more testing in that area.
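For reference, the Mapping-level retry_policy I was experimenting with looks roughly like this (the mapping name, prefix, service, and resolver below are placeholders, not our actual config):

    apiVersion: getambassador.io/v2
    kind: Mapping
    metadata:
      name: example-consul-mapping     # placeholder name
    spec:
      prefix: /example/
      # Consul Connect sidecar service registered in the mesh (placeholder name)
      service: example-service-sidecar-proxy
      resolver: consul-dc1             # the ConsulResolver sketched earlier
      tls: ambassador-consul           # TLSContext from the Consul Connect integration
      retry_policy:
        retry_on: "5xx"
        num_retries: 3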

I see that #3182 is merged and slotted for 1.13, so I will close this once 1.13 is out and confirmed to work with the Consul Resolver.

etotten commented 3 years ago

Update:

etotten commented 3 years ago

With the learnings posted above, the problems mentioned directly in this issue are mitigated. This can now be closed.

There are further issues that we see creating instability in our upstreams, but those are handled more specifically in other issues: