Closed etotten closed 3 years ago
Update on this:
We wanted to see whether the AMBASSADOR_FAST_RECONFIGURE variable was likely to help us, if we could get it to run. So, for testing purposes only, we managed to hack around the error mentioned above:

```
Get \"http://$%7BHOST_IP%7D:8500/v1/health/service/bedrock_brts-sidecar-proxy?passing=1\": dial tcp: lookup ${HOST_IP}: no such host, retry in 5s" CMD=entrypoint PID=1 oops-i-did-not-pass-context-correctly=true
```

...by setting a specific IP address in the Consul Resolver k8s resource instead of `${HOST_IP}`. This worked, and we were able to disable AMBASSADOR_LEGACY_MODE and set AMBASSADOR_FAST_RECONFIGURE: true. This is of course not a viable workaround, but at least it allows us to try it out in a test environment with a single replica of Ambassador.
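For reference, the hack amounts to putting a literal address in the ConsulResolver resource where the interpolated host IP would normally go. A minimal sketch (the resource name, datacenter, and the `10.0.0.12` address are placeholders, not our real values):

```yaml
apiVersion: getambassador.io/v2
kind: ConsulResolver
metadata:
  name: consul-resolver
spec:
  # Hardcoded node IP for testing only; normally this would be ${HOST_IP}:8500,
  # which is what fails to interpolate under AMBASSADOR_FAST_RECONFIGURE.
  address: "10.0.0.12:8500"
  datacenter: dc1
```

This only works with a single, known Consul agent address, which is why it is not viable outside a one-replica test environment.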
With that test environment, we were able to see that enabling AMBASSADOR_FAST_RECONFIGURE, together with a preStop handler on the deployment for our service-under-test:

```yaml
lifecycle:
  preStop:
    exec:
      command: ["/bin/sleep", "5"]
```

...does indeed decrease the number of 503's we see when pods terminate. In many cases there are zero 503's, but sometimes 1 or 2 when pods go down.
So that's a significant improvement at least. I experimented with a retry_policy for "5xx" on Mappings that use the Consul Resolver, but that didn't push the failures to zero. I need to do more testing in that area.
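The retry_policy experiment looked roughly like this (a sketch, not our exact manifest; the Mapping name, prefix, and Consul service name are placeholders, and num_retries is just one value to try):

```yaml
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: brts-mapping
spec:
  prefix: /brts/
  # Service as registered in Consul, resolved via the ConsulResolver
  service: bedrock_brts-sidecar-proxy
  resolver: consul-resolver
  load_balancer:
    policy: round_robin
  # Retry upstream 5xx responses instead of surfacing them to the client
  retry_policy:
    retry_on: "5xx"
    num_retries: 3
```

The idea is that a request hitting a just-terminated pod gets retried against a surviving endpoint, but as noted above this did not eliminate the failures in my testing.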
I see that #3182 is merged and slotted for 1.13, so I will close this once 1.13 is out and confirmed to work with the Consul Resolver.
Update:
We also saw log lines like:

```
Memory Usage: throttling reconfig v14739 due to constrained memory with 30 stale reconfigs (30 max)
```

...which indicate that Ambassador is throttling reconfiguration and not applying updates to the clusters, so we would see lots of 503's from trying to send requests to pods that weren't there any more.

With the learnings posted above, the problems mentioned directly in this issue are mitigated. This can now be closed.
There are further issues that we see creating instability in our upstreams, but those are handled more specifically in other issues:
**Describe the bug**
I tried various ways of getting to the point of full gracefulness when either:

a) doing a rolling deploy (and thus pods come down 1 at a time)
b) doing a scale down (and thus a bunch of pods come down simultaneously)

...but I have not found a way to make it so that 0 requests get dropped when pods terminate. On the flip side, pod launch (e.g. scale up) does better: some combination of the health checks and perhaps delays makes that operation work fairly gracefully in my testing. Termination is the problem.
I am looking for any suggestions; any help would be hugely appreciated.
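For context, the termination-related knobs I have been experimenting with live on the pod spec of the service-under-test. A sketch of the relevant fragment (the container name and durations are illustrative, not tuned values):

```yaml
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: app
      lifecycle:
        preStop:
          # Keep the pod serving briefly after termination starts, giving
          # Ambassador/Consul time to observe the deregistration before
          # the container receives SIGTERM.
          exec:
            command: ["/bin/sleep", "5"]
```

The preStop sleep only helps if the proxy's endpoint view actually updates within that window, which is where the reconfiguration lag below comes in.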
**To Reproduce**
This is kind of involved, but here's the setup:

`${HOST_IP}:8500`
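For completeness, `${HOST_IP}` is injected the standard Kubernetes way, via the downward API in the pod spec (sketch):

```yaml
env:
  - name: HOST_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
```

Each pod then reaches the Consul agent on its own node at `${HOST_IP}:8500`.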
Results:
**Expected behavior**
To not have errors (HTTP 503's) for requests during the termination of a pod running in Consul Connect
**Versions (please complete the following information):**
**Additional context**
What Kinds of Things Have I Tried:
I suspect that AES is not keeping up with the mesh changes. In looking through the AES debug logs, I find lines like this:
...and if I tail | grep for those while the deploy or scale down is happening, I see that these logs can come many seconds later. I can't say that the 503's stop happening right when these logs come (which would be evidence), but I'm of course grasping at straws a bit.
WHAT ABOUT AMBASSADOR_FAST_RECONFIGURE?
Logs of non-functional AES 1.12.0 w/ AMBASSADOR_FAST_RECONFIGURE: true: