emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

Single Ambassador pod throws `503 UH` errors after arbitrary runtime proxying to service (K8s replicaset) #4284

Open notjames opened 2 years ago

notjames commented 2 years ago

**Describe the bug**

Over a period of roughly 1.5 weeks after a code release that upgraded Ambassador to 1.13.3, we observed that one arbitrary Ambassador pod (1 of 3 pods) would fail to send traffic to the same upstream service. Once the failures started, the errant Ambassador pod logged `503 UH` errors (Envoy's "no healthy upstream" response flag) for attempts to reach that upstream service.

The upstream service itself was never problematic. During troubleshooting and efforts to find a root cause, we'd delete the errant Ambassador pod, and its replacement would work fine for an arbitrary period of time (hours, usually 5 or fewer). However, we observed at least once that a replacement pod would, in time, begin to behave as its predecessor did. Yesterday we decided to enable `AES_LOGGING: debug` in the Ambassador deployment, which restarted all the pods, and since then we've not seen this issue return 🤞🏼.
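
Roughly, the change looked like the fragment below; only the relevant fields are shown, and the deployment name, namespace, and container name are placeholders rather than our exact manifest.

```yaml
# Sketch only: names are placeholders, and everything except the env entry is trimmed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ambassador          # placeholder deployment name
  namespace: ambassador     # placeholder namespace
spec:
  template:
    spec:
      containers:
        - name: ambassador  # placeholder container name
          env:
            - name: AES_LOGGING   # the variable mentioned above
              value: "debug"
```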

**To Reproduce**

So far, we are not able to provide this information.

**Expected behavior**

We expected Ambassador to properly proxy traffic to the healthy upstream service.

**Versions**

* Ambassador: 1.13.3
* Kubernetes environment: EKS
* Kubernetes version: 1.21

**Additional context**

See the attached logfile. Unfortunately, this is all I was able to recover, as I did not save all of the log data during the investigation. The logfile illustrates the observation, but it is not raw data, so forensically it may not be very helpful.

scrubbed-503-output.txt

FTR, yesterday the issue cropped up again, so it's not solved yet.

notjames commented 2 years ago

Cool, so after our meeting this morning, I have a few things to report...

  1. We aren't using Ambassador Cloud, nor can we. As I understood the documentation provided this morning, the devportal is not enabled without Ambassador Cloud. Please correct me if I'm wrong.
  2. I've enabled Envoy debug logging and will be inspecting those logs shortly.
  3. The curl tests are interesting. The upstream service has two ports open; one of them returns a 503 and the other returns a 200. More to come on this as I jump into that rabbit hole. (See the sketch after this list.)
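
For anyone following along, the route from Ambassador to this upstream is declared by a Mapping shaped roughly like the sketch below. The name, prefix, service, and port are hypothetical placeholders; the relevant detail is that the `service` field (including its explicit port) decides which of the two upstream ports Envoy actually routes to.

```yaml
# Hypothetical Mapping for illustration only -- name, prefix, service, and port
# are placeholders, not our real configuration.
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: example-upstream
  namespace: default
spec:
  prefix: /example/
  service: example-upstream.default:8080  # the port here selects which upstream port receives traffic
```
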
cindymullins-dw commented 2 years ago

The 503 UH issue is addressed in version 2.4.0. We'd appreciate your feedback on whether upgrading resolves the problem.

notjames commented 1 year ago

@cindymullins-dw let's tie this ticket into the other work you're doing for Bluescape. Since I'm not able to test this personally, and since there's so much other ongoing work for us, I think it's OK to close this ticket.