notjames opened 2 years ago
Describe the bug
Observation was that over a period of roughly 1.5 weeks after a code release upgrading Ambassador to 1.13.3, one arbitrary ambassador pod (1 of 3 pods) would fail to send traffic to the same upstream service. Once failures started, the errant ambassador pod logged `503 UH`
errors (Envoy's `UH` response flag means no healthy upstream hosts in the cluster) for attempts to reach that upstream service. The upstream service itself was never problematic. During troubleshooting and efforts to find a root cause, we'd delete the errant ambassador pod and its replacement would work fine for an arbitrary period of time (hours, usually 5 or fewer). However, we observed at least once that the replacement pod(s) would, in time, begin to behave as their predecessors did. Yesterday we decided to enable `AES_LOGGING: debug` in the ambassador deployment (sketched below), which restarted all the pods, and since then we've not seen this issue return 🤞🏼.

To Reproduce
So far, we are not able to provide this information.
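For reference, a minimal sketch of how the debug flag was flipped; the namespace and Deployment names here are placeholders for a standard install and may differ from yours:

```shell
# Hypothetical names: adjust "ambassador"/"ambassador" to your namespace/Deployment.
# Setting the env var on the Deployment triggers a rolling restart of all pods.
kubectl -n ambassador set env deployment/ambassador AES_LOGGING=debug

# Wait for the rollout to finish before watching for the 503 UH pattern again.
kubectl -n ambassador rollout status deployment/ambassador
```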
Expected behavior
We expected Ambassador to properly proxy traffic to the healthy upstream service.
Versions (please complete the following information):
* Ambassador: 1.13.3
* Kubernetes environment: EKS
* Version: 1.21
Additional context
See the attached logfile. Unfortunately, this is all I was able to recover, as I did not save all log data during the investigation. The logfile illustrates the observation, but it is not raw data, so forensically it may not be very helpful.

scrubbed-503-output.txt
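If it helps to eyeball the pattern, the `UH` lines can be pulled out of the attachment with something like the following (the exact line format depends on the configured access-log format, so treat this as a sketch):

```shell
# Count and sample the "no healthy upstream" (503 UH) entries in the attached log.
grep -c '503 UH' scrubbed-503-output.txt
grep '503 UH' scrubbed-503-output.txt | head -n 5
```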
FTR, yesterday the issue cropped up again, so it's not solved yet.
Cool, so after our meeting this morning, I have a few things to report...
The `503 UH` issue is addressed in version 2.4.0. We'd appreciate your feedback on whether it resolves the problem.
@cindymullins-dw let's tie this ticket into the other work you're doing for bluescape. Since I'm not able to test this personally and there's so much other work ongoing for us, I think it's OK to close this ticket.