Closed by diranged 1 year ago
Can anyone in the Envoy team take a look at this issue?
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Not stale. Can we get any eyes on this?
This issue sounds like a duplicate of https://github.com/envoyproxy/envoy/issues/29008. @diranged, could you check whether disabling locality weighted load balancing helps in your case?
@nezdolik We don't have locality weighted load balancing turned on in this set of tests. Our Istio environment has the `localityLbSetting` turned on, but we don't have outlier detection enabled for these tests, which should mean that setting is ignored. Furthermore, in our case we aren't using any slow start config. Do you still think the issue is the same, or related?
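For concreteness, this is roughly the shape of the per-service override we could add to rule locality weighting out entirely while testing. It's a sketch only: the name, namespace, and host below are placeholders, not our actual config.

```yaml
# Sketch only: placeholder names/host, not our production config.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin-locality-off
  namespace: test
spec:
  host: httpbin.test.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
      # Explicitly disable locality weighted load balancing for this host;
      # this overrides any mesh-wide localityLbSetting for this destination.
      localityLbSetting:
        enabled: false
```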
Apologies, I was skimming through all open issues with the keywords "traffic imbalance + slow start" to post an update about a potential root cause; I should have read the report more carefully.
No worries - I definitely still think there's an interesting bug here to track down.
@diranged we also don't have outlier settings turned on, but the locality still affects the load balancing, so it might be related.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Note: I am moving https://github.com/istio/istio/issues/45013 into the Envoy project per the recommendation from the Istio team. This is a copy of the original bug I opened with Istio a while back.
Bug Description
We've been working to migrate a new workload into our Kubernetes cluster, which runs an Istio mesh. The mesh has been operating fine for most workloads at good scale (500k+ QPS across the mesh). The team migrating their workload over, though, noticed strange behavior, and we've been able to replicate it with a synthetic test.
The behavior they see is that during backend pod changes (a scale up, a rollout, or a scale down), the backing pods suddenly develop a significant traffic imbalance. One pod takes "most" of the traffic for 10-30s while the other pods' traffic rates drop significantly.
Graphing out the requests using Loki (following the `istio-proxy` access logs), we can see the behavior. In this example we are adding a new pod (the green line), but we see a massive spike on the red line and a drop in traffic to the blue line.
We spent weeks troubleshooting this before we set out to replicate it synthetically. We have a number of other high-volume workloads that do not exhibit this behavior, which led us to spend that time troubleshooting the application rather than the mesh.
The Test Setup
Service: running `httpbin` across ~3-5 pods
Client(s):
- 1000 pods running `curl` once every second
- 10 pods running `h2load` as fast as they can (generating ~4000-5000 QPS)

(Rough manifests for these client pods are sketched after Test 1 below.)

Test 1: Fast Requests, Scale down from 4 -> 3 backends
The early tests we were running were as fast as possible... hitting the `httpbin` endpoint `/status/200`, which returns immediately. We ran many scale-up and scale-down tests. Every time, the pattern looked good (we ran dozens of tests; they all looked roughly like this):
3 -> 4 pods: Here we can see the new pod come up, and the traffic transition is pretty graceful. There's a bit of a change in the total traffic volume because we were intentionally overloading the backend pods for the test.
4 -> 3 pods: This again shows a pretty clean transition.
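As referenced in the setup above, here is a rough sketch of the curl client Deployment. The image, names, namespace, and URL are illustrative placeholders rather than the exact manifests we ran; the point is just that each client loops a simple request through its istio-proxy sidecar.

```yaml
# Rough sketch of the curl clients; names, image, and URL are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: curl-client
  namespace: test
spec:
  replicas: 1000                          # the "1000 pods running curl every second"
  selector:
    matchLabels:
      app: curl-client
  template:
    metadata:
      labels:
        app: curl-client
        sidecar.istio.io/inject: "true"   # requests must flow through the istio-proxy sidecar
    spec:
      containers:
        - name: curl
          image: curlimages/curl:8.5.0
          command: ["/bin/sh", "-c"]
          args:
            - |
              # One request per second; swap the path (/status/200, /delay/.15, /delay/1)
              # depending on which test is being run.
              while true; do
                curl -s -o /dev/null http://httpbin.test.svc.cluster.local:8000/status/200
                sleep 1
              done
```

The h2load clients are the same idea with ~10 replicas, running `h2load` in a loop against the same URL instead of curl.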
Test 2: Slower Requests
The key for us to reproduce the problem was to use slower requests: the backend application in this case responds in ~150ms on average. So instead of hitting `/status/200`, we are now going to hit `/delay/.15` and run the same tests. It's worth noting that because the requests take longer, the total request volume drops significantly. That's fine; it's more reflective of our actual traffic volume.
3 -> 4 pods: We would expect a similar traffic distribution... but instead we see a significant spike on one backend, with drops on the other backends. Of course, "significant" depends on the application... our backend application actually behaves worse than this graph depicts, but httpbin is a nice fast async service... and the backend service in question is a Python service that is much slower.
5 -> 4 -> 3: When scaling down, we can see a pretty exaggerated effect as well...
5 -> 5: During a pod replacement (one of our backends is replaced), you can see a really dramatic load imbalance:
Test 3: Really Slow (1s) Requests
Just for fun, I ran the same tests against `/delay/1` to test one-second-long requests.
3 -> 4 -> 3: In this window you can see us scale from 3 to 4 and back to 3 pods... The behavior isn't quite the same as in the tests above, but again you can see pretty significant imbalances.
Additional Note: First one wins?
While digging into this, we discovered that almost all of the time, the "first" pod in the client Envoy config list is the one that gets the traffic spike. This implies to us that something is happening in the Envoy client when the list of backing pods changes: the clients "start over" in some way and connect to the first pod in the list. We tried every type of load balancing algorithm, and none of them changed the behavior in any way.
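For anyone trying to reproduce this: in an Istio mesh, swapping the algorithm is a one-line change on the service's DestinationRule, roughly like the sketch below (host and name are placeholders). The spike looked the same to us regardless of which simple policy was selected.

```yaml
# Sketch only: placeholder host/name. Changing `simple` selects the LB algorithm.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin-lb
  namespace: test
spec:
  host: httpbin.test.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST   # e.g. LEAST_REQUEST, ROUND_ROBIN, RANDOM; none changed the behavior for us
```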
Version
client.yaml
Affected product area