envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Peak EWMA load balancing #20907

Open mbyio opened 2 years ago

mbyio commented 2 years ago

Title: I would like Envoy to support Peak EWMA load balancing, in addition to the existing load balancing algorithms.

Description:

Please correct me if this is already supported. :)

We often want least request/least loaded load balancing, which Envoy supports. But I don't believe Envoy has a way to do load balancing that takes the latency of the targets into account. Peak EWMA load balancing is one algorithm that automatically prefers sending requests to targets with lower latency. This can sometimes reduce p99 latency without requiring any other changes, and it has other benefits, such as automatically preferring targets closer to the source (because a lower network round-trip time means lower observed latency).

In my specific use case, I have an application distributed across zones; sometimes traffic has to go through a load balancer to cross from one zone to another, and sometimes it doesn't, so the various latencies between points vary widely. I'd like the load balancer to handle this for me automatically.

Relevant Links:

https://linkerd.io/2016/03/16/beyond-round-robin-load-balancing-for-latency/
https://servd.host/blog/intelligent-load-balancing
https://faun.pub/adaptive-load-balancing-algorithm-and-implementation-6f13ccb61bea
https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/lua/balancer/ewma.lua
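
For concreteness, the update rule I have in mind looks roughly like this, based on my reading of the Finagle/linkerd-style Peak EWMA (a sketch only; none of these names are Envoy code):

```cpp
#include <chrono>
#include <cmath>
#include <cstdint>
#include <iostream>

// Sketch of a "peak EWMA" latency estimate for one upstream host. Samples
// above the current estimate are taken immediately (the "peak"); lower
// samples decay the estimate exponentially over a time window.
class PeakEwma {
  using Clock = std::chrono::steady_clock;

public:
  explicit PeakEwma(std::chrono::milliseconds decay_window)
      : decay_window_(decay_window), last_update_(Clock::now()) {}

  void observe(double latency_ms) {
    const Clock::time_point now = Clock::now();
    const double dt_ms =
        std::chrono::duration<double, std::milli>(now - last_update_).count();
    last_update_ = now;
    if (latency_ms > estimate_ms_) {
      estimate_ms_ = latency_ms;  // jump straight to the observed peak
    } else {
      // Decay exponentially towards the new, lower sample.
      const double alpha = std::exp(-dt_ms / decay_window_.count());
      estimate_ms_ = estimate_ms_ * alpha + latency_ms * (1.0 - alpha);
    }
  }

  // Cost used by the picker: estimated latency scaled by outstanding requests.
  double cost(uint64_t active_requests) const {
    return estimate_ms_ * static_cast<double>(active_requests + 1);
  }

private:
  std::chrono::milliseconds decay_window_;
  Clock::time_point last_update_;
  double estimate_ms_{0.0};
};

int main() {
  PeakEwma host{std::chrono::milliseconds(10000)};
  host.observe(20.0);  // estimate jumps to 20
  host.observe(35.0);  // a latency spike is absorbed immediately
  host.observe(22.0);  // lower samples only pull the estimate down slowly
  std::cout << "cost with 2 requests in flight: " << host.cost(2) << "\n";
}
```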

mattklein123 commented 2 years ago

This has come up before. Would be happy to see this implemented. This will be a little bit trickier than a simple load balancer, as there will need to be feedback from other parts of the system that drive it, so it might need to be a load balancer and filter pair or something like that.

mbyio commented 2 years ago

This will be a little bit trickier than a simple load balancer, as there will need to be feedback from other parts of the system that drive it, so it might need to be a load balancer and filter pair or something like that.

Is that because it would need information about the latency?

daixiang0 commented 2 years ago

This will be a little bit trickier than a simple load balancer, as there will need to be feedback from other parts of the system that drive it, so it might need to be a load balancer and filter pair or something like that.

Is that because it would need information about the latency?

I think so, since it needs the response time of each worker, computes a moving average, and then uses that value as an inverse weighting when deciding which instance to send traffic to. I am not sure whether we can get it directly from stats (metrics), or whether we would need to record values per thread and calculate them dynamically to get the value at dispatch time.

mattklein123 commented 2 years ago

Yes, we will need to keep rolling per-host latencies. This is certainly possible but is not a beginner item. Once we have the latencies, the rest is fairly straightforward.

Note also that there is an open issue on latency-based outlier detection, so the same latency tracking system would feed that as well.

mbyio commented 2 years ago

I see, thank you for the information! I was hoping this would be something I could take on, but this is sounding like it might be too difficult, given that it involves multiple systems.

jtway commented 2 years ago

Yes we will need to keep rolling per host latencies. This is certainly possible but is not a beginner item. Once we have the latencies the rest is fairly straightforward.

Note also that there is an open issue on latency based outlier detection so the same latency tracking system would feed that also.

I had seen temporal latency mentioned in the outlier detection docs, but had yet to see corresponding code. Guess I know why now. We're hoping to use latency in conjunction with our custom consistent hashing load balancing policy by marking latency-based outliers as degraded, so I'll definitely be keeping an eye on this.

mattklein123 commented 2 years ago

Here is the issue tracking latency-based outlier detection: https://github.com/envoyproxy/envoy/issues/288.

All of the pieces are there in the code base to do this at this point, including built-in histograms. They just need to be put together in a way that makes sense. Potentially the best way to do this is to directly add the ability for the outlier detection system to track latency on a per-host basis (configurable for perf reasons). Latency would be fed in by the router. At that point we could use it for direct outlier detection and also build an LB that consults this data.
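
As a rough illustration of that split (hypothetical names only, not actual Envoy interfaces), the shape could be something like:

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical sketch of the split described above: the router records a
// latency sample per upstream host, and both outlier detection (ejection
// decisions) and a latency-aware LB read the same rolling estimates.
class HostLatencyTracker {
public:
  // Called by the router when an upstream response (or timeout) completes.
  // A real version would feed a rolling histogram / EWMA rather than
  // overwrite the last sample.
  void recordLatency(const std::string& host, std::chrono::milliseconds rtt) {
    estimates_[host] = rtt;
  }

  // Consulted by outlier detection and by the load balancer alike.
  std::chrono::milliseconds estimate(const std::string& host) const {
    const auto it = estimates_.find(host);
    return it != estimates_.end() ? it->second : std::chrono::milliseconds(0);
  }

private:
  std::unordered_map<std::string, std::chrono::milliseconds> estimates_;
};

int main() {
  HostLatencyTracker tracker;
  tracker.recordLatency("10.0.0.1:8083", std::chrono::milliseconds(25));
  std::cout << tracker.estimate("10.0.0.1:8083").count() << "ms\n";
}
```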

jizhuozhi commented 1 year ago

One point worth paying attention to:

When implementing latency-based load balancing (such as Peak EWMA), a server may return an error quickly (for example, because it has lost its network connection to MySQL or Redis), resulting in a fast response time even though that server is in fact the worst choice.

Therefore, it is worth considering whether the failure rate (5xx) should also be used as a factor in the calculation.

liorfranko commented 8 months ago

Hi

Is this currently under development? We have large services with a mix of slow and fast pods: the slower pods are deployed on slow or busy instances, and the fast pods on new-generation servers.

We migrated from Finagle, which used Power of Two Choices (P2C) + Peak EWMA, to Envoy with Istio. Because of how Envoy performs load balancing (without Peak EWMA), since the migration we've had a split in CPU and latency between the slow and the fast pods.

We're considering moving back to Finagle if we can't find another solution. Even a workaround using an EnvoyFilter that we manage with an operator might be fine for us, but we couldn't make it work.

jizhuozhi commented 5 months ago

Hello, I want to try to implement the Peak EWMA algorithm in Envoy. I have implemented an improved Peak EWMA in mosn (https://github.com/mosn/mosn/issues/2252) and it performs well.

But my main language is not C++, so this may take a long time. Thanks.

tonya11en commented 5 months ago

I'm all for introducing a more sophisticated load balancing technique, but we should really understand why what we have today does not address your specific use case. Before getting into the weeds of the implementation in https://github.com/envoyproxy/envoy/pull/32942, can someone shed light on why the current LEAST_REQUEST load balancer doesn't work for you? Why is the active request count not a good signal, and why is EWMA preferred?

I looked at the Linkerd article, but there is no investigation or discussion of why EWMA performs better than the "least loaded" algorithm. All it does is show a graph; it's unclear what the selection mechanism is (P2C, etc.) once they derive the endpoint weights, or how they derive those weights in the first place. The servd article gives even less information.

This matters because Envoy has things like the active_request_bias in the LeastRequestLbConfig, which changes the behavior of the load balancer in ways that affect observed latencies. Are we sure that this knob won't give you the desired behavior and that it performs worse than the proposed EWMA approach?
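
For reference, the scaling I'm referring to is the one in the least request docs; roughly (a sketch, with illustrative numbers):

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>

// Effective weight used by Envoy's least request LB when host weights are not
// all equal (see the LeastRequestLbConfig docs):
//   weight = load_balancing_weight / (active_requests + 1)^active_request_bias
// active_request_bias defaults to 1.0; 0.0 behaves like weighted round robin,
// and larger values penalize busy hosts more aggressively.
double effectiveWeight(double load_balancing_weight, uint64_t active_requests,
                       double active_request_bias) {
  return load_balancing_weight /
         std::pow(static_cast<double>(active_requests) + 1.0, active_request_bias);
}

int main() {
  // Two hosts with equal configured weight but different in-flight counts.
  std::cout << effectiveWeight(1.0, 3, 1.0) << "\n";  // 0.25
  std::cout << effectiveWeight(1.0, 4, 1.0) << "\n";  // 0.2
  std::cout << effectiveWeight(1.0, 4, 2.0) << "\n";  // 0.04, stronger bias
}
```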

I'd prefer a more rigorous discussion about the merits of EWMA, the implications of relying on observed latencies, and the behavior of the load balancer under some common overload scenarios. If there's a more in-depth blog post or paper you can drop here, that would be great too.

liorfranko commented 5 months ago

We use Envoy as a sidecar running on Kubernetes with the Istio service mesh. We have services with ~500 pods sending requests to other services with ~500 pods. It is a low-latency environment. Istio sets the weight of each endpoint to 1.

We run on varied hardware, and some endpoints are slower than others. We noticed that Envoy's LEAST_REQUEST algorithm doesn't work well when some pods are slower than others. The reason is that when the source pod chooses a destination pod, the active_request queues are mostly empty or at a size of 1 or 2, so the algorithm effectively behaves like round robin and splits the requests evenly across all pods.

We see that all the pods in the service run at the same RPS but at very different response times.

We migrated those services from running outside Kubernetes with Finagle + Peak EWMA, and after the migration we noticed this performance degradation.

tonya11en commented 5 months ago

The reason is that when the source pod chooses a destination pod, the active_request queues are mostly empty or at a size of 1 or 2, so the algorithm effectively behaves like round robin and splits the requests evenly across all pods.

How exactly are you measuring this? Is it expected that you have no in-flight requests?

If the active request counts are ever non-zero, there should be a difference in the effective weights of those endpoints. If not, then this is a deviation from expected behavior that we need to understand.

We see that all the pods in the service run at the same RPS but at very different response times.

This is surprising to me. I would expect this to be reflected in the number of outstanding requests to each pod, since the average size of the [effective] active request queues would be a function of the RPS and request latencies. Can you share more details? Specifically, the actual RPS and latency numbers. It would be great to also see the exact distribution of RPS across your pods, so we can better understand this situation.

liorfranko commented 5 months ago

How exactly are you measuring this? Is it expected that you have no in-flight requests?

Yes, we're running low-latency services (up to 30ms response time).

Here is a graph showing the RPS per pod, about ~135. Query: sum(irate(istio_requests_total{reporter="destination",destination_canonical_service="$workload"})) by (pod)
[screenshot: RPS per pod, 2024-03-27]

Here is a graph showing the response time per pod. Query: histogram_quantile(0.5, sum(irate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="$workload"}[1m])) by (le,pod))
[screenshot: p50 response time per pod, 2024-03-27]

When looking at Envoy:

istio-proxy@******:/$ curl http://localhost:15000/clusters | grep ******.svc.cluster.local | grep 8083 | grep -E "cx_active|weight"
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1

tonya11en commented 5 months ago

This is great. Thanks for the extra information.

So, the cx_active stats are showing us the number of active connections. Envoy's LEAST_REQUEST load balancer normalizes endpoint weights by the number of active requests, so we need to understand what those values are if we want to get to the bottom of this. Unfortunately, we're only going to have visibility at the level of an Envoy cluster: the only stat exposed is upstream_rq_active, which is per-cluster.

We can try to make an educated guess via a hand-wavy application of Little's law, but that only applies to averages, not the p50s you've provided. However, the RPS to each pod looks identical and I'll assume the p50 is close enough to the average to hazard a guess.

You've got p50 latencies ranging from ~20ms to ~30ms. So:

130 rq/s * 0.02s (latency) ~= 3 rq
130 rq/s * 0.03s (latency) ~= 4.5 rq

These active request numbers are close enough that I would feel comfortable claiming it's "balanced". Also, it would be better to look at the tail latencies (p95/p90) instead of the medians, since those are the ones affected most by imbalanced load.

Let me know your thoughts.

liorfranko commented 5 months ago

Thanks for the detailed explanation! All the above makes perfect sense, and when Envoy balances according to the current upstream_rq_active, I agree that it's perfectly balanced.

Here is the P50 of a specific service:
[screenshot: P50 latency per pod, 2024-04-01]
And here is the P95 of the same service at the same time:
[screenshot: P95 latency per pod, 2024-04-01]

The fastest pod runs at a flat 40ms response time, while the slowest pod runs at 180ms+.

We think that by applying different weights based on historical response time, the spread of the tail latencies should decrease.

tonya11en commented 5 months ago

Thanks for the extra graphs, that helps in seeing the differences between backends.

We think that by applying different weights based on historical response time, the spread of the tail latencies should decrease.

My understanding of this approach is that it assumes that sending excess load to a backend causes its latency to increase, but I'm unsure how you'd know when to send less load. We'd need to know what an "unloaded" response time would be for a backend to derive a new weight from a latency measurement; otherwise we wouldn't know what to do with the information from latency measurements. How would that work exactly, and how does it compare to the current approach?

liorfranko commented 5 months ago

I think that success latency + error rate can provide a good measurement for weight.

tonya11en commented 5 months ago

I think that success latency + error rate can provide a good measurement for weight.

I'm not opposed to the idea, we just need to be more specific on what exactly this means. How does it look in practice? The level of detail I'm looking for is something like a formula for an endpoint's weight based on success latency and error rate.

We should also spell out how exactly one modifies a weight based on some observed latency, because it's not obvious from the previous comments.

liorfranko commented 5 months ago

I'm not quite sure how to provide the formula. Maybe @jizhuozhi can share his experience as he did for mosn https://github.com/mosn/mosn/issues/2252?

jizhuozhi commented 5 months ago

Hello @tonya11en and @liorfranko, my original implementation (https://github.com/mosn/mosn/pull/2253) just uses Peak EWMA the way linkerd and kubernetes-ingress do, without the "unloaded" condition, and it was optimized in https://github.com/mosn/mosn/issues/2295 (with a formula proof).

In mosn, we considered both 4xx and 5xx errors with different biases, but this is not fully verified. In the production environment, we found that with 4xx errors the weight adjustment after load balancing is consistent across all servers, so there is no need to include a separate 4xx error rate (rate limiting may need a circuit breaker to solve), so I think just the 5xx error rate is enough in Envoy.

tonya11en commented 5 months ago

@jizhuozhi can you just briefly describe how all of this works here? I don't know how mosn works and would appreciate it if you could clearly describe how this new load balancing algorithm you're proposing would work for Envoy. We won't accept a change like this without associated documentation, so you'd have to write up an explanation anyway.

I can't seem to get a clear answer from anyone on what exactly you have in mind for how this would work. For example, let's assume you have historical latency information for all endpoints in the load balancer's host set: how do you derive the endpoint weights?

Also, can you speak to why the current LEAST_REQUEST load balancer (LRLB) is insufficient here? It uses the number of in-flight requests, which is a function of request latency and request rate, so it's unclear why this doesn't work for your use case.

jizhuozhi commented 5 months ago

Hello @tonya11en, I'm sorry that I misunderstood what you meant. Let me fully explain the work I did.

First, the LEAST_REQUEST load balancer (LRLB) calculates the expected completion time of all tasks based on the number of active requests. It assumes that all servers have the same processing time, but in fact different servers have different processing times (limited by processor model, NUMA nodes, network distance, etc.). Therefore, what Peak EWMA actually does is make the LRLB's mathematical expectation more accurate by recording historical response times, and reduce the weight of historical data in the decision by using a decaying average (EWMA).

Secondly, as we discussed, unexpected "unloading" can make any intelligent load balancing strategy make wrong decisions, which should really be a separate topic: service circuit breaking. However, circuit breaking has a cost, and before it kicks in we still have to face the impact of wrong decisions. Therefore, in mosn and in the current implementation, we use a decayed failure rate to correct for the misleadingly low latency caused by failing requests (in mosn, we also support configuring a bias to strengthen error perception).

Finally, we did not use weighted load balancing in the traditional sense, but a randomized method (P2C); the weight is just a factor used to calculate weighted active requests. The intention here is the same as above: we want to avoid errors caused by overload of a particular instance. In P2C, the probability of each instance being selected is not proportional to its weight within the set, but depends on the position of its score in the set: the better the position, the greater the probability, and instances at the same position have the same probability (I am not a math major, so forgive me for not being able to give the exact probability distribution).
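
To make that concrete, the selection I'm describing is roughly the following (a sketch only, not the actual mosn code; the error bias and all values are illustrative):

```cpp
#include <cstdint>
#include <iostream>
#include <random>
#include <string>
#include <vector>

// Illustrative P2C selection over a peak-EWMA-style score; lower is better.
// The score combines the decayed latency estimate, in-flight requests, and a
// decayed failure rate.
struct HostStats {
  std::string name;
  double ewma_latency_ms;    // decayed (peak) EWMA of response time
  uint64_t active_requests;  // requests in flight from this client
  double error_rate;         // decayed 5xx rate in [0, 1]
};

double score(const HostStats& h, double error_bias) {
  const double weighted_latency = h.ewma_latency_ms * (h.active_requests + 1);
  return weighted_latency * (1.0 + error_bias * h.error_rate);
}

// Real implementations pick two *distinct* hosts; skipped here for brevity.
const HostStats& pickP2C(const std::vector<HostStats>& hosts, std::mt19937& rng,
                         double error_bias = 10.0) {
  std::uniform_int_distribution<size_t> dist(0, hosts.size() - 1);
  const HostStats& a = hosts[dist(rng)];
  const HostStats& b = hosts[dist(rng)];
  return score(a, error_bias) <= score(b, error_bias) ? a : b;
}

int main() {
  std::mt19937 rng{42};
  const std::vector<HostStats> hosts = {
      {"fast", 20.0, 2, 0.0},
      {"slow", 180.0, 2, 0.0},
      {"erroring", 5.0, 0, 0.5},  // fast only because it fails early
  };
  std::cout << "picked: " << pickP2C(hosts, rng).name << "\n";
}
```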

tonya11en commented 5 months ago

First, the LEAST_REQUEST load balancer (LRLB) calculates the expected completion time of all tasks based on the number of active requests. It assumes that all servers have the same processing time, but in fact different servers have different processing times (limited by processor model, NUMA nodes, network distance, etc.).

This isn't quite right; take a look at the LRLB's documented behavior. If all endpoint weights are identical, we use a P2C selection based on the number of active requests. If endpoint weights are heterogeneous, only then do we scale the weights based on the number of active requests, as described in the formula in the docs (we also have a bias parameter).

The LRLB doesn't exactly calculate an expected completion time of anything or assume that the nodes are homogeneous. You might be confusing this with the virtual time used in the EDF scheduler, which is used to decide which node to select next in the schedule. If a node cannot keep up with the amount of traffic being sent to it (high processing times), the average number of in-flight requests is expected to increase, which would result in a decrease in traffic sent to that node. Assuming the global requests per second is unchanging, average request latency and in-flight requests (concurrency) should be proportional. This is what we'd expect based on Little's Law.

I'd recommend taking a look at our blog post on how some of the LB algorithms work in Envoy.

Secondly, as we discussed, unexpected unloaded can make all intelligent load balancing strategies make wrong decisions, which should actually be another topic: service circuit breaker. However, service circuit breaker has cost. Before that, we still have to face the impact of wrong decisions. Therefore, in mosn and the current implementation, we use the decay failure rate to fix the error delay caused by load reduction (in In mosn, we also support configuring bias to enhance error perception).

I'm not sure what you mean by "unexpected unloaded" or service circuit breakers here. Are you referring to the same circuit breakers you can configure in the Envoy cluster config? We already have mechanisms for shedding load in the presence of errors, such as outlier detection and admission control.

Can you be explicit about what you mean when you refer to:

  - "unexpected unloaded"
  - service circuit breaker
  - error delay

I'm also a bit confused about whether you want to use endpoint latencies here or error rates. It doesn't appear like you and @liorfranko are proposing the same thing, so I'd like to request 2 things from you:

  1. Please write documentation for this proposed load balancing algorithm first as if it were to go in the Envoy docs' supported load balancer section. That's roughly the level of detail I'm expecting here.
  2. Given that the active request count and request latencies are proportional, where and how is the existing LEAST_REQUEST load balancer in Envoy falling short and how will this new algorithm address its shortcomings?

I'll conclude by saying that I'm not opposed to introducing a new load balancing algorithm here! We just need to clearly articulate how the existing set of algorithms is insufficient for the case you presented (where there is a heterogeneous set of hosts with different capabilities). I'll also mention that even if we determine this new algorithm isn't necessary, you can always write a load balancer extension and use it in your own deployments (analogous to writing a custom filter).

jizhuozhi commented 5 months ago

Hello @tonya11en, I'm glad to receive your reply. I understand the current issues and recognize that this is a necessary discussion process to ensure the healthy development of the community.

What I need to clarify is that I understand how Least Request is implemented in Envoy; what I want to express is that Least Request is usually abstracted as a service queue, and we want to find the shortest queue to ensure that the current request is completed as soon as possible (which is also how it is abstracted in nginx's blog). That is why I say Least Request calculates the expected end time assuming that all instances have the same processing time.

Can you be explicit about what you mean when you refer to:

"unexpected unloaded" service circuit breaker error delay

As for "unexpected unloaded", this is literal. For example, most modern application services need to read data from the database before performing complex calculations. However, if a single instance problem causes continuous failure to read database data, then it will no subsequent complex calculations be performed. At this time, the performance of this instance is "unexpected unloaded", and the delay of the request at this time is the "error delay". This "error delay" should not be calculated as normal service latency, and the instance with the problem should be avoided from being selected to trigger "service circuit breaker" (therefore, we provide a separate bias for the error rate).

Given that the active request count and request latencies are proportional, where and how is the existing LEAST_REQUEST load balancer in Envoy falling short and how will this new algorithm address its shortcomings?

As for the comparison between Least Request and Peak EWMA, I cannot share our online metrics, but I can describe the approximate curves of one of our online services under different load balancing algorithms when a single-instance problem occurred. The service runs on about two thousand 16c32G instances (including Intel and AMD processors of different models and production batches, hosts with different workloads, etc.), and single-instance problems cause requests to that instance to time out frequently (based on real alerts).

[image: approximate latency curves under different load balancing algorithms]

Our experience from our online metrics is this: although the number of active requests differs per instance, and some instances may have more than others, modern services execute requests in parallel across multiple processors, so as long as the number of active requests is below some constant N, latency does not increase with the number of active requests; the historical latency, however, does reflect the performance trend of the instance. Therefore, the Peak EWMA algorithm can forward requests to a lower-latency instance as much as possible without triggering queuing.

tonya11en commented 5 months ago

What I need to clarify is that I understand how Least Request is implemented in Envoy; what I want to express is that Least Request is usually abstracted as a service queue, and we want to find the shortest queue to ensure that the current request is completed as soon as possible (which is also how it is abstracted in nginx's blog). That is why I say Least Request calculates the expected end time assuming that all instances have the same processing time.

It's fine to think of the number of outstanding requests from the Least Request balancer (LRLB) point-of-view as a queue for this discussion, but just note that the observed behavior depends on how the applications handle concurrent requests. Given this, I understand why you would say that the LRLB assumes all instances have the same processing time for a request, but it's not quite accurate to think this way. We make no assumptions about an instance's processing time.

The LRLB simply favors endpoints with less in-flight requests via P2C selection. This means we are balancing the number of in-flight requests to each instance from each LB's perspective. I keep bringing up Little's Law in my comments because it is useful for understanding the relationship between the avg queue size, avg request latency, and avg RPS:

avg_queue_size = avg_latency * avg_RPS

The LRLB is trying to keep the avg_queue_size balanced across the endpoints in the host set. This means that if some endpoints have elevated avg_latency relative to others, their queue size will grow if the rate of requests to them stays the same. So to keep things balanced, fewer requests will be sent to the high-latency endpoints.
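
To put rough numbers on that (illustrative only):

```cpp
#include <iostream>

// Little's law: avg_queue_size = avg_latency * avg_RPS. If the LB succeeds in
// keeping in-flight requests (the "queue") balanced, a higher-latency host
// must end up receiving a proportionally lower request rate.
int main() {
  const double target_in_flight = 3.0;  // what the LB balances towards
  const double fast_latency_s = 0.020;  // 20ms host
  const double slow_latency_s = 0.060;  // 60ms host
  std::cout << "fast host gets ~" << target_in_flight / fast_latency_s
            << " rq/s\n";  // 150
  std::cout << "slow host gets ~" << target_in_flight / slow_latency_s
            << " rq/s\n";  // 50
}
```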

Errors are accounted for via outlier detection, which should be on by default.

As for the comparison between Least Request and Peak EWMA, I cannot share our online metrics, but I can describe the approximate curves of one of our online services under different load balancing algorithms when a single-instance problem occurred. The service runs on about two thousand 16c32G instances (including Intel and AMD processors of different models and production batches, hosts with different workloads, etc.), and single-instance problems cause requests to that instance to time out frequently (based on real alerts).

I'm not sure I understand this as it's missing several important details:

This also doesn't really explain how EWMA LBs would perform better. Can you provide a simple scenario where the EWMA LB is expected to work better than the LR LB? For example, let's say you have 3 hosts and you want to balance requests across them. What characteristics would these hosts have such that Least Request would not work for you?

Also, please remember from my last comment that you can write a load balancer extension and experiment with any algorithm you want. If the algorithm you implement there is shown to behave better in your Envoy deployment, that data would be very valuable for this discussion.

Buffer0x7cd commented 3 months ago

Not the original commenter, but I wanted to chime in with some of my opinions.

The paper "Load is not what you should balance: Introducing Prequal" goes into more detail about where latency-based (or combined latency and in-flight request) load balancing outperforms the power of two choices, especially at high load. Section 5.2, "Replica selection rule", does a good job of comparing the effect of different load balancing algorithms on tail latency.

One interesting idea from the paper is to combine in-flight requests and latency into an approach called hot-cold selection. To quote directly from the paper:

To minimize both latency and RIF, Prequal selects replicas using a hot-cold lexicographic (HCL) rule that labels probes in the pool as either hot or cold based on their RIF value. Prequal clients maintain an estimate of the distribution of RIF across replicas, based on recent probe responses. They classify pool elements as hot if their RIF exceeds a specified quantile (QRIF) of the estimated distribution; otherwise, cold. In replica selection, if all probes in the pool are hot, then the one with the lowest RIF is chosen; otherwise, the cold probe with the lowest latency is chosen. Our experiments (§5.2) suggest that QRIF ∈ [0.6, 0.9] is a good choice, although even 0 is effective (i.e., RIF-only control).

I think the above paragraph does a good job of explaining the situations where one might want to use latency as a signal for load balancing rather than active requests.
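
As a sketch of that hot-cold rule (hypothetical types; q_rif and the probe pool come from the paper, and nothing here is an existing Envoy API):

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Sketch of Prequal's hot-cold lexicographic (HCL) rule from the quoted
// paragraph. Probes carrying RIF (requests in flight) and latency are
// classified as hot or cold against a RIF threshold taken at quantile q_rif;
// cold probes compete on latency, and only if everything is hot do we fall
// back to lowest RIF.
struct Probe {
  uint64_t rif;
  double latency_ms;
};

Probe selectHCL(std::vector<Probe> pool, double q_rif) {
  // Threshold: the q_rif quantile of RIF across the pool (a stand-in for the
  // client's running estimate of the RIF distribution).
  std::vector<uint64_t> rifs;
  for (const auto& p : pool) rifs.push_back(p.rif);
  std::sort(rifs.begin(), rifs.end());
  const uint64_t threshold = rifs[static_cast<size_t>(q_rif * (rifs.size() - 1))];

  const Probe* best_cold = nullptr;
  const Probe* best_hot = nullptr;
  for (const auto& p : pool) {
    if (p.rif <= threshold) {
      if (!best_cold || p.latency_ms < best_cold->latency_ms) best_cold = &p;
    } else {
      if (!best_hot || p.rif < best_hot->rif) best_hot = &p;
    }
  }
  return best_cold ? *best_cold : *best_hot;  // prefer cold-by-latency
}

int main() {
  std::vector<Probe> pool = {{2, 25.0}, {1, 40.0}, {8, 10.0}};
  const Probe chosen = selectHCL(pool, 0.7);
  std::cout << "chosen rif=" << chosen.rif << " latency=" << chosen.latency_ms << "\n";
}
```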

When the system is in steady state (meaning there is zero or close to zero queuing in the application stack), it's rare to see the majority of backends having non-zero active requests, unless the load balancing is done via a central load balancer (in which case client in-flight requests == server in-flight requests) or the system is pushed close to capacity. This becomes even more likely when the backends are small (for example, Python services running with uWSGI, which are often configured with a large replica count but a small number of workers per replica, like 4 or 8). So when the LB has to choose between two backends that both have zero in-flight requests, choosing the one with lower latency could end up being better.

Let me know what your thoughts are on this. Personally, I would love to help with this.

tonya11en commented 3 months ago

That's some convincing data! I'll need to do a more in-depth read, but from my quick pass the data in section 5.2 makes a good case for implementing the Prequal algorithm.

Rather than reinventing the wheel w.r.t. the probes, I'd recommend figuring out how the Load Reporting Service can be used. Let me know your thoughts.

@Buffer0x7cd if you are willing to take the lead on this, I'd be happy to provide guidance and review. Are you familiar enough with the codebase to talk through how you imagine the implementation to work?

Buffer0x7cd commented 3 months ago

Yeah, LRS was the first thing that came to mind when thinking about an alternative to implementing probes, although I have a few questions about including load in the feedback loop.

Currently, I can think of two approaches to include load info:

  1. As you mentioned, using LRS to propagate load to a management server, which can implement the correct algorithm and push updates to Envoy servers via EDS. Looking at some of the past issues, I came across a document which suggests a similar design. I am thinking we can use the updates from EDS to implement what the paper refers to as a probing pool: each Envoy server receives a list of endpoints which it can use for host selection.

  2. Piggybacking the load information as part of the response in the form of a header and using the data provided in the header to adjust the weight of the endpoints. This seems like a popular choice to solve load balancing issues which arise due to the client's view of load vs. the server's view of load, especially on systems where the number of clients can be close to the number of backend tasks (like sidecars). Some blogs I came across discuss using a similar strategy from Uber, Dropbox, and Netflix. In this design, we can offload the calculation of load to servers (based on in-flight requests, latency in the last X seconds). Although I'm not sure about the implications of changing endpoint weights after each request (I am still trying to wrap my head around the load balancing implemented in Envoy, but based on some cursory reading, it seems like the weights are implemented as a min-heap when all weights are unequal). So if we try to update the weights dynamically (assuming we implement this as an Envoy filter that updates the weight as part of the response cycle), that can create issues?
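
To make option 2 above a bit more concrete, this is the sort of thing I have in mind (everything here, including the header value format and the weight formula, is made up for illustration; a real design would presumably standardize on something like ORCA):

```cpp
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <string>

// Illustration of in-band load reporting: the server reports its own
// utilization in a response header and the client folds it into the
// endpoint weight with a bit of smoothing.
struct Endpoint {
  std::string name;
  double weight = 1.0;
};

// Called on each response; the header value is the server-reported
// utilization in [0, 1] (e.g. from a hypothetical "x-endpoint-load" header).
void updateWeight(Endpoint& ep, const std::string& load_header_value,
                  double smoothing = 0.3) {
  const double load =
      std::min(1.0, std::max(0.0, std::atof(load_header_value.c_str())));
  const double target = 1.0 - load;  // less loaded => higher weight
  ep.weight = (1.0 - smoothing) * ep.weight + smoothing * target;
}

int main() {
  Endpoint ep{"pod-a"};
  updateWeight(ep, "0.8");  // heavily loaded server pulls the weight down
  updateWeight(ep, "0.8");
  std::cout << ep.name << " weight: " << ep.weight << "\n";
}
```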

Are you familiar enough with the codebase to talk through how you imagine the implementation to work?

I haven't worked with C++ in a long time since it hasn't been needed for my work, so I will probably spend a few weeks getting familiar with the code base.

Let me know your thoughts.

tonya11en commented 3 months ago

@Buffer0x7cd don't worry about the C++ details, for now. Let's first just flesh out a design for how probes would work and clarify the algorithm. I can help with the load balancer implementation details if you can take point on figuring out how probes would work.

I haven't gotten around to giving the paper more attention yet, but one of the things I want to make sure of is that this makes sense to implement in Envoy's load balancer (as opposed to being a control plane feature). We just can't have a scenario where the control plane goes down and the load balancing algorithm stops working. If there's no way to avoid that, we should look at possibly implementing this in Istio or thinking about another way to get this kind of functionality into Envoy.

Buffer0x7cd commented 3 months ago

Thanks, I'll try to spend some time this week fleshing out how probes would work. Alternatively, I'm also planning to read the C3 paper (it's mentioned in the Prequal paper and has very similar performance to the probe-based approach) to see if the approach described in C3 is a better fit.

From looking at the presentation, the main difference I can see is that C3 doesn't rely on probes and instead uses the load info propagated by servers (in-band via headers) to select the best replica. It's currently implemented in Cassandra, and Spotify has used the same approach in their own in-house proxy (see ELS - Part 1).

but one of the things I want to make sure of is that this makes sense to implement in Envoy's load balancer (as opposed to being a control plane feature). We just can't have a scenario where the control plane goes down and the load balancing algorithm stops working

One question I wanted to ask: in your opinion, does propagating the load info in-band (e.g. as part of response headers) fall under the data plane or the control plane?

In my view, both Prequal and C3 rely on backends calculating their own load and finding a way to propagate it, either through a control plane or via in-band response headers. This seems more aligned with the data plane, as it resembles the use of headers like x-envoy-degraded, which allow backends to provide feedback to Envoy. I agree that cooperation from the backends is necessary for effective load balancing, as decisions based solely on data gathered from the client side may be suboptimal due to the client's limited view.

So, from my perspective, we encounter two main challenges:

  1. How to accurately calculate load, based on various metrics like requests in flight (RIF) and latency. I believe this is best handled on the server side, since clients only have a partial view of the traffic. This issue becomes more pronounced in scenarios with a large number of clients, such as in a sidecar deployment model.

  2. How to effectively propagate load information back to the envoy, enabling it to inform load balancing policies. This could be achieved through out-of-band methods, like a control plane or lightweight probes, or in-band methods, such as propagating the information via response headers similar to C3 or ELS.

I still have to read a bit more on the second question before fleshing out the details of how that could work, but I'm curious whether you have a different opinion on the first problem.

tonya11en commented 3 months ago

I agree with you on what the main challenges are. IMO for problem 1, these measurements must be done on the server side. This is more-or-less a control loop similar to the adaptive concurrency filter. Systems like that require a "closed loop", so that means we can't have incomplete information about the load entering the system.

Propagating the load information via the server response headers might cause problems in systems with multiple hops. For example, a sidecar proxy that has to reach the backend endpoints via some intermediate gateway would really throw a wrench in the gears, because the sidecar wouldn't have knowledge of which backend a request actually went to. This is why I'd prefer to use the control plane for propagating load information and adjusting endpoint weights: the CP would get information straight from the backends, and this information would be passed along to the downstream clients via EDS.

As we discuss this more, it's starting to look less like a data plane load balancing algorithm and more like a control plane use case for LRS. I still think this would be great, but we might want to think about where this feature would make the most sense. It could work in Envoy, but I could just as easily see this being an Istio (or other CP) feature too.

What is your use case?

Buffer0x7cd commented 3 months ago

What is your use case?

Currently, least-conn doesn't work well in systems where the number of load balancers is quite high, especially when the traffic flowing between clients and servers is low (due to a large number of clients and servers). This is a common problem where small pods with a higher replica count are preferred and each replica has limited concurrency (like uWSGI servers running with a small number of workers such as 4 or 8).

The above-mentioned Netflix blog does a good job explaining the problem, so I will quote the relevant lines here:

Another problem we experience with relying only on the client-side view is that for large clusters — particularly when combined with low traffic — a load balancer often has only a few in-use connections to a subset of the servers in a pool of hundreds. So when it’s choosing which server is least-loaded, it can often just be choosing between zero and zero — i.e., it has no data on the utilization of either of the servers it’s choosing between, so it has to just guess randomly.

In these kinds of setups, least-conn ends up devolving into random load balancing, since the majority of endpoints have 0 in-flight requests from the client's view. This creates an additional problem: we now need more servers for the backend, since random load balancing causes an increase in queuing when running the system above 30-40% utilization, which amplifies the problem further and makes the proxy's local state diverge further from the actual global state of the service.

The same issue happens when a client with a large number of backends (such as a fan-out layer in a microservice system) calls another service, since the large number of client tasks dilutes the local state even more (although subsetting does help in this case).

For such systems, having access to servers' utilization is helpful since it helps converge the local state of a proxy closer to the actual global state.

One alternative here is to proxy all the requests from local proxies via a middle layer of proxies, which have better local state and can converge more closely to the global state of the service (similar to what Istio ambient mesh does via waypoint proxies). It would be great, though, if clients could have access to a better load balancing system without the need to run a central layer of proxies (which brings a lot of different challenges).

For example, a sidecar proxy that has to reach the backend endpoints via some intermediate gateway would really throw a wrench in the gears because the sidecar wouldn't have knowledge of which backend a request actually went to.

In such cases, shouldn't load balancing be happening at the intermediate gateway level, since the sidecar doesn't have visibility into the backends? Even with other load balancing methods such as least-conn, the sidecar would probably fail to do any meaningful work, since all it has access to is the intermediate gateway.

I do think there is an advantage to having this built into the data plane, along with documentation on what type of scenario this load balancing system should be used in.

This is why I'd prefer to use the control plane for propagating load information and adjusting endpoint weights: the CP would get information straight from the backends, and this information would be passed along to the downstream clients via EDS.

I think one of the upsides of propagating the load info in-band (for example, as part of the response cycle) compared to propagating it out-of-band (via the control plane) is that, with in-band propagation, a client sending X requests per second gets fresher load information than a client sending Y requests per second (where X > Y). The frequency of load updates scales with the client's outgoing requests, which helps it make better decisions. It's the same principle used with probes, which are scaled with the number of queries.

Consider an example where one client is sending 100 RPS while another is sending 10 RPS. With a control plane, scaling the load info with the number of requests is hard, since it is difficult for the control plane to guess the correct update rate for each client. Also, for clients with a medium to high request rate (like 100 or 1000 RPS), the control plane will need a large number of updates to keep the client-side load info in sync with the server-side load info. In-band load propagation makes this much less complex, since the rate of load updates scales with the rate of requests, ensuring that clients have the most up-to-date view of backend load and avoid overloading backends as much as possible (although eliminating inaccurate load balancing decisions is impossible, since there will always be some gap between the client's local state and a backend's global state).

Another good point (although maybe less relevant here) is that Envoy is deployed in many environments without Istio, so a mechanism to make such a system work would be a good enhancement to the load balancing subsystem (something like go-control-plane, which uses LRS to propagate load info back to the control plane).

Given the recent trend towards using server-side load reporting to improve load balancing decisions (there are a few publicly available blogs that I added in my previous comments), I am personally quite inclined towards an in-band load reporting approach. In my mental model, I have been thinking about this as an extension to a weighted least-conn load balancing policy. Even without the hot-cold approach, using a linear combination of RIF and latency as the scaling factor outperforms the other load balancing implementations in the paper's comparison, which is a big improvement over existing least-loaded load balancing systems.

Another interesting point is the comparison with YARP-Po2C (mentioned in section 5.2), which is very similar to the control-plane-based system you are suggesting. Even with a 500ms poll interval, it was still outperformed by both the linear and HCL implementations of Prequal. Given how quickly the signals change (especially the requests-in-flight metric), I think a control plane would struggle to keep the load information fresh, given that it needs to support a large number of clients with varying outgoing throughput.

Let me know what your thoughts are here.

jizhuozhi commented 3 months ago

Sorry for not replying to this issue for a long time. I can add some additional information regarding the above discussion:

I have used an implementation of a similar mechanism: service load is reported through the sidecar, the weight is dynamically calculated by the mixer, and the result is then distributed through the control plane. Yes, this is what we often call the server-side view. It brought us stability improvements: it can make CPU utilization more balanced and avoid outliers, but it does not help much in terms of latency. (I think the main topic of this issue is latency.)

The reason is that the observed delay is composed of the point-to-point network delay plus the server delay (which is what we often call the client-side view). In large-scale clusters the network situation is very complex: different hosts, switches, different physical clusters within the same logical cluster, and so on. As an anonymized data point, for one microservice we observed through APM that the server-side view of upstream delay was about 5ms, but with network communication included, the client-side view delay was about 15ms (p99).

tonya11en commented 3 months ago

It can make CPU utilization more balanced and avoid outliers, but it does not help much in terms of latency. (I think the main topic of this issue is latency.)

@jizhuozhi do you have experimental or observational data to show this? This is at odds with the data/results presented in the paper linked in https://github.com/envoyproxy/envoy/issues/20907#issuecomment-2125866863.

@Buffer0x7cd you make a compelling case for doing this in the data plane via information passed along in the response headers. Do you want to sketch out what this looks like at a high level? More specifically:

  1. What would the configuration look like?
  2. How does the server calculate and represent how loaded it is?
  3. What would happen in cases where some servers don't send the load reports back in their responses? For example, if the endpoint is something like a database without an Envoy sidecar fronting it?

Buffer0x7cd commented 3 months ago

I have used an implementation of a similar mechanism: service load is reported through the sidecar, the weight is dynamically calculated by the mixer, and the result is then distributed through the control plane. Yes, this is what we often call the server-side view. It brought us stability improvements: it can make CPU utilization more balanced and avoid outliers, but it does not help much in terms of latency. (I think the main topic of this issue is latency.)

I think the implementation @jizhuozhi is mentioning propagates the load information via the control plane (unless I misunderstood it), which is different from what's described in the paper (the closest one is YARP-Po2C, which uses server-side load reports for selecting the backend).

Sure @tonya11en, I will try to sketch the high-level details of how the system might look.

jizhuozhi commented 3 months ago

I have used an implementation of a similar mechanism

@tonya11en, the similar mechanism is not comparable to Prequal but to the "adjustment of endpoint weights" mentioned above; yes, as @Buffer0x7cd said, it propagates the load information via the control plane.

@jizhuozhi do you have experimental or observational data to show this?

I'm sorry, but I can't share experimental or monitoring data without permission from the company's legal department, so I can only describe the anonymized data in text as best I can.

markdroth commented 3 months ago

2. Piggybacking the load information as part of the response in the form of a header and using the data provided in the header to adjust the weight of the endpoints. This seems like a popular choice to solve load balancing issues which arise due to the client's view of load vs. the server's view of load, especially on systems where the number of clients can be close to the number of backend tasks (like sidecars). Some blogs I came across discuss using a similar strategy from Uber, Dropbox, and Netflix. In this design, we can offload the calculation of load to servers (based on in-flight requests, latency in the last X seconds). Although I'm not sure about the implications of changing endpoint weights after each request (I am still trying to wrap my head around the load balancing implemented in Envoy, but based on some cursory reading, it seems like the weights are implemented as a min-heap when all weights are unequal). So if we try to update the weights dynamically (assuming we implement this as an Envoy filter that updates the weight as part of the response cycle), that can create issues?

Just wanted to mention that if we do go with the above approach, we should make use of the ORCA protocol described in #6614. Note that gRPC does use ORCA data in our WRR LB policy, as per gRFC A58: weighted_round_robin LB policy.