Support LB priority for endpoints

guydc commented 5 months ago

Description: Envoy supports routing priorities:

When load balancing traffic, envoy uses healthy endpoints (as determined by configured health checks).
Users can assign priorities to endpoints, determining which healthy endpoints are preferred for load balancing.

This allows users to configure a failover mechanism from primary to secondary endpoints.

Envoy Gateway currently configures a default priority for all endpoints: https://github.com/envoyproxy/gateway/blob/f9409e477e866a012757bb9eb4cbc9ac35c7f3a7/internal/xds/translator/cluster.go#L408

LB Priority can be implemented by extending existing Gateway-API resource:

BackendRefs may support a priority field, similar to the existing weight field.
The BackendLBPolicy resource may be extended to support additional LB considerations, such as priorty.

Other options:

EG can support this as an additional attribute of the extended BackendRef proposed in #3069
EG can support a custom BackendRef target, which will include a priority field, such as the proposed Backend API: #3063
EG BackendTrafficPolicy can be used to assign priorities. For example, the policy will hold a map of BackendRefs to priorities. Or, the policy can attach directly to Service resources.

arkodg commented 5 months ago

ideally prefer if this was in upstream relates to https://github.com/kubernetes-sigs/gateway-api/discussions/2304 & https://github.com/kubernetes-sigs/gateway-api/issues/1802

arkodg commented 5 months ago

if we add into field into non xRoute backendRefs, we may need to pluraize existing instantiations to backendRefs

ext auth
otel cc @guydc @zirain

guydc commented 5 months ago

if we add into field into non xRoute backendRefs, we may need to pluralize existing instantiations to backendRefs

Yes, also ext-proc. I don't object, and this is in-line with gateway-api. Do you think that we should consider a different intermediate approach?

arkodg commented 5 months ago

I'd vote to pluraize the backendRefs for non xRoute asap while we are in alpha, which can be a good place to test the priority / standby use case

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

modatwork commented 4 months ago

This could be used to solve https://github.com/envoyproxy/gateway/issues/1909, though it may require further refinement. Specifically, the Envoy Gateway should allocate distinct priority levels to the same upstream endpoint, varying by the Envoy Pod’s location in different availability zones. For instance, for an identical endpoint in zone A, the Envoy Pod situated in zone A should be assigned priority 0, whereas Envoy Pods in zones B and C should be assigned priority 1.

Do you think Envoy Gateway will handle this? Or leave it to the user?

guydc commented 4 months ago

Do you think Envoy Gateway will handle this? Or leave it to the user?

We briefly discussed this scenario in the community meeting. It seems like this will require pushing different config to different envoy proxies based on zone information. For the time being, we'll focus on supporting the simpler use case described above.

aoledk commented 3 months ago

As mentioned in https://github.com/envoyproxy/gateway/issues/1909#issuecomment-2126782250, Envoy in different zones may use same backend endpoint as different priority to reduce cross-zone cost.

It's a required feature for production adoption. Does community have roadmap on this feature ?

aoledk commented 3 months ago

https://github.com/envoyproxy/gateway/issues/1909#issuecomment-2128073855 says:

thanks for outlining the steps @aoledk ! we currently have #3055 open to get explicit priority per backendRef and program that into the xds cluster resource.

In the future, we can use this issue to make sure we track the auto priority work, the field in k8s preferClose could be the knob for users to say they want to opt in to this feature

@arkodg One backendRef (mainly Service) will refer to endpoints from multiple zones, priority should not be applied on backendRef, unless we use selector to collect endpoints in same zone into single backendRef.

Besides, same endpoint will have different priority in Envoy.

So maybe we can't assign a fixed priority for backendRef, codes will be needed to assign right priority for specific Envoy.

arkodg commented 3 months ago

there are two use cases for priority

the one you've highlighted where zones represent physical locations (done for HA purposes at the node level) so location aware routing reduces overall latency as well as cost (cross zone or region can be billed by cloud provider)
creating a primary and backup backendRef for HA cases (at the app/backend level)

aoledk commented 3 months ago

@arkodg Can we add a new xDS Hook API, like PostEndpointModify(ClusterLoadAssignment, Node) which allow extension server to assign endpoint priority for individual envoy. Leave these tasks to EG adopters themselves.

aoledk commented 3 months ago

hey @aoledk thanks for driving this work, before we implement this, we as a community need to align on what the feature is, is it

Auto Zone Aware Routing where a user sets a global field to enable this, and envoy proxy prioritizes endpoints in the same zone based on its zone ? Open Questions:

is this granular enough ? there are only 2 levels here - local or not local

Priorites for BackendRefs - Prioritizes one backendRef over another backendRef within the same HTTP Rule

@arkodg I've quoted your https://github.com/envoyproxy/gateway/issues/3538#issuecomment-2153387620 here to continue discussion about priority. I'll freeze #3542 until we reach agreement.

Assuming two cases demonstrated in image below.

Case A: priority isn't in gateway-api, EG will generate #1, #2 LocalityLbEndpoints.

Case B: priority is in gateway-api, EG will generate #3, #4, #5, #6 LocalityLbEndpoints.

Priority

Code in xds/cache will be used to modify priority for individual envoy based on envoy's running topology.

https://github.com/envoyproxy/gateway/blob/33fceb05ca611e30107ff841cac9d7762eea2048/internal/xds/cache/snapshotcache.go#L92-L96

We can have 2 options to enable topology aware routing ( NOT envoy zone aware routing ) based on priority.

Option 1

Add knob into BackendTrafficPolicy to enable topology aware routing for affected BackendRefs.

If we enable in Case A, #1, #2 LocalityLbEndpoints for envoy running in zone a will be modified by xds/cache like this:

#1 P=0
#2 P=1

If we enable in Case B, #3, #4, #5, #6 LocalityLbEndpoints for envoy running in zone a will be modified by xds/cache like this:

#3 P=0
#4 P=1
#5 P=2
#6 P=3

But this option has several challenges.

Challenge 1

The new knob is accessible in xds/translator, but xds/translator only pass xds to xds/server, then to xds/cache, a new communication channel need to be setup to pass this particular knob to xds/cache. It will add additional dependency between those modules.

Challenge 2

Should we allow user to enable topology aware routing for BackendRefs specified with priority ?

If yes, that means xds/cache will modify LocalityLbEndpoints's priority specified by user.

Option 2

Add knob into EnvoyGateway to enable topology aware routing for all BackendRefs.

If enabled, #1, #2, #3, #4, #5, #6 LocalityLbEndpoints for envoy running in zone a will be modified by xds/cache like this:

#1 P=0
#2 P=1
#3 P=0
#4 P=1
#5 P=2
#6 P=3

Challenge 1

Same to Option 1's Challenge 2.

is this granular enough ? there are only 2 levels here - local or not local

LocalityLbEndpoints can be set with region / zone / sub_zone:

zone can be retrieved from EndpointSlice directly.
region can be retrieved like this: use nodeName in EndpointSlice to query Node, retrieve region from Node topology.kubernetes.io/region label.
sub_zone` can't be retrieved.

Envoy can be set with region / zone / sub_zone.

Since Pod downward API doesn't expose Pod zone, region and zone can be retrieved like this: use nodeName via Pod downward API to query Node, retrieve region and zone from Node topology.kubernetes.io/region and topology.kubernetes.io/zone labels.
sub_zone can't be retrieved.

So region can be used as level bigger than zone technically, we can discuss whether we need to use region on topology aware routing.

arkodg commented 3 months ago

thanks for the detailed analysis @aoledk ! 1) looks like the "priority is in gateway api backendRef" case is simpler and straightforward and doesnt need zone based endpoint aggregation

Auto Zone Aware Routing 2) if we okay with 2 level priorities then envoy's inherent zone aware routing feature is the way to go Pros

easier to build / less complex to maintain and debug Cons
only 2 levels of priority - local to envoy or not 3) if we want multi level priority based on region / zone / sub zone, we'll have to key EDS o/p in the control plane based on envoy locality and start maintaining this extra dimension Pros
more levels of priority Cons
need to maintain per locality endpoint cache

As next steps, here is what I propose

Decide on which feature we want to prioritize first or both ? This is a great roadmap item for v1.2 (Oct 2024)
For Zone Aware Routing gather more feedback on how many levels of priority / locality we want to support

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

alexwo commented 1 month ago

thanks for the detailed analysis @aoledk !

looks like the "priority is in gateway api backendRef" case is simpler and straightforward and doesnt need zone based endpoint aggregation

Auto Zone Aware Routing 2) if we okay with 2 level priorities then envoy's inherent zone aware routing feature is the way to go Pros

easier to build / less complex to maintain and debug Cons

only 2 levels of priority - local to envoy or not

if we want multi level priority based on region / zone / sub zone, we'll have to key EDS o/p in the control plane based on envoy locality and start maintaining this extra dimension Pros

more levels of priority Cons

need to maintain per locality endpoint cache

As next steps, here is what I propose

Decide on which feature we want to prioritize first or both ? This is a great roadmap item for v1.2 (Oct 2024)

For Zone Aware Routing gather more feedback on how many levels of priority / locality we want to support

1 sounds straight forward and can bring value, we can extend this struct with an endpoint priority field: https://github.com/envoyproxy/gateway/blob/a43cc6c96ac147ce58db9714283e77d237c3d1e3/api/v1alpha1/shared_types.go#L471

Similar to how gwapiv1.BackendObjectReference has "weight". Then, translate this value to an envoy endpoint priority: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/endpoint/v3/endpoint_components.proto#envoy-v3-api-msg-config-endpoint-v3-localitylbendpoints

We can also have an E2E test for this, to demonstrate that when a high-prio backend is down, traffic is steered to the low prio backend.

This will only work for "technical" backends at the moment, like ext-auth/proc/observability sinks ...

If the direction sounds good, I can get started with adding it?

arkodg commented 1 month ago

hey @alexwo thanks for lookiing into this ! im hoping we can weigh in pros/cons of an explicit priority field vs adding another field like backupBackendRef

note for @aoledk who is looking into Zone Aware Routing, priority routing is mutually exclusive to it, and EG should be able to do zone aware routing for P=0 Hosts

arkodg commented 1 month ago

GCP expresses failover using a bool failover field inside backend definition https://cloud.google.com/compute/docs/reference/rest/v1/backendServices

alexwo commented 1 month ago

GCP expresses failover using a bool failover field inside backend definition https://cloud.google.com/compute/docs/reference/rest/v1/backendServices

Sure, we can simplify and offer the same level of flexibility as done in GCP. So that via the API layer we expose just a boolean and manage as priority at the translation layer, allowing for a priority of: primary / backup (0,1).

arkodg commented 1 month ago

nginx has a backup keyword It uses https://nginx.org/en/docs/http/ngx_http_upstream_module.html

I'm a +1 for the boolean option

also @alexwo another design decision is around when does an endpoint get removed from the load balancing pool and when does it get re added ? is the user responsible for setting up healthCheck (active or passive) which results in endpoints being removed/readded from the pool ?

guydc commented 1 month ago

+1 for a boolean as well.

when does an endpoint get removed from the load balancing pool and when does it get re added

I imagine that K8s service endpoints would be added/removed automatically based on K8s pod probes and status. Users can additionally opt-in and define active/passive healthchecks using backend traffic policy. This is an envoy recommendation for EDS: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/service_discovery#on-eventually-consistent-service-discovery

For non-service backends (Unix Socket, FQDN, ...) - users must to define active/passive healthchecks in order for priorities to actually have an impact.

alexwo commented 4 weeks ago

Sounds good @arkodg / @guydc,

I have updated: https://github.com/envoyproxy/gateway/pull/4033 to reflect the “designated failover backend” boolean approach.

envoyproxy / gateway