envoyproxy / gateway

Manages Envoy Proxy as a Standalone or Kubernetes-based Application Gateway
https://gateway.envoyproxy.io
Apache License 2.0

Locality Based Routing Support #1909

Open tanujd11 opened 1 year ago

tanujd11 commented 1 year ago

Description: Implement locality based routing support by default in EG. Now that we can have individual endpoints as backends to EG, can we support region/zone/subzone based routing based on EndpointSlice information, node labels, etc.?

arkodg commented 1 year ago

Hey @tanujd11, from a user perspective, can you share what you would like to happen on the data plane (from the gateway to multiple backend endpoints with different topology info)?

arkodg commented 1 year ago

I understand this is very useful for optimizing east-west traffic within a cluster. Is that also the case for north-south traffic?

tanujd11 commented 1 year ago

I think an Envoy gateway running in us-east-1/us-east-1a should prefer backends in the same zone to avoid cross-zone traffic. I think this behaviour could be made the default, as cross-zone communication is obviously costly. WDYT?

arkodg commented 1 year ago

thanks, here's something more to think about

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

arkodg commented 5 months ago

there's a new field in the Service spec (trafficDistribution.preferClose) https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution that we could consider using to automate priority amongst endpoints within a Service

aoledk commented 5 months ago

That could be an option once this new field is stable and the corresponding K8s version is widely adopted.

Before that, IMO it's better to load balance across endpoints in the cluster using Envoy's own capabilities.

Currently EG implements locality weighted load balancing ^1, where one BackendRef is translated to one LocalityLbEndpoints.

locality := &endpointv3.LocalityLbEndpoints{
    Locality: &corev3.Locality{
        Region: fmt.Sprintf("%s/backend/%d", clusterName, i),
    },
    LbEndpoints: endpoints,
    Priority:    0,
}

// Set locality weight
var weight uint32
if ds.Weight != nil {
    weight = *ds.Weight
} else {
    weight = 1
}

However, the endpoints inside a single LocalityLbEndpoints may be running in different zones, so cross-zone cost can't be saved this way.

With Envoy's capabilities, either priority levels ^2 or zone aware routing ^3 can achieve the goal of saving cross-zone cost.

priority levels

  1. Each backend endpoint must be set with its correct zone. The zone can be retrieved from the EndpointSlice, which inherits it from the Node's topology.kubernetes.io/zone label.
  2. Envoy must be started with the --service-zone command-line option, which indicates the zone the Envoy Pod is running in.
  3. EG rearranges the EDS resources for each Envoy: if Envoy and a backend endpoint are in the same zone, the endpoint gets priority 0, otherwise 1.
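
Step 3 above can be sketched roughly as follows. This is only an illustration: the Endpoint type and the groupByPriority function are hypothetical names, not EG code, and a real implementation would build LocalityLbEndpoints instead of plain slices. The fallback that promotes all endpoints to priority 0 when none share Envoy's zone is a design choice of the sketch, made so that priority 0 is never left empty.

```go
package main

import "fmt"

// Endpoint is a hypothetical, simplified backend endpoint.
// Zone would come from the EndpointSlice (topology.kubernetes.io/zone).
type Endpoint struct {
	Address string
	Zone    string
}

// groupByPriority places endpoints in Envoy's own zone at priority 0 and all
// other endpoints at priority 1. If no endpoint shares Envoy's zone (or the
// zone is unknown), every endpoint is kept at priority 0.
func groupByPriority(envoyZone string, eps []Endpoint) map[uint32][]Endpoint {
	groups := make(map[uint32][]Endpoint)
	for _, ep := range eps {
		var p uint32 = 1
		if envoyZone != "" && ep.Zone == envoyZone {
			p = 0
		}
		groups[p] = append(groups[p], ep)
	}
	// Avoid an empty priority 0 when nothing is local.
	if len(groups[0]) == 0 && len(groups[1]) > 0 {
		groups[0] = groups[1]
		delete(groups, 1)
	}
	return groups
}

func main() {
	eps := []Endpoint{
		{Address: "10.0.0.1", Zone: "us-east-1a"},
		{Address: "10.0.0.2", Zone: "us-east-1b"},
		{Address: "10.0.0.3", Zone: "us-east-1a"},
	}
	groups := groupByPriority("us-east-1a", eps)
	fmt.Println("priority 0:", groups[0])
	fmt.Println("priority 1:", groups[1])
}
```

Because the result depends on the zone of the Envoy that receives it, the same backend produces a different grouping for each zone, which is why the EDS resources would have to be arranged per Envoy.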

zone aware routing

This approach is mutually exclusive with locality weighted load balancing, since in the case of locality aware LB, we rely on the management server to provide the locality weighting, rather than the Envoy-side heuristics used in zone aware routing.

  1. Each backend endpoint must be set with its correct zone. The zone can be retrieved from the EndpointSlice, which inherits it from the Node's topology.kubernetes.io/zone label.
  2. Envoy must be started with the --service-zone command-line option, whose value is the zone the Envoy Pod is running in.
  3. Envoy's bootstrap config must set cluster_manager.local_cluster_name, which identifies the fleet the Envoy Pod belongs to. In the implementation it would be the irKey.
  4. Add the cluster corresponding to cluster_manager.local_cluster_name to the CDS resources.
  5. Design a mechanism to discover the Envoy Pods that belong to cluster_manager.local_cluster_name as endpoints, and add them to the EDS resources.
  6. Neither the Envoy (local) cluster nor the backend cluster may be in panic mode ^5.
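
To show why steps 3 to 5 are needed, here is a simplified model of the zone-aware heuristic as I understand it from the Envoy docs: Envoy compares the share of its own fleet in the local zone with the share of upstream hosts in that zone, and keeps only as much traffic local as the upstream zone can absorb evenly. The function name and signature are hypothetical, and this is not Envoy's actual code; real Envoy also applies minimum cluster size and panic checks that are left out here.

```go
package main

import "fmt"

// localZoneFraction estimates the fraction of an Envoy's traffic that stays in
// its own zone under zone aware routing.
//
//	localInZone / localTotal:       Envoy Pods in this zone / in the whole fleet
//	upstreamInZone / upstreamTotal: backend hosts in this zone / in the cluster
//
// ok is false when the inputs do not allow a local preference to be computed.
func localZoneFraction(localInZone, localTotal, upstreamInZone, upstreamTotal int) (frac float64, ok bool) {
	if localInZone <= 0 || localTotal <= 0 || upstreamInZone <= 0 || upstreamTotal <= 0 {
		return 0, false
	}
	// ratio = (upstreamInZone/upstreamTotal) / (localInZone/localTotal)
	ratio := float64(upstreamInZone*localTotal) / float64(upstreamTotal*localInZone)
	if ratio >= 1 {
		// The local upstream zone has at least its fair share of hosts,
		// so all traffic can stay in the zone.
		return 1, true
	}
	// Otherwise only part of the traffic stays local; the rest spills over.
	return ratio, true
}

func main() {
	// 1 of 4 Envoys and 2 of 4 backends are in this zone.
	fmt.Println(localZoneFraction(1, 4, 2, 4))
	// 2 of 5 Envoys but only 1 of 5 backends are in this zone.
	fmt.Println(localZoneFraction(2, 5, 1, 5))
}
```

The localInZone and localTotal inputs are exactly what the local cluster from steps 3 to 5 provides, which is why Envoy needs to see its own fleet as a cluster.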

personal preference

Steps 1 and 2 are required by both approaches. The difference is that priority levels can work with the already implemented locality weighted load balancing, while zone aware routing can't. Priority levels also appear easier to implement, but they require EDS resources to be arranged per individual Envoy in the xds/cache module. That holds whether EG does this itself or exposes a new xDS Hook API, such as PostEndpointModify(ClusterLoadAssignment, Node), that allows an extension server to do it.

arkodg commented 5 months ago

thanks for outlining the steps @aoledk ! we currently have https://github.com/envoyproxy/gateway/issues/3055 open to get explicit priority per backendRef and program that into the xds cluster resource.

In the future, we can use this issue to make sure we track the auto priority work, the field in k8s preferClose could be the knob for users to say they want to opt in to this feature

guydc commented 5 months ago

Hi @aoledk, regarding:

priority levels [...] EG rearranges EDS resources for each Envoy, if Envoy and Backend endpoint are in same zone, priority as 0, else 1.

Is this option viable? Can our XDS server produce different EDS for different envoy pods that are part of the same Envoy deployment?

modatwork commented 5 months ago

I think it's possible. xDS server can read the locality info of envoy node.

The cache will be keyed based on a pre-defined hash function whose keys are based on the Node information.

// Identifies a specific Envoy instance. Remote server may have per Envoy configuration.
message Node {
  // An opaque node identifier for the Envoy node. This must be set.
  string id = 1;
  // The cluster that the Envoy node belongs to. This must be set.
  string cluster = 2;
  google.protobuf.Struct metadata = 3;
  Locality locality = 4;
  // This is motivated by informing a management server during canary which
  // version of Envoy is being tested in a heterogeneous fleet.
  string build_version = 5;
}
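
As a rough sketch of that idea, the cache key could be derived from the node's cluster and zone, so that Envoys of the same deployment in the same zone share one set of resources. The types below are simplified stand-ins for the Node message above, and snapshotKey is a hypothetical name. go-control-plane's snapshot cache accepts a node hash of this kind, but whether EG would key its cache this way is an assumption.

```go
package main

import "fmt"

// Locality and Node are simplified stand-ins for the xDS Node message.
type Locality struct {
	Region  string
	Zone    string
	SubZone string
}

type Node struct {
	ID       string
	Cluster  string
	Locality Locality
}

// snapshotKey returns the cache key for a node. Nodes in the same cluster and
// zone map to the same key, so the cache holds one entry per (cluster, zone)
// rather than one per Envoy instance.
func snapshotKey(n Node) string {
	zone := n.Locality.Zone
	if zone == "" {
		zone = "unknown" // nodes without locality share a zone-agnostic entry
	}
	return n.Cluster + "/" + zone
}

func main() {
	a := Node{ID: "envoy-1", Cluster: "default/eg", Locality: Locality{Zone: "us-east-1a"}}
	b := Node{ID: "envoy-2", Cluster: "default/eg", Locality: Locality{Zone: "us-east-1a"}}
	c := Node{ID: "envoy-3", Cluster: "default/eg", Locality: Locality{Zone: "us-east-1b"}}
	fmt.Println(snapshotKey(a), snapshotKey(b), snapshotKey(c))
}
```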

guydc commented 5 months ago

Thanks for pointing that out @modatwork. My other concerns with respect to this approach are:

In general:

Is there a reason to prefer the Priority-based approach? I'm not sure that it's significantly simpler than enabling zone-aware routing.

arkodg commented 5 months ago

is @modatwork the same person as @aoledk :) ?

Possible impact on memory consumption if we have to maintain a copy of the cache for each locality. Not sure if that's already the situation today. @arkodg - do you know?

@guydc we are demuxing on gateway/IR today. With locality, it would add another lookup dimension and would increase memory by the number of localities times the total xDS resources per gateway.

aoledk commented 4 months ago

@arkodg I work together with @modatwork

arkodg commented 3 months ago

hey @aoledk , adding this issue to the v1.2 milestone, is this something you can help with ?

  1. Let's configure zone aware routing in Envoy by default https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/zone_aware_routing
  2. If a Service has TrafficDistribution set to PreferClose https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution, let's rearrange the EDS endpoints (so the Service opts in)
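
The opt-in check in item 2 could look roughly like this. ServiceSpec here is a pared-down stand-in for the Kubernetes type (where trafficDistribution is an optional string), and wantsZonePreference is a hypothetical helper name, not existing EG code.

```go
package main

import "fmt"

// ServiceSpec is a minimal stand-in for the Kubernetes Service spec.
// TrafficDistribution is optional, so it is modelled as a pointer.
type ServiceSpec struct {
	TrafficDistribution *string
}

// preferClose is the value of spec.trafficDistribution that opts a Service in.
const preferClose = "PreferClose"

// wantsZonePreference reports whether the Service asked for topologically
// close endpoints, in which case the EDS endpoints would be rearranged.
func wantsZonePreference(spec ServiceSpec) bool {
	return spec.TrafficDistribution != nil && *spec.TrafficDistribution == preferClose
}

func main() {
	v := preferClose
	fmt.Println(wantsZonePreference(ServiceSpec{TrafficDistribution: &v}))
	fmt.Println(wantsZonePreference(ServiceSpec{}))
}
```

Services that leave the field unset keep today's behaviour, so the feature stays opt-in per Service.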

aoledk commented 3 months ago

@arkodg I can help.

arkodg commented 3 months ago

awesome thanks @aoledk !

arkodg commented 1 month ago

hey @aoledk still planning on working on this one for v1.2 ?

aoledk commented 1 month ago

Hi @arkodg, I'm currently working on bringing in EG v1.1. I will continue on this feature next month, but I'm not sure whether it can be merged into v1.2 (due by October 30, 2024); maybe v1.3.

arkodg commented 1 month ago

thanks for the update @aoledk, let us know if you hit any issues while running EG v1.1. Moving this issue into the backlog.

aoledk commented 1 month ago

@arkodg LGTM.
