envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Load balancer fallbacks to another cluster #7454

Closed · yxue closed this 4 years ago

yxue commented 5 years ago

When using EDS to update the endpoint configuration dynamically, DNS resolution is not allowed for hostnames in the EDS response, as the comment says:

// The form of host address depends on the given cluster type. For STATIC or EDS,
// it is expected to be a direct IP address (or something resolvable by the
// specified :ref:resolver <envoy_api_field_core.SocketAddress.resolver_name>
// in the Address). For LOGICAL or STRICT DNS, it is expected to be hostname,
// and will be resolved via DNS.

and DNS resolution is not allowed via the custom resolver:

// The name of the custom resolver. This must have been registered with Envoy. If
// this is empty, a context dependent default applies. If the address is a concrete
// IP address, no resolution will occur. If address is a hostname this
// should be set for resolution other than DNS. Specifying a custom resolver with
// STRICT_DNS or LOGICAL_DNS will generate an error at runtime.

When adding external services to the load balancing endpoints via EDS, it would be impractical to add the IPs of the endpoints, considering that service VMs can and will go up and down and the service's endpoint IP addresses will change frequently. Supporting hostnames in the EDS response seems like a reasonable solution.

I'm not sure whether the lack of DNS resolution for hostnames is by design or an Envoy restriction. Please let me know; I can help work on this feature if Envoy needs it.

In Istio, we use the priority field to implement failover logic. When the endpoints at a higher priority are down, the load balancer selects the endpoints at a lower priority. This assumes that all the endpoints have the same settings (e.g. TLS context), but sometimes they differ. For example, mTLS is not required for an external fallback service, but it is required inside the service mesh. If the external service endpoints and internal service endpoints are added to one cluster, the traffic to the external endpoints will be broken.

Downstream Envoy setting:

clusters:
  - name: proxy
    type: strict_dns
    lb_policy: round_robin
    load_assignment:
      cluster_name: proxy
      endpoints:
      - priority: 1
        lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: proxy
                port_value: 80
      - priority: 0
        lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: proxy
                port_value: 443
    tls_context: {}

Upstream listener setting:

  listeners:
  - address:
      socket_address:
        address: 0.0.0.0
        port_value: 80
      ...
  - address:
      socket_address:
        address: 0.0.0.0
        port_value: 443
      ...
      tls_context:
        common_tls_context:
          tls_certificates:
          - certificate_chain:
              filename: /etc/cert.pem
            private_key:
              filename: /etc/key.pem
          validation_context: {}

When proxy:443 is down, the traffic that fails over to proxy:80 will be broken as well, because the cluster-wide tls_context still applies and proxy:80 doesn't support mTLS.

Thanks @PiotrSikora for the suggestion. Allowing the load balancer to fall back to another cluster would solve the problem. For the case above, the cluster configuration would be split into two clusters; the load balancer could then select the other cluster when one cluster is down and use that cluster's settings.

cc @duderino @htuch @PiotrSikora @mattklein123
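
A minimal sketch of that split, assuming illustrative cluster names proxy_mtls and proxy_plain; the fallback mechanism that would pick between the two clusters is exactly what this issue asks for and does not exist yet:

clusters:
  - name: proxy_mtls            # preferred path, mTLS inside the mesh
    type: strict_dns
    lb_policy: round_robin
    load_assignment:
      cluster_name: proxy_mtls
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: proxy
                port_value: 443
    tls_context: {}             # mTLS settings live only on this cluster
  - name: proxy_plain           # fallback path, plaintext
    type: strict_dns
    lb_policy: round_robin
    load_assignment:
      cluster_name: proxy_plain
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: proxy
                port_value: 80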

PiotrSikora commented 5 years ago

I'm pretty sure that's by design - if you want to use DNS, then stick to STRICT_DNS or LOGICAL_DNS clusters.

Going back to the real problem at hand (i.e. falling back to an external service if all local endpoints are down), I believe that the real solution is falling back to another cluster, not hacking EDS. Notably, the fallback service might have very different settings (TLS certificates, timeouts, to name a few) than the local endpoints, so we need the ability to configure those.

mattklein123 commented 5 years ago

+1 to what @PiotrSikora said. If you want DNS I would deliver a DNS cluster over CDS.

mattklein123 commented 5 years ago

This came up recently and I thought we already had an issue on this but I can't quickly find it.

I don't think there is going to be any easy way to do this, especially if you want support across both TCP and HTTP. For HTTP we could likely do this via some type of retry plugin and allow the router to target a different cluster on retry. For TCP I suppose we could do something similar also.

It might also be possible to somehow build an aggregate cluster type that could contain multiple clusters and failover between them? @snowp any thoughts here?

yxue commented 5 years ago

Are you referring to this issue, @mattklein123? https://github.com/envoyproxy/envoy/issues/7135#issue-451246689

I opened that issue to introduce opportunistic encryption. Due to security concerns, Piotr suggested closing it.

Between retry for HTTP/TCP and aggregate cluster, the latter seems to be more promising to me.

snowp commented 5 years ago

I think aggregate cluster + aggregate LB might be the easiest way to accomplish this, passing information in the LB context that can influence the decision of which cluster/lb to select. That way it would be fairly easy to reuse for multiple proxy filters (router, tcp_proxy, etc.)

htuch commented 5 years ago

Here's a wild-and-crazy idea: configure a listener on the Envoy process itself as a fallback in EDS, and have this listener TCP proxy to a DNS-discovered cluster. This comes at some operational and CPU cost, but avoids having to build a massively complicated aggregate cluster type and LB, while providing the flexibility to do complicated fallback for the rare cases where it is needed.
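
A rough sketch of that wiring; the loopback port 15443, the listener name, the cluster name, and the fallback hostname are all made up for illustration. The EDS response for the mesh cluster would then carry 127.0.0.1:15443 as a lower-priority endpoint, so traffic only reaches the loopback listener when the primary endpoints are down:

static_resources:
  listeners:
  - name: fallback_loopback
    address:
      socket_address:
        address: 127.0.0.1
        port_value: 15443
    filter_chains:
    - filters:
      - name: envoy.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy
          stat_prefix: fallback
          cluster: external_dns
  clusters:
  - name: external_dns
    type: strict_dns
    connect_timeout: 1s
    lb_policy: round_robin
    load_assignment:
      cluster_name: external_dns
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: fallback.example.com
                port_value: 443
    tls_context: {}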

htuch commented 5 years ago

So, let's say the loopback idea ^^ is not going to fly because of the CPU cost (and needing to overprovision to support this for fallback). How many other folks in the Envoy community would benefit from a tiered LB approach? It would be good to get some additional motivation.

snowp commented 5 years ago

We would have benefited from a tiered LB approach a while back during our rollout, to fall back to a non-health-checked VIP cluster after attempting to hit the actively health-checked service mesh cluster. The health-checking bit makes it difficult to express as retries within the cluster, since the fallback endpoints would also get health checked.

I'm not sure if we'd use it right now as we're mostly migrated at this point cc @mpuncel

mkeeler commented 5 years ago

This is something we are running into with Consul. We need a different upstream TLS context for some failover endpoints. Being able to fail over to another cluster would work just as well, allowing us to configure the TLS context for that cluster separately.

htuch commented 5 years ago

Will tiered LB handle the use case that @mkeeler mentions? Maybe the simplest thing to do would be to have cluster failover rather than LB tiering.

mkeeler commented 5 years ago

@htuch That sounds perfect. We are trying to map the concept of failover onto LB endpoints with priorities, but being able to just specify an alternative cluster to use for failover would be a much closer match to the concept we are trying to implement.

mattklein123 commented 5 years ago

The problem with "cluster failover" is that if we keep the clusters separate, we will need to build config/code logic into all of the users to handle failover to the other cluster. This is doable but seems sub-optimal. Another option would be to have some kind of failover list within a cluster itself, and some ability to grab hosts from the failover cluster, but at that point it seems cleaner to me to just build the aggregate cluster type.

mkeeler commented 5 years ago

For Consul, the root of the problem really is that the UpstreamTLS context has to be the same for all endpoints in the cluster.

Failing over to another cluster is just one possible solution (that happens to conceptually map really nicely to what we are trying to accomplish)

If there was some subset of endpoints used for failover, that would work too, so long as we could override the TLS context.

mattklein123 commented 5 years ago

If there was some subset of endpoints used for failover, that would work too, so long as we could override the TLS context.

I think this is feasible, especially with the subset LB. I think it would be nice to clarify the actual end user cases for everyone following this thread and we can go from there?

htuch commented 5 years ago

@mattklein123 I think there are multiple requirements here:

  1. Being able to fail over to other endpoints (and potentially override settings like the TLS context).
  2. Being able to fail over to a completely different service discovery mechanism (e.g. the endpoints usually come from EDS but the failover endpoints come from DNS).

(2) is a requirement coming from Istio based on real-world use cases (not sure how much I can share about specifics). (2) is also seemingly harder to do with just a failover endpoint subset.

yxue commented 5 years ago

@htuch I think Istio's case covers both of them. The background is that our customer is trying to fail over from an internal service to an external service.

  1. The internal service and the external service have different TLS contexts. The internal service enables mTLS but the external service doesn't;
  2. The external service endpoints come from DNS while the internal service endpoints are provided by EDS. It would be impractical to add the IPs of the endpoints, considering that service VMs can and will go up and down and the service's endpoint IP addresses will change frequently.

mattklein123 commented 5 years ago

From everything I've seen here I still think the aggregate cluster is going to be the cleanest way of doing this and will give us a lot of flexibility. IMO it wouldn't be that bad to implement either.

incfly commented 5 years ago

/cc @incfly

Even though I know https://github.com/envoyproxy/envoy/issues/7135 is closed, thinking from another perspective as Istio: if we allow TLS settings to be overridden for a subset of endpoints or a sub-cluster, best-effort mTLS adoption can be much simpler.

We would no longer blindly retry TLS and fall back to plaintext when it fails. The information from the xDS server can guide Envoy: if endpoint subset A is able to serve mTLS, we do OE; if not, Envoy does not bother making two TCP connection attempts.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

yxue commented 5 years ago

Working on the design doc of aggregate cluster

yxue commented 5 years ago

aggregate cluster design doc

Also have an implementation for this design. https://github.com/envoyproxy/envoy/pull/7967

cc @mattklein123 @htuch @snowp @duderino PTAL
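
For readers following along, here is a sketch of roughly what a cluster of the proposed aggregate type might look like; the referenced clusters primary_eds and fallback_dns are placeholders that would be defined separately (each with its own discovery type and TLS context), and the exact field names may differ from what is ultimately merged:

clusters:
- name: aggregate
  connect_timeout: 0.25s
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.aggregate
    typed_config:
      "@type": type.googleapis.com/envoy.config.cluster.aggregate.v2alpha.ClusterConfig
      clusters:
      # Listed in fallback order: traffic moves to the next cluster only when
      # the previous one has no healthy endpoints left.
      - primary_eds
      - fallback_dns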

htuch commented 5 years ago

Thanks @yxue, I've left a couple of comments on the doc.

rshriram commented 5 years ago

Here is a much simpler and more extensible proposal rather than adding yet another type of a cluster.

Extend the filter API in the cluster to be able to decide/mutate the settings that need to be used for a given endpoint. Settings include TLS settings, HTTP/2 settings, etc. This filter callback needs to be invoked before making the connection. This is all Envoy would have to do. I would not do anything else; otherwise it's additional baggage in Envoy core for no good reason.

We (in Istio) can then write a cluster-level filter that takes advantage of this functionality. We will keep the filter in a reusable module so that the Consul folks can also use it. Please don't add this aggregate cluster, which is confusing AF and further complicates the load balancing behavior of a cluster.

rshriram commented 5 years ago

And in terms of being able to support DNS endpoints in a cluster that otherwise has static IPs, we can do that detection in Istio and ship that entire cluster as a STRICT_DNS cluster. What Envoy would need to do is support locality LB for STRICT_DNS, and not just EDS [which is a useful thing in its own right].

lizan commented 5 years ago

Aside from the DNS issue (sorry, I wasn't aware of this issue before), overriding the TLS context (perhaps with a merged context) per endpoint was raised in the past, and I think that is the right way to go vs. defining different clusters and aggregating them. This is a more lightweight solution for the scenario where each endpoint has a different certificate with a different SAN, which I believe is the Consul issue and the reason people are also asking about wildcard matching of SANs.

incfly commented 5 years ago

overriding the TLS context (perhaps with a merged context) per endpoint was raised in the past, and I think that is the right way to go

Any pointer to that one? We will probably make a similar feature request as well.

lizan commented 5 years ago

@incfly I can't find the exact issue, but I think we discussed this in the context of SDS, certificate validation context, and secure naming, perhaps in one of those. There was also a request for overriding SNI per endpoint.

htuch commented 5 years ago

Folks, I think the TLS discussion is orthogonal. Can we get some clarity on whether there is an actual need to have clusters with completely different discovery types at different priorities (EDS vs. DNS)? This was what motivated aggregate clusters originally I believe.

yxue commented 5 years ago

Walmart's use case does include clusters with different discovery types: EDS for hosts inside the mesh and DNS for hosts outside the mesh. The requirement is to fail over from internal to external when the internal hosts are unhealthy.

lizan commented 5 years ago

@htuch right, so that comes down to my question in the doc: aside from TLS, can't that be addressed by the control plane rather than by adding a new type of cluster? The proposal sounds overkill with the insane math implied to address the specific issue.

yxue commented 5 years ago

The proposal sounds overkill with the insane math implied to address the specific issue.

@lizan, I haven't updated the doc yet. The load balancer will reuse the current load balancing algorithms.

htuch commented 5 years ago

@lizan I also suggested control plane early on, I think @duderino might be able to clarify why that wasn't considered the appropriate solution here.

rshriram commented 5 years ago

Per the original issue by @uk000, they needed to load balance traffic to k8s pod IPs plus DNS endpoints, wherein traffic to k8s pods should use mTLS while traffic to the DNS endpoints should use plaintext. So @htuch the TLS discussion is very much related to this: a cluster with different discovery types and different TLS blocks. And I am saying that the control plane can handle the different discovery types by simply using STRICT_DNS (if locality LB is implemented for it), while the selective TLS can be accomplished by the cluster-level filters.

snowp commented 5 years ago

@rshriram afaik locality lb has worked with STRICT_DNS for months (ever since load_assignment was added to the cluster), as most of this code is shared between all the cluster types. Are you seeing otherwise?

rshriram commented 5 years ago

Thanks for checking on that. I can confirm the same from the Pilot code as well (we have been using it for months). So, the entire issue boils down to doing selective TLS within the cluster. I then do not understand the rationale for doing this aggregate cluster thing.

duderino commented 5 years ago

@htuch the drawbacks from making the control plane do the resolution, as I recall, in order of convincingness:

  1. DNS resolvers may return different answers depending on where the DNS client is. If the control plane's DNS client isn't near the Envoy, it may get an answer that is inappropriate or even unroutable for that Envoy.
  2. General concerns about adding latency by sending the DNS resolution through the control plane to Envoy.
  3. I've lost track of whether the xDS updates would be incremental, but if not, that'd be a ton of work.

@rshriram if we used the filter api to do this, we would have to know in the filter when all the endpoints in a cluster are dead / overloaded / in general bad and then provide an alternate dest addr or, better, fqdn. Less essential but still desired is to change the TLS details on the alternate connection (potentially failing over from mTLS to TLS or even to the clear which ... well ...).

I am sympathetic to the objection that aggregate cluster is a lot of complexity for a simple request (I just want to failover to a fqdn darnit). I wonder though... If this use case is handled with some ad hoc mechanism, and other similar use cases are handled by different ad hoc mechanisms, would we end up with more complexity (both internal and the API) and weird interactions that test cases fail to cover?

rshriram commented 5 years ago

I was not suggesting doing DNS resolution from the control plane. I was suggesting using STRICT_DNS clusters, where DNS resolution is done by Envoy. Second, all the failover logic is already in place in the form of locality/priority LB. We don't have to do anything other than decide whether or not to use TLS when establishing a connection to an endpoint in the cluster.

duderino commented 5 years ago

@rshriram others were suggesting doing DNS resolution from the control plane so I was addressing that.

@flatmap13 and @uk000 are the originators of the fqdn fallback request. Pulling them in to get rid of the lossy filter effects.

uk000 commented 5 years ago

Just summarizing the discussion so far for my own understanding :) From an end-result viewpoint, if the Istio control plane configures an Envoy cluster (type STRICT_DNS) that has both internal k8s and external FQDN endpoints with appropriate priorities and localities, so that Envoy routes to the internal k8s endpoints with higher preference and then falls back to the external FQDN, and the external FQDN gets resolved via DNS (with caching etc.), that solves the routing aspect of the problem. Then comes the need for the same cluster to support different TLS contexts (mTLS vs. TLS vs. plain) for different endpoints. @rshriram's proposal of delegating the TLS context decision to a filter has the benefit of not adding additional complexity until really needed. OTOH, @duderino's thought has merit that support for hybrid clusters can provide flexibility for any complex future needs as well.

mattklein123 commented 5 years ago

From my perspective, given that the aggregate cluster is an extension and not being built into the core, I don't really have a strong opinion on which way you all go on this topic. The aggregate cluster type with an extremely simple policy to choose between the clusters seems a reasonable and generic solution but if there is a simpler solution to your problem that works also.

incfly commented 5 years ago

Just FYI, I filed https://github.com/envoyproxy/envoy/issues/8016 separately to track the proposal of customizing per-endpoint TLS configuration, delegating that decision to a cluster filter extension.

(Based on my understanding that's a useful feature per the comments above, but by itself it may not be enough to solve all the problems w.r.t. falling back to remote endpoints.)

duderino commented 5 years ago

@rshriram I'm not sure this boils down to selective TLS / per endpoint TLS configuration within a cluster. I think @flatmap13 and @uk000 want to prefer local k8s endpoints (mTLS) and if all of those are down failover to a remote fqdn (TLS or even clear).

Can that be addressed using selective TLS / per endpoint TLS config alone?

rshriram commented 5 years ago

I think @flatmap13 and @uk000 want to prefer local k8s endpoints (mTLS) and if all of those are down failover to a remote fqdn (TLS or even clear).

This is already addressed using the existing locality based failover.

howardjohn commented 5 years ago

This is already addressed using the existing locality based failover.

This does not address the mtls issue though, right?

If we do this with a STRICT_DNS cluster we lose all other load balancing functionality, right? Because we would return [foo.svc.cluster.local, foo.external.com], which would resolve to just the IP of the Kubernetes svc rather than all of the endpoints. Maybe we could instead send all of the endpoints instead of the service name, but I suspect that would have substantial performance implications.

duderino commented 5 years ago

This is already addressed using the existing locality based failover.

This does not address the mtls issue though, right?

It does not, but if that's the only limitation it could be addressed by @incfly's simpler proposal at #8016

If we do this with a STRICT_DNS cluster we lose all other load balancing functionality, right? Because we would return [foo.svc.cluster.local, foo.external.com], which would resolve to just the IP of the Kubernetes svc rather than all of the endpoints. Maybe we could instead send all of the endpoints instead of the service name, but I suspect that would have substantial performance implications.

I think it was this limitation that took us down the aggregate cluster path. Thanks @howardjohn. @rshriram wdyt?

rshriram commented 5 years ago

You don't have to do DNS resolution of the k8s service.

cluster:
 name: outbound|foo.bar.svc.cluster.local
 type: strict_dns
 hosts:
 - 1.1.1.1 [pod IP], mtlsReady
 - 2.2.2.2 [pod IP], mtlsReady
 - 3.3.3.3 [pod IP], mtlsReady
 - myvm.gcp.com [VM dns name]

The only piece you need for the above is locality-aware routing. We do not have to provide the k8s service name as a hostname in the strict_dns cluster; we can send the pod IPs.
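
The same idea expressed with load_assignment, so that a priority can mark the VM endpoint as the fallback; the port 8080 and the priority assignments are illustrative, and note that the mtlsReady annotation has no equivalent here, which is the per-endpoint TLS gap discussed above:

cluster:
  name: outbound|foo.bar.svc.cluster.local
  type: strict_dns
  lb_policy: round_robin
  load_assignment:
    cluster_name: outbound|foo.bar.svc.cluster.local
    endpoints:
    - priority: 0            # pod IPs inside the mesh, preferred
      lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 1.1.1.1
              port_value: 8080
      - endpoint:
          address:
            socket_address:
              address: 2.2.2.2
              port_value: 8080
    - priority: 1            # VM DNS name, used when the pods are unhealthy
      lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: myvm.gcp.com
              port_value: 8080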

howardjohn commented 5 years ago

Well we would need to test it, but I suspect this would have substantial performance impact -- otherwise, why would we ever use EDS instead of just sending IPs in the cluster?

snowp commented 5 years ago

The reason to use EDS instead of a static list of IPs with CDS is that whenever the CDS definition changes the cluster will be rebuilt, which requires reestablishing connections (including TLS setup etc.) to all the upstreams. This can lead to a significant spike in CPU usage on the server side as many clients reconnect at the same time.

This is very noticeable when using active health checking, but can also happen when active health checking is not used if enough downstreams are sending traffic to the cluster that is being changed.

rshriram commented 5 years ago

Yes, and if this is going to happen for each and every service, then it's a different problem. If this heterogeneous mixture is just for a set of services, then this is a needless optimization, not to mention a trade-off of unnecessary complexity and cognitive overhead for end users.

And to add to @snowp's point, we use EDS primarily because we can ship just the endpoints when they change [incremental] rather than the entire cluster. And it is absolutely fine for someone to pay this CPU price in exchange for the valuable functionality of being able to mix VMs with k8s pods in an Envoy cluster, to aid transition.

lizan commented 5 years ago

Re: mixing DNS names in EDS, there is a previous issue, https://github.com/envoyproxy/envoy/issues/4290; setting the address to a DNS name with resolver_name should be supported, and wrapping them in LocalityLbEndpoints will do what you want to do, I think.
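
A sketch of what such an EDS response might look like, assuming a custom resolver registered with Envoy under the made-up name custom_dns; the hostname, port, and priority are also illustrative:

cluster_name: proxy
endpoints:
- priority: 1
  lb_endpoints:
  - endpoint:
      address:
        socket_address:
          address: fallback.example.com
          port_value: 443
          resolver_name: custom_dns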