I'm pretty sure that's by design - if you want to use DNS, then stick to STRICT_DNS or LOGICAL_DNS clusters.
Going back to the real problem at hand (i.e. falling back to an external service if all local endpoints are down), I believe the real solution is to fall back to another cluster, not to hack EDS. Notably, the fallback service might have very different settings (TLS certificates, timeouts, to name a few) than the local endpoints, so we need the ability to configure those.
+1 to what @PiotrSikora said. If you want DNS I would deliver a DNS cluster over CDS.
This came up recently and I thought we already had an issue on this but I can't quickly find it.
I don't think there is going to be any easy way to do this, especially if you want support across both TCP and HTTP. For HTTP we could likely do this via some type of retry plugin and allow the router to target a different cluster on retry. For TCP I suppose we could do something similar also.
It might also be possible to somehow build an aggregate cluster type that could contain multiple clusters and failover between them? @snowp any thoughts here?
Are you referring to this issue? @mattklein123 https://github.com/envoyproxy/envoy/issues/7135#issue-451246689
I opened that issue to introduce opportunistic encryption. Due to security concerns, Piotr suggested closing it.
Between retry for HTTP/TCP and aggregate cluster, the latter seems to be more promising to me.
I think aggregate cluster + aggregate LB might be the easiest way to accomplish this, passing information in the LB context that can influence the decision of which cluster/lb to select. That way it would be fairly easy to reuse for multiple proxy filters (router, tcp_proxy, etc.)
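For concreteness, here is a hedged sketch of what such an aggregate cluster might look like; the member cluster names are made up, and the extension/config shape is an assumption rather than a settled design:

```yaml
clusters:
- name: failover_aggregate
  connect_timeout: 0.25s
  # The aggregate cluster provides its own LB that picks among the
  # member clusters, so the standard LB policy is not used directly.
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.aggregate
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig
      # Ordered by preference: use the EDS mesh cluster while it has
      # healthy hosts, otherwise fall back to the DNS cluster.
      clusters:
      - local_mesh_eds
      - external_fallback_dns
```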
Here's a wild-and-crazy idea: configure a listener on the Envoy process itself as a fallback in EDS, and have this listener TCP proxy to a DNS-discovered cluster. This comes at some operational and CPU cost, but avoids having to build a massively complicated aggregate cluster type and LB, while providing the flexibility to do complicated fallback for the rare cases where it is needed.
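A hedged sketch of that loopback idea, with made-up names and ports; the EDS response for the main cluster would then list 127.0.0.1:15099 as a lowest-priority endpoint:

```yaml
static_resources:
  listeners:
  - name: dns_fallback_loopback
    address:
      socket_address: { address: 127.0.0.1, port_value: 15099 }  # illustrative port
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: dns_fallback
          cluster: external_fallback_dns
  clusters:
  - name: external_fallback_dns
    type: LOGICAL_DNS
    connect_timeout: 1s
    load_assignment:
      cluster_name: external_fallback_dns
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: foo.external.com, port_value: 443 }  # illustrative FQDN
```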
So, let's say the loopback idea ^^ is not going to fly because of the CPU cost (and needing to overprovision to support this for fallback). How many other folks in the Envoy community would benefit from a tiered LB approach? It would be good to get some additional motivation.
We would have benefited from a tiered LB approach a while back during our rollout, to fall back to a non-health-checked VIP cluster after attempting to hit the actively health-checked service mesh cluster. The health checking bit makes it difficult to express as retries within the cluster, since the fallback endpoints would also get health checked.
I'm not sure if we'd use it right now as we're mostly migrated at this point cc @mpuncel
This is something we are running into with Consul. We need a different upstream TLS context for some failover endpoints. Being able to fail over to another cluster instead would work just as well, allowing us to configure the TLS context for that cluster separately.
Will tiered LB handle the use case that @mkeeler mentions? Maybe the simplest thing would be to have cluster failover rather than LB tiering.
@htuch That sounds perfect. We are trying to map the concept of failover into LB endpoints with priorities but being able to just specify an alternative cluster to use for failover would be a much closer match to the concept we are trying to implement.
The problem with "cluster failover" is that if we keep the clusters separate, we will need to build config/code logic in all of the users to handle failover to the other cluster. This is doable but seems sub-optimal. Another option would be to have some kind of failover list within a cluster itself, and some ability to grab hosts from the failover cluster, but at that point it seems cleaner to me to just build the aggregate cluster type.
For Consul, the root of the problem really is that the UpstreamTLS context has to be the same for all endpoints in the cluster.
Failing over to another cluster is just one possible solution (that happens to conceptually map really nicely to what we are trying to accomplish)
If there was some subset of endpoints used for failover, that would work too, so long as we could override the TLS context.
> If there was some subset of endpoints used for failover, that would work too, so long as we could override the TLS context.
I think this is feasible, especially with the subset LB. I think it would be nice to clarify the actual end user cases for everyone following this thread and we can go from there?
@mattklein123 I think there are multiple requirements here:
(2) is a requirement coming from Istio based on real-world use cases (not sure how much I can share about specifics). (2) is also seemingly harder to do with just a failover endpoint subset.
@htuch I think Istio's case covers both of them. The background is that our customer is trying to fail over from an internal service to an external service.
From everything I've seen here I still think the aggregate cluster is going to be the cleanest way of doing this and will give us a lot of flexibility. IMO it wouldn't be that bad to implement either.
/cc @incfly
Even though I know https://github.com/envoyproxy/envoy/issues/7135 is closed, thinking from another perspective (Istio's), if we allow TLS settings to be overridden for a subset of endpoints or a sub-cluster, best-effort mTLS adoption can be much simpler.
We would no longer blindly retry TLS and fall back to plaintext if it fails. Guidance from the xDS server can tell Envoy that endpoint subset A is able to serve mTLS, so we do OE; if not, Envoy does not bother making two TCP connection attempts.
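For illustration, one possible shape for that kind of xDS-guided, per-endpoint TLS selection is matching a transport socket against endpoint metadata (along the lines of Envoy's cluster `transport_socket_matches`); this is a hedged sketch, and the metadata key/values and secret name are assumptions:

```yaml
cluster:
  name: outbound|foo.bar.svc.cluster.local
  transport_socket_matches:
  # Endpoints whose metadata under "envoy.transport_socket_match"
  # carries tlsMode: istio are dialed with mTLS.
  - name: mtls-capable
    match:
      tlsMode: istio
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        common_tls_context:
          tls_certificate_sds_secret_configs:
          - name: default  # illustrative SDS secret name
  # Everything else (e.g. an external fallback endpoint) stays plaintext.
  - name: plaintext
    match: {}
    transport_socket:
      name: envoy.transport_sockets.raw_buffer
```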
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
Working on the design doc for the aggregate cluster.
Also have an implementation for this design. https://github.com/envoyproxy/envoy/pull/7967
cc @mattklein123 @htuch @snowp @duderino PTAL
Thanks @yxue, I've left a couple of comments on the doc.
Here is a much simpler and more extensible proposal than adding yet another type of cluster.
Extend the filter API in the cluster to be able to decide/mutate the settings that need to be used for a given endpoint. Settings include TLS settings, HTTP/2 settings, etc. This filter callback needs to be invoked before making the connection. This is all Envoy would have to do. I would not do anything else, else it's additional baggage in Envoy core for no good reason.
We (in Istio) can then write a cluster-level filter that takes advantage of this functionality. We will keep the filter in a reusable module so that the Consul folks can also use it. Please don't add this aggregate cluster, which is confusing AF and further complicates the load balancing behavior of a cluster.
And in terms of being able to support DNS endpoints in a cluster that otherwise has static IPs, we can do that detection in Istio and ship that entire cluster as a STRICT_DNS cluster. What Envoy would need to do is support locality LB for STRICT_DNS, and not just EDS [which is a useful thing in its own right].
Aside from the DNS issue (sorry, I wasn't aware of this issue before), overriding the TLS context (perhaps with a merge context) per endpoint was raised in the past, and I think that is the right way to go vs. defining different clusters and aggregating them. It is a more lightweight solution for the scenario where each endpoint has a different certificate with a different SAN, which I believe is the Consul issue and the reason people are also asking about wildcard matching of SANs.
> overriding the TLS context (perhaps with a merge context) per endpoint was raised in the past, and I think that is the right way to go
Any pointer to that one? We will probably make a similar feature request as well.
@incfly I can't find the exact issue, but I think we discussed this in the context of SDS, certificate validation context and secure naming; perhaps it is in one of them. There was also a request for overriding SNI per endpoint.
Folks, I think the TLS discussion is orthogonal. Can we get some clarity on whether there is an actual need to have clusters with completely different discovery types at different priorities (EDS vs. DNS)? This was what motivated aggregate clusters originally I believe.
Walmart's use case does include clusters with different discovery types: EDS for hosts inside the mesh and DNS for hosts outside the mesh. The requirement is failing over from internal to external when the internal hosts are unhealthy.
@htuch right, so that comes down to my question in the doc: aside from TLS, can't that be addressed by the control plane rather than adding a new type of cluster? The proposal sounds like overkill, with the insane math implied, to address the specific issue.
> The proposal sounds like overkill, with the insane math implied, to address the specific issue.
@lizan, I haven't updated the doc yet. The load balancer will reuse the current load balancing algorithm.
@lizan I also suggested control plane early on, I think @duderino might be able to clarify why that wasn't considered the appropriate solution here.
Per the original issue by @uk000, they needed to load balance traffic to k8s pod IPs plus DNS endpoints, where traffic to k8s pods should use mTLS while traffic to the DNS endpoints should use plaintext. So @htuch the TLS discussion is very much related to this: a cluster with different discovery types and different TLS blocks. And I am saying that the control plane can handle the different discovery types by simply using STRICT_DNS (if locality LB is implemented for it), while the selective TLS can be accomplished by the cluster-level filters.
@rshriram afaik locality lb has worked with STRICT_DNS for months (ever since load_assignment was added to the cluster), as most of this code is shared between all the cluster types. Are you seeing otherwise?
Thanks for checking on that. I can confirm the same from Pilot code as well (we have been using it for months). So the entire issue boils down to doing selective TLS within the cluster. I then do not understand the rationale for doing this aggregate cluster thing.
@htuch the drawbacks of making the control plane do the resolution, as I recall, in order of convincingness:
1) DNS resolvers may return different answers depending on where the DNS client is. If the control plane's DNS client isn't near the Envoy, it may get an answer that is inappropriate or even unroutable for the Envoy.
2) General concerns about adding latency by sending the DNS resolution through the control plane to Envoy.
3) I've lost track of whether the xDS updates would be incremental, but if not that'd be a ton of work.
@rshriram if we used the filter API to do this, we would have to know in the filter when all the endpoints in a cluster are dead/overloaded/in general bad, and then provide an alternate destination address or, better, FQDN. Less essential but still desired is to change the TLS details on the alternate connection (potentially failing over from mTLS to TLS or even to the clear, which ... well ...).
I am sympathetic to the objection that aggregate cluster is a lot of complexity for a simple request (I just want to failover to a fqdn darnit). I wonder though... If this use case is handled with some ad hoc mechanism, and other similar use cases are handled by different ad hoc mechanisms, would we end up with more complexity (both internal and the API) and weird interactions that test cases fail to cover?
I was not suggesting doing DNS resolution from the control plane. I was suggesting using STRICT_DNS clusters, where DNS resolution is done by Envoy. Second, all the failover logic is already in place in the form of locality/priority LB. We don't have to do anything other than deciding whether or not we should use TLS when establishing a connection to an endpoint in the cluster.
@rshriram others were suggesting doing DNS resolution from the control plane so I was addressing that.
@flatmap13 and @uk000 are the originators of the fqdn fallback request. Pulling them in to get rid of the lossy filter effects.
Just summarizing the discussion so far for my own understanding :) From an end-result viewpoint: if the Istio control plane configures an Envoy cluster (type STRICT_DNS) that has both internal k8s and external FQDN endpoints with appropriate priorities and localities, so that Envoy routes to the internal k8s endpoints with higher preference and then falls back to the external FQDN, and the external FQDN gets resolved via DNS (with caching etc.), that solves the routing aspect of the problem. Then comes the need for the same cluster to support different TLS contexts (mTLS vs. TLS vs. plain) for different endpoints. @rshriram's proposal of delegating the TLS context decision to a filter has the benefit of not adding additional complexity until really needed. OTOH, @duderino's thought has merit that support for hybrid clusters can provide flexibility for any complex future needs as well.
From my perspective, given that the aggregate cluster is an extension and not built into the core, I don't really have a strong opinion on which way you all go on this topic. The aggregate cluster type with an extremely simple policy to choose between the clusters seems like a reasonable and generic solution, but if there is a simpler solution to your problem, that works also.
Just FYI, I filed https://github.com/envoyproxy/envoy/issues/8016 separately to track the proposal of customize per endpoint TLS configuration, delegating that decision to a cluster filter extension.
(Based on my understanding that's a useful feature per the above comments, but by itself it may not be enough to solve all the problems w.r.t. falling back to a remote endpoint.)
@rshriram I'm not sure this boils down to selective TLS / per endpoint TLS configuration within a cluster. I think @flatmap13 and @uk000 want to prefer local k8s endpoints (mTLS) and if all of those are down failover to a remote fqdn (TLS or even clear).
Can that be addressed using selective TLS / per endpoint TLS config alone?
> I think @flatmap13 and @uk000 want to prefer local k8s endpoints (mTLS) and if all of those are down failover to a remote fqdn (TLS or even clear).
This is already addressed using the existing locality based failover.
> This is already addressed using the existing locality based failover.
This does not address the mTLS issue though, right?
If we do this with a STRICT_DNS cluster we lose all other load balancing functionality, right? Because we would return [foo.svc.cluster.local, foo.external.com], which would resolve to just the IP of the Kubernetes svc rather than all of the endpoints. Maybe we could instead send all of the endpoints instead of the service name, but I suspect that would have substantial performance implications.
> This is already addressed using the existing locality based failover.
> This does not address the mTLS issue though, right?
It does not, but if that's the only limitation it could be addressed by @incfly's simpler proposal at #8016
> If we do this with a STRICT_DNS cluster we lose all other load balancing functionality, right? Because we would return [foo.svc.cluster.local, foo.external.com], which would resolve to just the IP of the Kubernetes svc rather than all of the endpoints. Maybe we could instead send all of the endpoints instead of the service name, but I suspect that would have substantial performance implications.
I think it was this limitation that took us down the aggregate cluster path. Thanks @howardjohn. @rshriram wdyt?
You don't have to do DNS resolution of the k8s service:
cluster:
  name: outbound|foo.bar.svc.cluster.local
  type: STRICT_DNS
  hosts:
  - 1.1.1.1       # pod IP, mTLS-ready
  - 2.2.2.2       # pod IP, mTLS-ready
  - 3.3.3.3       # pod IP, mTLS-ready
  - myvm.gcp.com  # VM DNS name
The only piece you need for the above is locality-aware routing. We do not have to provide the k8s service name as a hostname in the STRICT_DNS cluster; we can send the pod IPs.
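Spelled out as a full load_assignment with priority-based failover, a hedged version of the sketch above might look like this (ports are illustrative, and the mTLS-readiness annotation is omitted since plain cluster config has no field for it):

```yaml
cluster:
  name: outbound|foo.bar.svc.cluster.local
  type: STRICT_DNS
  connect_timeout: 1s
  load_assignment:
    cluster_name: outbound|foo.bar.svc.cluster.local
    endpoints:
    - priority: 0   # in-mesh pod IPs, preferred
      lb_endpoints:
      - endpoint: { address: { socket_address: { address: 1.1.1.1, port_value: 8080 } } }
      - endpoint: { address: { socket_address: { address: 2.2.2.2, port_value: 8080 } } }
      - endpoint: { address: { socket_address: { address: 3.3.3.3, port_value: 8080 } } }
    - priority: 1   # VM DNS name, used only when priority 0 is unhealthy
      lb_endpoints:
      - endpoint: { address: { socket_address: { address: myvm.gcp.com, port_value: 8080 } } }
```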
Well we would need to test it, but I suspect this would have substantial performance impact -- otherwise, why would we ever use EDS instead of just sending IPs in the cluster?
The reason to use EDS instead of a static list of IPs with CDS is that whenever the CDS definition changes the cluster will be rebuilt, which requires reestablishing connections (including TLS setup etc.) to all the upstreams. This can lead to a significant spike in CPU usage on the server side as many clients reconnect at the same time.
This is very noticeable when using active health checking, but can also happen when active health checking is not used if enough downstreams are sending traffic to the cluster that is being changed.
Yes, and if this is going to happen for each and every service, then it's a different problem. If this heterogeneous mixture is just for a set of services, then this is needless optimization, not to mention trading it off for unnecessary complexity and cognitive overhead on end users.
And to add to @snowp's point, we use EDS primarily because we can ship just the endpoints when they change [incremental] rather than the entire cluster. And it is absolutely fine for someone to pay this CPU price in exchange for the valuable functionality of being able to mix VMs with k8s pods in an Envoy cluster, to aid transition.
Re: mixing DNS names in EDS, there is a previous issue https://github.com/envoyproxy/envoy/issues/4290, and setting the address to a DNS name with resolver_name should be supported; wrapping them in LocalityLbEndpoints will do what you want, I think.
When using EDS to update the endpoint configuration dynamically, DNS resolution is not allowed for hostnames in the EDS response; as the comment says, DNS is not allowed for a custom resolver. When adding some external services to the load balancing endpoints via EDS, it would be impractical to add the IPs of the endpoints, considering service VMs can and will go up and down and the service's endpoint IP addresses will change frequently. Supporting hostnames in the EDS response seems like a reasonable solution. I'm not sure whether not supporting DNS resolution for hostnames is by design or an Envoy restriction. Please let me know, and I can help work on this feature if Envoy needs it.

In Istio, we use the priority field to implement failover logic. When the endpoints at a higher priority are down, the load balancer will select the endpoints with a lower priority. This assumes that all the endpoints have the same settings (e.g. TLS context), but sometimes they may differ. For example, for an external fallback service mTLS is not required, but inside the service mesh mTLS is required. If the external service endpoints and internal service endpoints are added into one cluster, the traffic to the external endpoints will be broken.
Downstream Envoy setting:
Upstream listener setting:
When `proxy:443` is down, the traffic to `proxy:80` will be broken as well, because `proxy:80` doesn't support mTLS.

Thanks @PiotrSikora for the solution. Allowing the load balancer to fall back to another cluster would solve the problem. For the above case, split the configuration into two clusters, and the load balancer can select the other cluster when one cluster is down, using that cluster's settings.
cc @duderino @htuch @PiotrSikora @mattklein123
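As an illustration of that two-cluster split, here is a hedged sketch (names, ports, and the SDS secret are made up) of an internal mTLS cluster plus an external plaintext fallback cluster that a failover mechanism could then choose between:

```yaml
clusters:
- name: internal_mesh            # in-mesh endpoints delivered over EDS, mTLS required
  type: EDS
  connect_timeout: 1s
  eds_cluster_config:
    eds_config: { ads: {} }
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
      common_tls_context:
        tls_certificate_sds_secret_configs:
        - name: default          # illustrative SDS secret name
- name: external_fallback        # external service resolved via DNS, plaintext
  type: LOGICAL_DNS
  connect_timeout: 1s
  load_assignment:
    cluster_name: external_fallback
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: proxy.example.com, port_value: 80 }  # illustrative
```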