Closed: DerekTBrown closed this issue 6 months ago.
Just curious why you don't go "gRPC Client pod" to "xDS management Server" directly?
Thanks for bringing this up @DerekTBrown!
I've got a couple of bits of information for you, and a couple of questions.
Firstly, the way that Envoy and the Cilium agent currently use sockets for communication is by design; it helps to manage the risk that we accept in having a per-node proxy (we definitely think it's worthwhile, but it is something we need to manage). Not having the xDS server listening on a network socket means that we can more easily control where that information is surfaced, and can more safely leave the xDS traffic itself in the clear (since it only passes over the domain socket).
Exposing the xDS control plane on a network address instead has a few complications:
Okay, so that's all to explain why things are currently in "use a socket for xDS" mode. Again, this is not a hard requirement, but the tradeoffs of swapping mean that the resultant extra feature would need to be pretty valuable for a lot of folks for us to focus on it.
Next, the questions.
In general, for Cilium Service Mesh, we've been working on making many Service Mesh functions be transparently handled by Cilium. It seems to me like the gRPC xDS support is trading off client complexity for lower latency (by having the client manage the load balancing to avoid a proxy hop). Do you think that's a fair characterization?
I ask because I currently don't see that the tradeoff there is worth it. That's not to say I'm right, or that I am not interested in discussing this more, but my two biggest concerns are:
To be clear, it's a really neat idea, and I can see that low-latency use cases would be willing to take the tradeoffs. But it seems like a lot of things to support for something that I haven't heard anyone ask for yet.
I'd be very interested to hear more about your thoughts, please feel free to reach out here or on Slack, or come to a community meeting (and ping me beforehand so I can ensure I'm awake).
Thanks @youngnick for your thorough and thoughtful response! I am planning to attend the Cilium community meeting tomorrow morning, so that hopefully we can bounce some ideas around; I am still early in the process of thinking about how Cilium, xDS, and gRPC could fit together.
To perhaps rewind a bit, there are a few reasons that I have been exploring the xDS gRPC client's integration with Cilium:
Latency/Overhead/Complexity - As you mentioned, the idea of being able to skip the Envoy hop is appealing. It is really hard for me to get a sense of how significant this improvement would be (if the Envoy literature is to be believed, the improvement would be <1ms; if the Linkerd literature is accurate, we could be talking about >50ms). I would be especially interested to hear from you (since I am guessing you have a bunch of experience with different Cilium/Envoy installs) whether this latency is significant enough that investment in proxyless xDS functionality is worthwhile, or whether this is an over-optimization.
Beyond the latency/cost benefit, I think there could be a complexity benefit if people elected to use xDS-enabled clients instead of Envoy.
Client-to-server Encryption - As I understand it, Cilium/Envoy can only provide full L7 routing capabilities if traffic arrives at the proxy unencrypted. The current paradigm in our clusters is to perform client-to-server encryption in the gRPC client itself, meaning that Cilium/Envoy can't introspect requests to provide L7 capabilities. gRPC's xDS capabilities were one option we were considering to gain L7 functionality without having to disable TLS in gRPC.
Observability/Debuggability/Extensibility - One interesting aspect of client-side xDS functionality is that the client is aware of the L7 decisions it is making. This makes it potentially easier for a service owner to collect, observe, and debug service routing behavior. I think it also becomes easier for folks to extend xDS functionality on the client side, because such changes can be made in a fork/wrapper/different implementation of a particular client, as opposed to having to be made and merged into Cilium + Envoy.
Beyond the "proxyless" xDS mode, I am also interested in learning more generally about Cilium's plans for xDS. The way Cilium leverages xDS is very cool, but also very unorthodox (by having agents expose the xDS API, as opposed to a central control plane). I want to understand how this fits in with the general xDS ecosystem (users bring their own xDS APIs to provide features). For instance: if I want to implement my own xDS service, do I pipe that through Cilium Agents, do I inject that into the Envoy config, or is it breaking the Cilium interface to touch xDS configuration at all?
> Just curious why you don't go "gRPC Client pod" to "xDS management Server" directly?
In my vision of how this might work, Cilium Agent would implement a subset of the xDS APIs (e.g., EDS, CDS, RDS) itself, and then proxy, cache, and multiplex requests to upstream xDS APIs (e.g., a rate-limiting service).
Enable Proxyless gRPC Connections
Background: What is xDS?
xDS is a generic API specification that allows clients to query a service mesh control plane for a variety of common resources, such as clusters, routes, and endpoints, and is extensible to include other features such as load reporting, health checking, and rate limiting.
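As a concrete illustration, a client subscribes to a resource type by sending a `DiscoveryRequest` naming a type URL, and the server replies with a `DiscoveryResponse` carrying the matching resources. A request for all clusters looks roughly like this (the node identifiers are placeholders):

```json
{
  "node": { "id": "example-node", "cluster": "example-cluster" },
  "resource_names": [],
  "type_url": "type.googleapis.com/envoy.config.cluster.v3.Cluster"
}
```

An empty `resource_names` list means "subscribe to everything of this type"; listing specific names narrows the subscription.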
The primary use case for xDS is dynamic Envoy configuration. When configured, Envoy contacts one or more xDS servers to fetch these dynamic resources so that it can appropriately route traffic. Under the hood, this is how Cilium configures Envoy when L7 load balancing is enabled (Code).
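For reference, a minimal Envoy bootstrap that pulls dynamic resources over ADS from an xDS server listening on a Unix domain socket looks roughly like this (the cluster name and socket path are illustrative, not Cilium's actual values):

```yaml
dynamic_resources:
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc:
          cluster_name: xds-grpc-cilium    # illustrative name
static_resources:
  clusters:
    - name: xds-grpc-cilium
      connect_timeout: 2s
      type: STATIC
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}     # xDS is served over gRPC, so HTTP/2 is required
      load_assignment:
        cluster_name: xds-grpc-cilium
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    pipe:
                      path: /var/run/cilium/xds.sock   # illustrative path
```

Note the `pipe` address: because the management server is reached over a domain socket rather than a network address, the xDS traffic never leaves the node.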
What is proxyless gRPC?
gRPC has added functionality to contact an xDS server directly, so that gRPC clients can route requests without needing to go through an L7 proxy (gRPC Docs). When configured, gRPC contacts the xDS server to fetch routing information, and then resolves and routes requests fully within the client (without the need for DNS or an L7 proxy). This architecture enables applications to operate with lower latency and less overhead for the cluster overall.
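Mechanically, the application dials an `xds:///` target, and the gRPC library locates the management server through a bootstrap file referenced by the `GRPC_XDS_BOOTSTRAP` environment variable. A minimal bootstrap file looks roughly like this (the server URI and node ID are placeholders):

```json
{
  "xds_servers": [
    {
      "server_uri": "xds-server.example.com:443",
      "channel_creds": [{ "type": "insecure" }],
      "server_features": ["xds_v3"]
    }
  ],
  "node": {
    "id": "example-node-id",
    "locality": { "zone": "example-zone" }
  }
}
```

With this in place, the client fetches listeners, routes, clusters, and endpoints itself and load-balances per the control plane's configuration, with no proxy in the data path.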
Proxyless gRPC is currently supported by Istio and Google's Traffic Director.
How would Cilium enable proxyless gRPC?
Cilium agents currently have a lightweight implementation of the xDS protocol in order to configure Envoy (Code). However, this xDS server is currently exposed only via a Unix domain socket, preventing external clients from accessing xDS (Code).
With a slight adjustment, the Cilium agent could be configured to expose this xDS server via a `hostPort`, enabling gRPC clients to contact the xDS server via the node's IP address. gRPC clients could then route requests directly according to the Cilium control plane, without the need for L7 load balancing.

_Caveat: We may need to extend the Cilium xDS implementation to include ADS (Aggregated Discovery Service) to be compatible with the gRPC xDS implementation and to be more performant._
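Concretely, the change could be as small as adding a port entry to the cilium-agent DaemonSet's container spec; the port name and number here are hypothetical:

```yaml
# Fragment of the cilium-agent container spec (hypothetical port)
ports:
  - name: xds
    containerPort: 15010
    hostPort: 15010
    protocol: TCP
```

Client pods would then reach the agent's xDS server at the node IP on that port, which they can discover via the Downward API (`status.hostIP`).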
Cilium's xDS implementation could additionally be extended to retrieve and cache xDS resources from upstream management servers:
This would allow gRPC clients to receive the full features of an L7 service mesh without the need for an L7 proxy or cross-node look-asides.
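The cache-and-fan-out behavior described above can be sketched as follows. This is purely illustrative (a real agent speaks the xDS gRPC streaming protocol rather than this simplified in-memory API), but it shows the core idea: one upstream subscription per resource type, with cached state replayed to late subscribers and updates multiplexed to all of them.

```python
import threading
from typing import Dict, List

class Watcher:
    """Represents one downstream client's subscription (sketch only)."""
    def __init__(self):
        self.updates: list = []

class XdsCache:
    """Caches resources fetched from upstream xDS servers and fans
    updates out to many downstream subscribers (sketch only)."""

    def __init__(self):
        self._lock = threading.Lock()
        # type_url -> resource name -> resource (treated as opaque payloads)
        self._resources: Dict[str, Dict[str, dict]] = {}
        # type_url -> downstream subscribers
        self._watchers: Dict[str, List[Watcher]] = {}

    def subscribe(self, type_url: str) -> Watcher:
        """Register a downstream client's interest in a resource type."""
        w = Watcher()
        with self._lock:
            self._watchers.setdefault(type_url, []).append(w)
            # Replay the current cached snapshot to the new subscriber,
            # like an xDS server serving initial state on a new stream.
            for name, res in self._resources.get(type_url, {}).items():
                w.updates.append((name, res))
        return w

    def on_upstream_update(self, type_url: str, name: str, resource: dict):
        """Called when an upstream management server pushes a new version."""
        with self._lock:
            self._resources.setdefault(type_url, {})[name] = resource
            # Multiplex the single upstream update to every subscriber.
            for w in self._watchers.get(type_url, []):
                w.updates.append((name, resource))

CDS = "type.googleapis.com/envoy.config.cluster.v3.Cluster"

cache = XdsCache()
early = cache.subscribe(CDS)
cache.on_upstream_update(CDS, "backend", {"lb_policy": "ROUND_ROBIN"})
late = cache.subscribe(CDS)  # late subscriber still sees cached state
print(len(early.updates), len(late.updates))  # -> 1 1
```

The point of the cache is that N client pods on a node cost the upstream management server only one subscription, which is what makes per-node fan-out cheaper than every pod dialing the control plane directly.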
Note: There is no reason this feature set is restricted to gRPC. Users could implement clients for other protocols (e.g., HTTP) that are configured via xDS as well.
How does this fit into the overall Cilium project roadmap?
Cilium's architecture makes it the go-to service mesh for users who prioritize latency, performance, and efficiency (for instance, by leveraging eBPF and having a sidecar-less design). When it comes to implementing L7 features, performance-minded users will want to leverage proxyless gRPC to avoid the performance costs of using Envoy (Performance comparison of Envoy and Proxyless gRPC).