cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

CFP: Enable Proxyless gRPC Connections to xDS #30189

Closed DerekTBrown closed 6 months ago

DerekTBrown commented 8 months ago

Enable Proxyless gRPC Connections

Background: What is xDS?

xDS is a generic API specification that allows clients to query a service mesh control plane for common resources such as clusters, routes, and endpoints. It is extensible to other features such as load reporting, health checking, and rate limiting.

The primary use case for xDS is dynamic Envoy configuration. When configured, Envoy contacts one or more xDS servers to fetch these dynamic resources so that it can route traffic appropriately. Under the hood, this is how Cilium configures Envoy when L7 load balancing is enabled (Code).
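To make the "query a control plane for resources" idea concrete, here is a minimal Go sketch of what a request for endpoints looks like. The `DiscoveryRequest` struct below is a simplified, hypothetical stand-in for the real protobuf message (which has more fields, such as the node identity and a response nonce), and the cluster name is made up for illustration. The type URL is the standard v3 one for endpoint resources.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DiscoveryRequest is a simplified, hypothetical mirror of a few fields of the
// xDS DiscoveryRequest message. The real message is a protobuf with more fields.
type DiscoveryRequest struct {
	TypeURL       string   `json:"typeUrl"`
	ResourceNames []string `json:"resourceNames,omitempty"`
	VersionInfo   string   `json:"versionInfo,omitempty"`
}

// newEndpointRequest builds a request for the endpoints of the named clusters.
func newEndpointRequest(clusters ...string) DiscoveryRequest {
	return DiscoveryRequest{
		TypeURL:       "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
		ResourceNames: clusters,
	}
}

func main() {
	// "default/my-service" is a hypothetical cluster name.
	req := newEndpointRequest("default/my-service")
	out, err := json.MarshalIndent(req, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Clusters and routes follow the same pattern with a different type URL, which is what makes the API generic.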

What is proxyless gRPC?

gRPC has added functionality to contact xDS directly, so that gRPC clients can route requests directly without needing to go through an L7 proxy (gRPC Docs). When configured, gRPC contacts the xDS server to fetch routing information, and then resolves and routes requests fully within the client (without the need for DNS or an L7 proxy). This architecture enables applications to operate with lower latency, and with less overhead to the cluster overall.

Proxyless gRPC is currently supported by Istio and Google's Traffic Director.

How would Cilium enable proxyless gRPC?

Cilium agents currently have a lightweight implementation of the xDS protocol in order to configure Envoy (Code). However, this xDS server is currently exposed via a unix socket, preventing external clients from accessing xDS (Code).

With a slight adjustment, the Cilium agent could be configured to expose this xDS server via a hostPort, enabling gRPC clients to contact the xDS server via the Node's IP address. gRPC clients could then route requests directly according to the Cilium control plane, without the need for L7 load balancing.

_Caveat: We may need to extend the Cilium xDS implementation to include ADS to be compatible with the gRPC xDS implementation and to be more performant._

Cilium's xDS implementation could additionally be extended to retrieve and cache xDS resources from upstream management servers:

flowchart LR
    subgraph Control Plane Node
        K["K8s Control Plane"]
        M["xDS Management Server"]
    end
    subgraph Worker Node
        subgraph Cilium Pod
            C["xDS Server"]
            A["xDS Cache"]
            E["Envoy"]
            C <--> A
            M <--> A
            K <--> A
            E <--> C
        end
        subgraph Workloads
            G["gRPC Client Pod"]
            C <--> G
            H["HTTP Client Pod"]
            H <--> C
        end
    end

This would allow gRPC clients to receive the full features of an L7 service mesh without the need for an L7 proxy or cross-node look-asides.

Note: There is no reason that this feature set is restricted to gRPC. Users could implement clients that connect to other protocols (e.g., HTTP) via xDS as well.

How does this fit into the overall Cilium project roadmap?

Cilium's architecture makes it the go-to service mesh for users who prioritize latency, performance, and efficiency (for instance, through eBPF and a sidecar-less design). When it comes to implementing L7 features, performance-minded users will want to leverage proxyless gRPC to avoid the performance costs of using Envoy (Performance comparison of Envoy and Proxyless gRPC).

howardjohn commented 8 months ago

Just curious why you don't go "gRPC Client pod" to "xDS management Server" directly?

youngnick commented 8 months ago

Thanks for bringing this up @DerekTBrown!

I've got a couple of bits of information for you, and a couple of questions.

Firstly, the way that Envoy and the Cilium agent currently use sockets for communication is by design; it helps to manage the risk that we accept in having a per-node proxy (we definitely think it's worthwhile, but it is something we need to manage). Not having the xDS server listening on a network socket means that we can more easily control where that information is surfaced, and can more safely leave the xDS traffic itself in the clear (since it only passes over the domain socket).

Exposing the xDS control plane on a network address instead has a few complications:

Okay, so that's all to explain why things are currently in "use a socket for xDS" mode. Again, this is not a hard requirement, but the tradeoffs of swapping mean that the resultant extra feature would need to be pretty valuable for a lot of folks for us to focus on it.

Next, the questions.

In general, for Cilium Service Mesh, we've been working on having Cilium handle many service mesh functions transparently. It seems to me like the gRPC xDS support is trading off client complexity for lower latency (by having the client manage the load balancing to avoid a proxy hop). Do you think that's a fair characterization?

I ask because I currently don't see that the tradeoff there is worth it. That's not to say I'm right, or that I am not interested in discussing this more, but my two biggest concerns are:

To be clear, it's a really neat idea, and I can see that low-latency use cases would be willing to take the tradeoffs. But it seems like a lot of things to support for something that I haven't heard anyone ask for yet.

I'd be very interested to hear more about your thoughts, please feel free to reach out here or on Slack, or come to a community meeting (and ping me beforehand so I can ensure I'm awake).

DerekTBrown commented 8 months ago

Thanks @youngnick for your thorough and thoughtful response! I am planning to attend the Cilium community meeting tomorrow morning, so hopefully we can bounce some ideas around. I am still early in the process of thinking about how Cilium, xDS, and gRPC could fit together.

To perhaps rewind a bit, there are a few reasons that I have been exploring the xDS gRPC client's integration with Cilium:

Beyond the "proxyless" xDS mode, I am also interested in learning more generally about Cilium's plans for xDS. The way Cilium leverages xDS is very cool, but also very unorthodox (by having agents expose the xDS API, as opposed to a central control plane). I want to understand how this fits in with the general xDS ecosystem (users bring their own xDS APIs to provide features). For instance: if I want to implement my own xDS service, do I pipe that through Cilium Agents, do I inject that into the Envoy config, or is it breaking the Cilium interface to touch xDS configuration at all?

DerekTBrown commented 8 months ago

> Just curious why you don't go "gRPC Client pod" to "xDS management Server" directly?

In my vision of how this might work, the Cilium agent would implement a subset of xDS APIs (e.g., EDS, CDS, RDS) itself, and then proxy, cache, and multiplex requests to upstream xDS APIs (e.g., a rate limiting service).
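A very rough Go sketch of that split, under heavy assumptions: the `Agent` type, the `Source` function signature, and the rate limit type URL are all hypothetical, and resources are plain strings rather than protobufs. It only illustrates the routing decision (answer CDS/EDS/RDS locally, forward everything else upstream) and a shared cache so that many clients result in a single upstream fetch.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	clusterType  = "type.googleapis.com/envoy.config.cluster.v3.Cluster"
	endpointType = "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"
	routeType    = "type.googleapis.com/envoy.config.route.v3.RouteConfiguration"
	// Hypothetical type URL standing in for an upstream-only resource.
	rateLimitType = "example.com/hypothetical.RateLimitConfig"
)

// Source returns a serialized resource for a type URL and resource name.
type Source func(typeURL, name string) (string, error)

// Agent answers some resource types itself and proxies the rest upstream.
type Agent struct {
	local    map[string]Source // types served from the agent's own state
	upstream Source            // upstream management server

	mu    sync.Mutex
	cache map[string]string // upstream responses, keyed by type URL and name
}

// NewAgent wires a placeholder local source for CDS, EDS, and RDS.
func NewAgent(upstream Source) *Agent {
	local := func(_, name string) (string, error) { return "local:" + name, nil }
	return &Agent{
		local: map[string]Source{
			clusterType: local, endpointType: local, routeType: local,
		},
		upstream: upstream,
		cache:    make(map[string]string),
	}
}

// Fetch serves locally when possible; otherwise it consults the cache and
// falls back to a single upstream call. There is no invalidation, and the
// lock is held across the upstream call, which a real design would avoid.
func (a *Agent) Fetch(typeURL, name string) (string, error) {
	if src, ok := a.local[typeURL]; ok {
		return src(typeURL, name)
	}
	key := typeURL + "/" + name
	a.mu.Lock()
	defer a.mu.Unlock()
	if v, ok := a.cache[key]; ok {
		return v, nil
	}
	v, err := a.upstream(typeURL, name)
	if err != nil {
		return "", err
	}
	a.cache[key] = v
	return v, nil
}

func main() {
	upstreamCalls := 0
	agent := NewAgent(func(_, name string) (string, error) {
		upstreamCalls++
		return "upstream:" + name, nil
	})
	for i := 0; i < 2; i++ {
		v, err := agent.Fetch(rateLimitType, "checkout") // hypothetical resource
		if err != nil {
			panic(err)
		}
		fmt.Println(v)
	}
	v, _ := agent.Fetch(endpointType, "default/my-service")
	fmt.Println(v)
	fmt.Println("upstream calls:", upstreamCalls)
}
```

A real implementation would stream updates rather than fetch on demand, so treat this purely as a picture of where the local/upstream boundary would sit.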

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] commented 6 months ago

This issue has not seen any activity since it was marked stale. Closing.