howardjohn opened this issue 1 year ago (status: Open)
In general, I think we should not use network inputs for matching here (so not using additional listeners or addresses to match destination IP), so that leaves option (3) with an extensible filter chain matcher that has its own config delivery, separated from the network flows.
Regarding a move to 1 big listener with a FilterChain per pod, would we really lose the ability to update things? Looking at the docs for a filter-chain-only update, I think it would be effectively incremental because it doesn't drain connections on filter chains that were not changed in the update:
If the new filter chain and the old filter chain is protobuf message equivalent, the corresponding filter chain runtime info survives. The connections owned by the survived filter chains remain open.
This would be the desired behavior, right? If so, I think all that's needed is to configure filter chains instead of full blown listeners for each pod.
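To make that concrete, here's a rough sketch of what I have in mind (names, IPs, and the cluster are made up for illustration): one non-binding listener for the headless service, with a named filter chain per pod matched on the original destination IP via filter_chain_match.prefix_ranges. Since unchanged chains stay protobuf-equivalent across pushes, adding or removing a pod should only drain that pod's chain:

name: tcp-echo.default.svc.cluster.local_9000   # hypothetical listener name
address:
  socket_address: { address: 0.0.0.0, port_value: 9000 }
bind_to_port: false              # receives redirected connections, like our per-service listeners today
filter_chains:
- name: pod-10.0.0.5             # stable per-chain name; only changed chains drain on update
  filter_chain_match:
    prefix_ranges:
    - { address_prefix: 10.0.0.5, prefix_len: 32 }
  filters:
  - name: envoy.filters.network.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: pod_10_0_0_5
      cluster: outbound|9000||tcp-echo.default.svc.cluster.local
- name: pod-10.0.0.6
  filter_chain_match:
    prefix_ranges:
    - { address_prefix: 10.0.0.6, prefix_len: 32 }
  filters:
  - name: envoy.filters.network.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: pod_10_0_0_6
      cluster: outbound|9000||tcp-echo.default.svc.cluster.local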
Yes, @tonya11en the incremental drain is not affected. A change in the unified matcher logic does not trigger filter chain drains either, which solves (2):
Changing any of the IPs causes a drain for all connections, even to other IPs (I am 95% sure, want to double check though)
I missed a bit of context I think.
There are two concerns around updates: whether it drains, and how much we have to update at a time from an xDS perspective (delta XDS)
Current state: we have ~1 listener per service. We do not do incremental updates, so if anything changes we push the entire set of listeners. For some customers, this is O(10mb). Changes are expensive (push everything), but generally do not drain due to per-FC draining + isolated listeners.
Add delta: we could do delta xDS with listeners. This would mean we can incrementally push one listener over xDS. This is a performance optimization only, with no traffic-facing behavior change.
Move to single listener with filter chains: This removes the ability to do delta. We keep the draining behavior due to per-FC drain. We cannot represent the current behavior since we have per-listener listener filters today; this config exists but isn't powerful enough, though presumably we could extend it. I don't know whether it's possible to represent our current matching logic with filter_chain_match, since I have only looked at filter_chain_matcher. We lose the "Delta XDS" improvement we made.
Move to single listener with filter chains and add new "FilterChainDS": Same as above, but gives us back "Delta XDS" improvement. The precise listener filter enablement may make this kind of annoying, though.
Move to single listener with filter chains, add new "FilterChainDS", use filter_chain_matcher: This is a lot more powerful, and may be required to represent our current config (not sure). The downside is that this is somewhat antagonistic to FilterChainDS -- we would need to push the full match tree if any part of the tree changes. We should evaluate the size of the full tree in a large cluster setup. Above I mentioned 10mb -- is the tree 90% of that? 1%? If it's a small amount, this may be fine. If it's a large amount, then this doesn't help much.
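For concreteness, a rough sketch (purely illustrative names and IPs, not a proposal for the final shape) of the filter_chain_matcher route, selecting named filter chains by destination IP with the unified matcher. It also shows why this fights with a FilterChainDS-style delta: the whole map is a single field on the listener, so any change to it means re-sending the full tree:

filter_chain_matcher:
  matcher_tree:
    input:
      name: destination-ip
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.matching.common_inputs.network.v3.DestinationIPInput
    exact_match_map:             # a trie/CIDR matcher could be used instead for ranges
      map:
        "10.0.0.5":
          action:
            name: pod-10.0.0.5
            typed_config:
              "@type": type.googleapis.com/google.protobuf.StringValue
              value: pod-10.0.0.5   # name of the filter chain to select
        "10.0.0.6":
          action:
            name: pod-10.0.0.6
            typed_config:
              "@type": type.googleapis.com/google.protobuf.StringValue
              value: pod-10.0.0.6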
Mostly agree with the analysis, but I think you missed a major point: with one big direct listener you can reuse filter chains, which you cannot do with multiple indirect listeners. This factor will dominate and gets us close to the optimal representation, which is O(number of IPs that factor into the decision) + O(number of distinct network datapaths). We save the whole multiplicative factor here.
What if we used internal listeners? It may not affect the xDS payload sizes much, but it'll certainly mean lower resource utilization, and it lets us incrementally update the listeners via delta xDS someday.
What I found neat about it is that we can do fancy things with endpoint metadata matching:
name: cluster_0
load_assignment:
  cluster_name: cluster_0
  endpoints:
  - lb_endpoints:
    - endpoint:
        address:
          envoy_internal_address:
            server_listener_name: some_internal_listener
      metadata:
        ...
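And on the receiving side, the referenced internal listener would be declared roughly like this (sketch, not validated; the cluster name is a placeholder, and the internal listener bootstrap extension has to be enabled):

name: some_internal_listener     # must match server_listener_name above
internal_listener: {}            # no socket address; connections arrive from the cluster endpoint
filter_chains:
- filters:
  - name: envoy.filters.network.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: internal
      cluster: some_upstream_cluster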
I'll do some actual due-diligence tomorrow, but wanted to throw this idea out there in case one of you already considered and ruled it out.
Yeah, we looked at the internal listener for xDS composition. It didn't work out well, since it's expensive on the datapath with the extra buffers and doubling of listener overhead. Envoy performance was so bad that a simple Go/Rust implementation would outperform it. Internal listeners should be reserved for actual tunneling.
Also, listener updates can be optimized if needed. I thought Envoy already skips drains when only certain parts of the listener config are affected. This can be extended to matchers if it is not already.
@yanavlasov I'm looking at this, so you can assign the issue to me
@yanavlasov Drains are controlled by the filter chain names; connections tied to chains that are not modified are left intact, in both the unified matcher and the old matcher.
I quickly hacked the filter chain approach to see how much it reduces the payload sizes. Doesn't look very promising :(.
50 replicas in the headless service.
Listener per pod:
2023-12-12T00:22:50.483092Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-r7c57.default resources:122 size:86.8kB
2023-12-12T00:22:50.486530Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-mt99c.default resources:122 size:86.8kB
2023-12-12T00:22:50.486587Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-fj6lj.default resources:122 size:86.8kB
2023-12-12T00:22:50.486692Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-l7vbd.default resources:122 size:86.8kB
2023-12-12T00:22:50.487374Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-bwmql.default resources:122 size:86.8kB
2023-12-12T00:22:50.487388Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-7wtcn.default resources:122 size:86.8kB
2023-12-12T00:22:50.487716Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-tc9rp.default resources:122 size:86.8kB
2023-12-12T00:22:50.487833Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-qqsff.default resources:122 size:86.8kB
2023-12-12T00:22:50.489071Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-kf6l9.default resources:122 size:86.8kB
2023-12-12T00:22:50.494092Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-5w5kr.default resources:122 size:86.8kB
Filter chain per pod:
2023-12-12T00:17:42.020920Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-nzbcl.default resources:26 size:85.1kB
2023-12-12T00:17:42.027833Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-ncn9d.default resources:26 size:85.1kB
2023-12-12T00:17:42.031512Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-zxcrc.default resources:26 size:85.1kB
2023-12-12T00:17:42.031938Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-vhv79.default resources:26 size:85.1kB
2023-12-12T00:17:42.034123Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-7td5w.default resources:26 size:85.1kB
2023-12-12T00:17:42.037265Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-2fd7n.default resources:26 size:85.1kB
2023-12-12T00:17:42.039333Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-z65z6.default resources:26 size:85.1kB
2023-12-12T00:17:42.036012Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-cwz7d.default resources:26 size:85.1kB
2023-12-12T00:17:42.041103Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-jbl8z.default resources:26 size:85.1kB
2023-12-12T00:17:42.036177Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-gxcpb.default resources:26 size:85.1kB
Even though the listener resource count was much lower (26 vs 122), the actual payload size didn't decrease by much (85.1kB vs 86.8kB). I can try this again with more pods to see if the reduction is more noticeable, but will need to fiddle with my kind config. I'll do this next.
Are you sharing the filter chain config across the pods? I think the improvement requires that you have one shared filter chain for all headless pods. This has to be more efficient, just like we see with ECDS in https://github.com/istio/istio/pull/48256.
Good point. When I share filter chains it drops to ~62kB. I'll need to make a point to test the proxy resource utilization and see if there's a difference there.
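If I understand the suggestion correctly, "sharing" roughly means collapsing the per-pod chains into one chain whose match just lists the pod IPs, so only the IP list grows with pod count while the datapath config is stated once. A sketch with made-up IPs:

filter_chains:
- name: tcp-echo-headless          # one chain shared by every pod in the service
  filter_chain_match:
    prefix_ranges:                 # only this list changes as pods come and go
    - { address_prefix: 10.0.0.5, prefix_len: 32 }
    - { address_prefix: 10.0.0.6, prefix_len: 32 }
    - { address_prefix: 10.0.0.7, prefix_len: 32 }
  filters:
  - name: envoy.filters.network.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: tcp_echo_shared
      cluster: outbound|9000||tcp-echo.default.svc.cluster.local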
Next, I'll get some numbers on the kinds of savings we'd expect if we compressed the payloads.
I wanted to see what kinds of payload size savings we'd expect if we compressed the gRPC response payloads. Currently, gRPC supports gzip, lz4, and snappy compression.
No changes (listener per pod):
2023-12-12T20:49:13.862196Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-952v5.default resources:122 size:96.4kB gzip:5.0kB lz4:7.5kB snappy:10.6kB
2023-12-12T20:49:13.866110Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-f7b9v.default resources:122 size:96.4kB gzip:5.0kB lz4:7.4kB snappy:10.6kB
2023-12-12T20:49:13.870051Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-8d4fx.default resources:122 size:96.4kB gzip:5.0kB lz4:7.4kB snappy:10.6kB
2023-12-12T20:49:13.871056Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-nm6vs.default resources:122 size:96.4kB gzip:4.9kB lz4:7.4kB snappy:10.5kB
2023-12-12T20:49:13.870131Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-2ck7b.default resources:122 size:96.4kB gzip:4.9kB lz4:7.3kB snappy:10.5kB
2023-12-12T20:49:13.884184Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-xwbbd.default resources:122 size:96.4kB gzip:4.9kB lz4:7.4kB snappy:10.4kB
2023-12-12T20:49:13.886300Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-lpjbv.default resources:122 size:96.4kB gzip:5.0kB lz4:7.4kB snappy:10.5kB
2023-12-12T20:49:13.889805Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-sqzg9.default resources:122 size:96.4kB gzip:4.9kB lz4:7.4kB snappy:10.5kB
2023-12-12T20:49:13.902850Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-n7qsv.default resources:122 size:96.4kB gzip:5.0kB lz4:7.4kB snappy:10.5kB
2023-12-12T20:49:13.902855Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-9hn6w.default resources:122 size:96.4kB gzip:4.9kB lz4:7.4kB snappy:10.5kB
Filter chains (WITHOUT sharing):
2023-12-12T20:43:16.474867Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-wcqnx.default resources:26 size:87.1kB gzip:4.1kB lz4:6.2kB snappy:8.9kB
2023-12-12T20:43:16.480340Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-ps89h.default resources:26 size:87.1kB gzip:4.1kB lz4:6.2kB snappy:8.9kB
2023-12-12T20:43:16.486553Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-8nxr7.default resources:26 size:87.1kB gzip:4.1kB lz4:6.1kB snappy:9.1kB
2023-12-12T20:43:16.486848Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-sr4s6.default resources:26 size:87.1kB gzip:4.1kB lz4:6.2kB snappy:8.9kB
2023-12-12T20:43:16.500054Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-22g8d.default resources:26 size:87.1kB gzip:4.9kB lz4:6.1kB snappy:9.1kB
2023-12-12T20:43:16.500913Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-jftlf.default resources:26 size:87.1kB gzip:4.1kB lz4:5.3kB snappy:9.2kB
2023-12-12T20:43:16.501592Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-sk4fz.default resources:26 size:87.1kB gzip:4.1kB lz4:6.2kB snappy:9.1kB
2023-12-12T20:43:16.502457Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-2cb4v.default resources:26 size:87.1kB gzip:4.1kB lz4:5.3kB snappy:9.1kB
2023-12-12T20:43:16.502990Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-2hhj7.default resources:26 size:87.1kB gzip:4.9kB lz4:6.2kB snappy:9.2kB
2023-12-12T20:43:16.503606Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-jlm48.default resources:26 size:87.1kB gzip:4.1kB lz4:6.2kB snappy:9.1kB
Filter chains (WITH sharing):
2023-12-12T20:35:29.832937Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-7d6b5.default resources:26 size:64.5kB gzip:4.0kB lz4:4.9kB snappy:6.8kB
2023-12-12T20:35:29.835102Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-7v2lh.default resources:26 size:64.5kB gzip:3.8kB lz4:4.8kB snappy:6.7kB
2023-12-12T20:35:29.835766Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-6fn6z.default resources:26 size:64.5kB gzip:3.8kB lz4:4.8kB snappy:6.7kB
2023-12-12T20:35:29.838434Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-x52lx.default resources:26 size:64.5kB gzip:3.8kB lz4:4.9kB snappy:6.7kB
2023-12-12T20:35:29.838821Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-d7t66.default resources:26 size:64.5kB gzip:3.8kB lz4:4.8kB snappy:6.8kB
2023-12-12T20:35:29.839771Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-5dpvl.default resources:26 size:64.5kB gzip:4.0kB lz4:4.8kB snappy:6.7kB
2023-12-12T20:35:29.839874Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-4qwrh.default resources:26 size:64.5kB gzip:3.9kB lz4:4.8kB snappy:6.7kB
2023-12-12T20:35:29.841693Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-zwf99.default resources:26 size:64.5kB gzip:3.9kB lz4:4.8kB snappy:6.7kB
2023-12-12T20:35:29.843061Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-ml6bg.default resources:26 size:64.5kB gzip:3.8kB lz4:4.8kB snappy:6.7kB
2023-12-12T20:35:29.844861Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-kx8q7.default resources:26 size:64.5kB gzip:3.8kB lz4:4.9kB snappy:6.7kB
Here it is in table form:

| | No Compression | GZIP | LZ4 | SNAPPY |
|---|---|---|---|---|
| Listener per pod | 96.4kB | 4.9kB | 7.4kB | 10.5kB |
| Filter chain per pod (no sharing) | 87.1kB | 4.9kB | 6.2kB | 9.1kB |
| Filter chain per pod (with sharing) | 64.5kB | 3.8kB | 4.8kB | 6.7kB |
If we're trying to reduce the payload sizes, I think it is worth the additional CPU overhead of compressing the payloads, given the numbers above. Something like Snappy compression would likely introduce minimal overhead, but this is something we can easily measure to determine if it's worth it.
Can you please measure the CPU overhead of compression? Most of xDS counts as pure overhead since it has no relation to the data plane traffic (which is what users typically pay $$$ for), so we should make sure we're not significantly increasing the cost of xDS decoding, given that bandwidth to xDS has not been a constraint so far.
Title: Scalable Pod IP matches in listener
Description: In Istio, we set up listeners with bind_to_port=false and use original destination redirection to match them. So we roughly have a Listener per Kubernetes Service, matching on the Service IP. This works well enough.
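For anyone less familiar with the setup, a trimmed, illustrative sketch of that shape (addresses and names made up, most fields elided):

# Virtual outbound listener: owns the port and redirects to the listener matching the original destination.
- name: virtualOutbound
  address:
    socket_address: { address: 0.0.0.0, port_value: 15001 }
  use_original_dst: true
  ...
# Per-Service listener: does not bind; selected when the original destination is the Service VIP.
- name: 10.96.0.20_9000
  address:
    socket_address: { address: 10.96.0.20, port_value: 9000 }
  bind_to_port: false
  filter_chains:
  - filters:
    - name: envoy.filters.network.tcp_proxy
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
        stat_prefix: tcp_echo
        cluster: outbound|9000||tcp-echo.default.svc.cluster.local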
The problem we have is traffic that is not to services. In Istio today, we support this only if there is a headless service associated with the pod. A headless service is more or less just a way to expose the backends via DNS, compared to a normal service which has a VIP; from Envoy's perspective, the requests look the same. We only support headless services, not all direct-to-pod traffic, since our approach is terrible from a scale perspective: just like for services, we have a listener per pod.
This scales quite poorly. Pods are large in number and churn often. Listeners are large, and we haven't fully adopted delta xDS to send incremental updates for them.
Our current model is really not viable. We see customers with modest headless services (~250 pods) have LDS payloads of 10mb+. Combined with these churning frequently and causing full pushes of the entire 10mb, this ends up being pretty terrible.
We would like to improve on this state. Options discussed below:
(1) Delta xDS: No Envoy changes needed; this is a pure Istio-side change. While we are working on this (for a number of reasons), it still doesn't fully solve the problem. Envoy still needs the full massive set of listeners, so it's only an incremental improvement.
(2) Additional addresses: Instead of N listeners for a service, we can have 1 listener with N additional addresses. This almost works well, but there are a few problems:
(3) Filter chain matching: This gives us more flexibility, as we can just stick in a list of IPs. This would mostly fix the problem. However, it contradicts (1): if we move to a single giant listener, we lose the ability to incrementally update things.
A theoretical new "FilterChainDS" could help with this. It's not clear how this would interact with the new tree matcher, though.