howardjohn opened this issue 1 year ago (status: Open)
In general, I think we should not use network inputs for matching here (so not using additional listeners or addresses to match destination IP), so that leaves option (3) with an extensible filter chain matcher that has its own config delivery, separated from the network flows.
Regarding a move to 1 big listener with a FilterChain per pod, would we really lose the ability to update things? Looking at the docs for a filter-chain-only update, I think it would be effectively incremental because it doesn't drain connections on filter chains that were not changed in the update:
If the new filter chain and the old filter chain is protobuf message equivalent, the corresponding filter chain runtime info survives. The connections owned by the survived filter chains remain open.
This would be the desired behavior, right? If so, I think all that's needed is to configure filter chains instead of full blown listeners for each pod.
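To make that concrete, here's a rough sketch of what I have in mind (names, IPs, and the cluster are made up for illustration): one non-binding listener for the headless service, with a named filter chain per pod matched on the original destination IP via filter_chain_match.prefix_ranges. Since unchanged chains stay protobuf-equivalent across pushes, adding or removing a pod should only drain that pod's chain:

name: tcp-echo.default.svc.cluster.local_9000   # hypothetical listener name
address:
  socket_address: { address: 0.0.0.0, port_value: 9000 }
bind_to_port: false              # receives redirected connections, like our per-service listeners today
filter_chains:
- name: pod-10.0.0.5             # stable per-chain name; only changed chains drain on update
  filter_chain_match:
    prefix_ranges:
    - { address_prefix: 10.0.0.5, prefix_len: 32 }
  filters:
  - name: envoy.filters.network.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: pod_10_0_0_5
      cluster: outbound|9000||tcp-echo.default.svc.cluster.local
- name: pod-10.0.0.6
  filter_chain_match:
    prefix_ranges:
    - { address_prefix: 10.0.0.6, prefix_len: 32 }
  filters:
  - name: envoy.filters.network.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: pod_10_0_0_6
      cluster: outbound|9000||tcp-echo.default.svc.cluster.local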
Yes, @tonya11en the incremental drain is not affected. A change in the unified matcher logic does not trigger filter chain drains either, which solves (2):
Changing any of the IPs causes a drain for all connections, even to other IPs (I am 95% sure, want to double check though)
I missed a bit of context I think.
There are two concerns around updates: whether it drains, and how much we have to update at a time from an xDS perspective (delta XDS)
Current state: we have ~1 listener per service. We do not do incremental updates, so if anything changes we push the entire set of listeners. For some customers, this is O(10mb). Changes are expensive (push everything), but generally do not drain due to per-FC draining + isolated listeners.
Add delta: we could do delta xDS with listeners. This would mean we can incrementally push one listener over xDS. This is a performance optimization only, with no traffic-facing behavior change.
Move to single listener with filter chains: This removes the ability to do delta. We keep the draining behavior due to per-FC drain. We cannot represent the current behavior since we have per-listener listener filters today; this config exists but isn't powerful enough, though presumably we could extend it. I don't know whether it's possible to represent our current matching logic with filter_chain_match, since I have only looked at filter_chain_matcher. We lose the "Delta XDS" improvement we made.
Move to single listener with filter chains and add new "FilterChainDS": Same as above, but gives us back "Delta XDS" improvement. The precise listener filter enablement may make this kind of annoying, though.
Move to single listener with filter chains, add new "FilterChainDS", use filter_chain_matcher: This is a lot more powerful, and may be required to represent our current config (not sure). The downside is that this is somewhat antagonistic to FilterChainDS -- we would need to push the full match tree if any part of the tree changes. We should evaluate the size of the full tree in a large cluster setup. Above I mentioned 10mb -- is the tree 90% of that? 1%? If it's a small amount, this may be fine. If it's a large amount, then this doesn't help much.
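For concreteness, a rough sketch (purely illustrative names and IPs, not a proposal for the final shape) of the filter_chain_matcher route, selecting named filter chains by destination IP with the unified matcher. It also shows why this fights with a FilterChainDS-style delta: the whole map is a single field on the listener, so any change to it means re-sending the full tree:

filter_chain_matcher:
  matcher_tree:
    input:
      name: destination-ip
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.matching.common_inputs.network.v3.DestinationIPInput
    exact_match_map:             # a trie/CIDR matcher could be used instead for ranges
      map:
        "10.0.0.5":
          action:
            name: pod-10.0.0.5
            typed_config:
              "@type": type.googleapis.com/google.protobuf.StringValue
              value: pod-10.0.0.5   # name of the filter chain to select
        "10.0.0.6":
          action:
            name: pod-10.0.0.6
            typed_config:
              "@type": type.googleapis.com/google.protobuf.StringValue
              value: pod-10.0.0.6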
Mostly agree with the analysis, but I think you missed a major point: with one big direct listener you can reuse filter chains, which you cannot do with multiple indirect listeners. This factor will dominate and gets us close to the optimal representation, which is O(number of IPs that factor into the decision) + O(number of distinct network datapaths). We save the whole multiplicative factor here.
What if we used internal listeners? It may not affect the xDS payload sizes much, but it'll certainly mean lower resource utilization, and it lets us incrementally update the listeners via delta xDS someday.
What I found neat about it is that we can do fancy things with endpoint metadata matching:
name: cluster_0
load_assignment:
  cluster_name: cluster_0
  endpoints:
  - lb_endpoints:
    - endpoint:
        address:
          envoy_internal_address:
            server_listener_name: some_internal_listener
      metadata:
        ...
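And on the receiving side, the referenced internal listener would be declared roughly like this (sketch, not validated; the cluster name is a placeholder, and the internal listener bootstrap extension has to be enabled):

name: some_internal_listener     # must match server_listener_name above
internal_listener: {}            # no socket address; connections arrive from the cluster endpoint
filter_chains:
- filters:
  - name: envoy.filters.network.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: internal
      cluster: some_upstream_cluster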
I'll do some actual due-diligence tomorrow, but wanted to throw this idea out there in case one of you already considered and ruled it out.
Yeah, we looked at the internal listener for xDS composition. It didn't work out well, since it's expensive on the datapath with the extra buffers and doubling of listener overhead. Envoy performance was so bad that a simple Go/Rust implementation would outperform it. Internal listeners should be reserved for actual tunneling.
Also, listener updates can be optimized if needed. I thought Envoy already skips drains when only certain parts of the listener config are affected. This can be extended to matchers if it is not already.
@yanavlasov I'm looking at this, so you can assign the issue to me
@yanavlasov Drains are controlled by the filter chain names; connections tied to chains that are not modified are left intact, in both the unified matcher and the old matcher.
I quickly hacked the filter chain approach to see how much it reduces the payload sizes. Doesn't look very promising :(.
50 replicas in the headless service.
Listener per pod:
2023-12-12T00:22:50.483092Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-r7c57.default resources:122 size:86.8kB
2023-12-12T00:22:50.486530Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-mt99c.default resources:122 size:86.8kB
2023-12-12T00:22:50.486587Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-fj6lj.default resources:122 size:86.8kB
2023-12-12T00:22:50.486692Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-l7vbd.default resources:122 size:86.8kB
2023-12-12T00:22:50.487374Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-bwmql.default resources:122 size:86.8kB
2023-12-12T00:22:50.487388Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-7wtcn.default resources:122 size:86.8kB
2023-12-12T00:22:50.487716Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-tc9rp.default resources:122 size:86.8kB
2023-12-12T00:22:50.487833Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-qqsff.default resources:122 size:86.8kB
2023-12-12T00:22:50.489071Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-kf6l9.default resources:122 size:86.8kB
2023-12-12T00:22:50.494092Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-5w5kr.default resources:122 size:86.8kB
Filter chain per pod:
2023-12-12T00:17:42.020920Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-nzbcl.default resources:26 size:85.1kB
2023-12-12T00:17:42.027833Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-ncn9d.default resources:26 size:85.1kB
2023-12-12T00:17:42.031512Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-zxcrc.default resources:26 size:85.1kB
2023-12-12T00:17:42.031938Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-vhv79.default resources:26 size:85.1kB
2023-12-12T00:17:42.034123Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-7td5w.default resources:26 size:85.1kB
2023-12-12T00:17:42.037265Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-2fd7n.default resources:26 size:85.1kB
2023-12-12T00:17:42.039333Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-z65z6.default resources:26 size:85.1kB
2023-12-12T00:17:42.036012Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-cwz7d.default resources:26 size:85.1kB
2023-12-12T00:17:42.041103Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-jbl8z.default resources:26 size:85.1kB
2023-12-12T00:17:42.036177Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-gxcpb.default resources:26 size:85.1kB
Even though the listener resource count was much lower (26 vs 122), the actual payload size didn't decrease by much (85.1kB vs 86.8kB). I can try this again with more pods to see if the reduction is more noticeable, but will need to fiddle with my kind config. I'll do this next.
Are you sharing the filter chain config across the pods? I think the improvement requires that you have one shared filter chain for all headless pods. This has to be more efficient, just like we see with ECDS in https://github.com/istio/istio/pull/48256.
Good point. When I share filter chains it drops to ~62kB. I'll need to make a point to test the proxy resource utilization and see if there's a difference there.
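If I understand the suggestion correctly, "sharing" roughly means collapsing the per-pod chains into one chain whose match just lists the pod IPs, so only the IP list grows with pod count while the datapath config is stated once. A sketch with made-up IPs:

filter_chains:
- name: tcp-echo-headless          # one chain shared by every pod in the service
  filter_chain_match:
    prefix_ranges:                 # only this list changes as pods come and go
    - { address_prefix: 10.0.0.5, prefix_len: 32 }
    - { address_prefix: 10.0.0.6, prefix_len: 32 }
    - { address_prefix: 10.0.0.7, prefix_len: 32 }
  filters:
  - name: envoy.filters.network.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: tcp_echo_shared
      cluster: outbound|9000||tcp-echo.default.svc.cluster.local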
Next, I'll get some numbers on the kinds of savings we'd expect if we compressed the payloads.
I wanted to see what kinds of payload size savings we'd expect if we compressed the gRPC response payloads. Currently, gRPC supports gzip, lz4, and snappy compression.
No changes (listener per pod):
2023-12-12T20:49:13.862196Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-952v5.default resources:122 size:96.4kB gzip:5.0kB lz4:7.5kB snappy:10.6kB
2023-12-12T20:49:13.866110Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-f7b9v.default resources:122 size:96.4kB gzip:5.0kB lz4:7.4kB snappy:10.6kB
2023-12-12T20:49:13.870051Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-8d4fx.default resources:122 size:96.4kB gzip:5.0kB lz4:7.4kB snappy:10.6kB
2023-12-12T20:49:13.871056Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-nm6vs.default resources:122 size:96.4kB gzip:4.9kB lz4:7.4kB snappy:10.5kB
2023-12-12T20:49:13.870131Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-2ck7b.default resources:122 size:96.4kB gzip:4.9kB lz4:7.3kB snappy:10.5kB
2023-12-12T20:49:13.884184Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-xwbbd.default resources:122 size:96.4kB gzip:4.9kB lz4:7.4kB snappy:10.4kB
2023-12-12T20:49:13.886300Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-lpjbv.default resources:122 size:96.4kB gzip:5.0kB lz4:7.4kB snappy:10.5kB
2023-12-12T20:49:13.889805Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-sqzg9.default resources:122 size:96.4kB gzip:4.9kB lz4:7.4kB snappy:10.5kB
2023-12-12T20:49:13.902850Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-n7qsv.default resources:122 size:96.4kB gzip:5.0kB lz4:7.4kB snappy:10.5kB
2023-12-12T20:49:13.902855Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-9hn6w.default resources:122 size:96.4kB gzip:4.9kB lz4:7.4kB snappy:10.5kB
Filter chains (WITHOUT sharing):
2023-12-12T20:43:16.474867Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-wcqnx.default resources:26 size:87.1kB gzip:4.1kB lz4:6.2kB snappy:8.9kB
2023-12-12T20:43:16.480340Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-ps89h.default resources:26 size:87.1kB gzip:4.1kB lz4:6.2kB snappy:8.9kB
2023-12-12T20:43:16.486553Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-8nxr7.default resources:26 size:87.1kB gzip:4.1kB lz4:6.1kB snappy:9.1kB
2023-12-12T20:43:16.486848Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-sr4s6.default resources:26 size:87.1kB gzip:4.1kB lz4:6.2kB snappy:8.9kB
2023-12-12T20:43:16.500054Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-22g8d.default resources:26 size:87.1kB gzip:4.9kB lz4:6.1kB snappy:9.1kB
2023-12-12T20:43:16.500913Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-jftlf.default resources:26 size:87.1kB gzip:4.1kB lz4:5.3kB snappy:9.2kB
2023-12-12T20:43:16.501592Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-sk4fz.default resources:26 size:87.1kB gzip:4.1kB lz4:6.2kB snappy:9.1kB
2023-12-12T20:43:16.502457Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-2cb4v.default resources:26 size:87.1kB gzip:4.1kB lz4:5.3kB snappy:9.1kB
2023-12-12T20:43:16.502990Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-2hhj7.default resources:26 size:87.1kB gzip:4.9kB lz4:6.2kB snappy:9.2kB
2023-12-12T20:43:16.503606Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-jlm48.default resources:26 size:87.1kB gzip:4.1kB lz4:6.2kB snappy:9.1kB
Filter chains (WITH sharing):
2023-12-12T20:35:29.832937Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-7d6b5.default resources:26 size:64.5kB gzip:4.0kB lz4:4.9kB snappy:6.8kB
2023-12-12T20:35:29.835102Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-7v2lh.default resources:26 size:64.5kB gzip:3.8kB lz4:4.8kB snappy:6.7kB
2023-12-12T20:35:29.835766Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-6fn6z.default resources:26 size:64.5kB gzip:3.8kB lz4:4.8kB snappy:6.7kB
2023-12-12T20:35:29.838434Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-x52lx.default resources:26 size:64.5kB gzip:3.8kB lz4:4.9kB snappy:6.7kB
2023-12-12T20:35:29.838821Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-d7t66.default resources:26 size:64.5kB gzip:3.8kB lz4:4.8kB snappy:6.8kB
2023-12-12T20:35:29.839771Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-5dpvl.default resources:26 size:64.5kB gzip:4.0kB lz4:4.8kB snappy:6.7kB
2023-12-12T20:35:29.839874Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-4qwrh.default resources:26 size:64.5kB gzip:3.9kB lz4:4.8kB snappy:6.7kB
2023-12-12T20:35:29.841693Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-zwf99.default resources:26 size:64.5kB gzip:3.9kB lz4:4.8kB snappy:6.7kB
2023-12-12T20:35:29.843061Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-ml6bg.default resources:26 size:64.5kB gzip:3.8kB lz4:4.8kB snappy:6.7kB
2023-12-12T20:35:29.844861Z info ads LDS: PUSH for node:tcp-echo-86c789c6f4-kx8q7.default resources:26 size:64.5kB gzip:3.8kB lz4:4.9kB snappy:6.7kB
Here it is in table form:

| | No Compression | GZIP | LZ4 | SNAPPY |
|---|---|---|---|---|
| Listener per pod | 96.4kB | 4.9kB | 7.4kB | 10.5kB |
| Filter chain per pod (no sharing) | 87.1kB | 4.9kB | 6.2kB | 9.1kB |
| Filter chain per pod (with sharing) | 64.5kB | 3.8kB | 4.8kB | 6.7kB |
If we're trying to reduce the payload sizes, I think it is worth the additional CPU overhead of compressing the payloads, given the numbers above. Something like Snappy compression would likely introduce minimal overhead, but this is something we can easily measure to determine if it's worth it.
Can you please measure the CPU overhead of compression? Most of xDS counts as pure overhead since it has no relation to the data plane traffic (which is what users typically pay $$$ for), so we should make sure we're not significantly increasing the cost of xDS decoding, given that bandwidth to xDS has not been a constraint so far.
Title: Scalable Pod IP matches in listener
Description: In Istio, we set up listeners with bind_to_port=false and use original destination redirection to match them. So we roughly have a Listener per Kubernetes Service, matching on the Service IP. This works well enough.
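For anyone less familiar with the setup, a trimmed, illustrative sketch of that shape (addresses and names made up, most fields elided):

# Virtual outbound listener: owns the port and redirects to the listener matching the original destination.
- name: virtualOutbound
  address:
    socket_address: { address: 0.0.0.0, port_value: 15001 }
  use_original_dst: true
  ...
# Per-Service listener: does not bind; selected when the original destination is the Service VIP.
- name: 10.96.0.20_9000
  address:
    socket_address: { address: 10.96.0.20, port_value: 9000 }
  bind_to_port: false
  filter_chains:
  - filters:
    - name: envoy.filters.network.tcp_proxy
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
        stat_prefix: tcp_echo
        cluster: outbound|9000||tcp-echo.default.svc.cluster.local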
The problem we have is traffic that is not to services. In Istio today, we support this only if there is a headless service associated with the pod. A headless service is more or less just a way to expose the backends via DNS, compared to a normal service which has a VIP; from Envoy's perspective, the requests look the same. We only support headless services, not all direct-to-pod traffic, since our approach is terrible from a scale perspective: just like for services, we have a listener per pod.
This scales quite poorly. Pods are large in number and churn often. Listeners are large, and we haven't fully adopted delta xDS to send incremental updates for them.
Our current model is really not viable. We see customers with modest headless services (~250 pods) have LDS payloads of 10mb+. Combined with these churning frequently and causing full pushes of the entire 10mb, this ends up being pretty terrible.
We would like to improve on this state. Options discussed below:
(1) Delta xDS: No Envoy changes needed; this is a pure Istio-side change. While we are working on this (for a number of reasons), it still doesn't fully solve the problem. Envoy still needs the full massive set of listeners, so it's only an incremental improvement.
(2) Additional addresses: Instead of N listeners for a service, we can have 1 listener with N additional addresses. This almost works well, but there are a few problems:
(3) Filter chain matching: This gives us more flexibility, as we can just stick in a list of IPs. This would mostly fix the problem. However, it contradicts (1): if we move to a single giant listener, we lose the ability to incrementally update things.
A theoretical new "FilterChainDS" could help with this. It's not clear how this would interact with the new tree matcher, though.