CFP: Scalable node-to-node connectivity health-checking

Problem statement

Currently, we recommend turning off node-to-node connectivity testing for larger clusters, from https://docs.cilium.io/en/stable/operations/performance/scalability/report/ :

--set endpointHealthChecking.enabled=false and --set healthChecking=false disable endpoint health checking entirely.
 However it is recommended that those features be enabled initially on a smaller cluster (3-10 nodes) 
where it can be used to detect potential packet loss due to firewall rules or hypervisor settings.

I believe there are three reasons for that:

By default, we export metrics with latency to each node separately, for example metric:

cilium_node_connectivity_latency_seconds{address_type="primary",protocol="http",source_cluster="kind-kind",source_node_name="kind-worker",target_cluster="kind-kind",target_node_ip="10.244.0.236",target_node_name="kind-control-plane",target_node_type="remote_intra_cluster",type="endpoint"} 0.000660872

which means, if you scrape cilium metrics, you would get O(n^2) metrics cardinality just from this metric alone.

Similarly, we have connectivity status, example metric:

cilium_node_connectivity_status{source_cluster="kind-kind",source_node_name="kind-worker",target_cluster="kind-kind",target_node_name="kind-control-plane",target_node_type="remote_intra_cluster",type="endpoint"} 1

that also includes source and destination node names, resulting in similar O(n^2) metrics cardinality.

We do not spread ICMP pings across time, source code which means that periodically we burst a bunch of pings at the same time
ProbeInterval is fixed to 60s: https://github.com/cilium/cilium/blob/d20f15ecab7c157f6246a07c857662bec491f6ee/cilium-health/launch/launcher.go#L37

New node-to-node connectivity health checking proposal

For problem 1, instead of having a node-to-node metric that has O(n^2) cardinality, we could have histogram metrics without destination node or destination cluster name, only "type":

cilium_node_connectivity_latency_bucket{source_node_name="kind-worker", source_cluster="kind-kind", type="node", le="0.005"} 0
cilium_node_connectivity_latency_bucket{source_node_name="kind-worker", source_cluster="kind-kind", type="node", le="0.01"} 0
...

Similarly solution for metric cilium_node_connectivity_status

Both of these metrics would provide a high-level overview of latencies and status connectivity. If user would notice an increase of latencies or a change of connectivity status, they can take a more in-depth look by running:

cilium-dbg status --verbose
or
cilium-health status 
or
cilium-health status --probe

on the affected node, which would still output a full matrix of destination nodes with information about latencies and status per each remote node.

For problems 2 and 3, instead of running probes every minute (including icmp probes with burst and regular HTTP which are spread over time), let's introduce health checking that has a (configurable) fixed qps of probes, for example, 5 qps as a default per each type of probe. This would mean that:

We have consistent overhead for running probes as number of nodes in cluster/clustermesh grows
Because we use histogram for latencies, users will have to use rate(cilium_node_connectivity_latency_bucket...), which means they will be only observing latency of fresh results, but for a subset of nodes
connectivity_status metric will contain "more stale" results as cluster/clustermesh grows (*)
Users still can trigger probing on demand, by cilium-health status --probe to get a full "fresh" matrix of statuses/latencies, or use cilium-health status to check currently reported metrics.
If users want to tradeoff cpu/memory usage for "fresher" metrics, they can do that by increasing fixed qps of probes.

(*) For example, with 5,000 nodes and 5 qps, it would take approximately 15 minutes to update a full connectivity status. We could also consider distinguishing between local-nodes and remote-nodes in clustermesh and probe them separately, while also exposing metrics with label like in-cluster=true/false

Thanks for putting together this proposal. Overall I think it makes sense. The O(n^2) nature of the problem is primarily what limits the scale.

I agree the metrics aspect is a problem, and we can solve this aspect with your proposal, independent of any other work. I think it'd make sense to drop the target node labels that cause high cardinality. :+1: . We could debate whether to keep target cluster, I can see arguments either way on that but I'd be fine with dropping that as well.

Getting to the primary issue here, the ping frequency at scale - I agree that it would make sense to calculate an expected probe period based on scale, perhaps using ClusterSizeDependentInterval or a variation thereof - Probably that facility is based on O(N) scaling rather than O(N^2). Something I'll note is that reworking the health server to operate this way will likely require rewriting a big chunk of the core of that server; the main reasons why the health server operates the way it does currently are:

Scale requirements were lower at the time we wrote it, and
The libraries out there for performing pings were not tailored to solve this problem.

In an ideal implementation, the server would probably just wake up every N time (N = Total Period / Number of nodes), initiate the pings asynchronously, then sleep until the next probe time. Of course the complicated parts that come into this are (a) the individual pings may trigger errors or just time out, so they would need to safely report the results back in asynchronously, and (b) the list of nodes is not static, so when they change, the period / frequency / set of nodes to ping also need to be updated. This implementation could probably do with a more detailed CFP to cover all the bases in the design.

Something else to note is that as long as the health server remains within the Cilium Pod, the lifecycle for restart/upgrade will also be bound to the Cilium agent. This means that for instance when upgrading or restarting Cilium, the metrics for connectivity status will be reset, and those metrics will be stale/incorrect for some time. There's a couple of potential ramifications for this: (1) Ideally we would also report the current expected metric update period, and make it clear in the docs that after startup it will take around that long before this metric is accurate. (2) Alternatively we could widen the scope a bit and consider whether there's a design that moves this component out of the Cilium Pod or tries to make it a bit more generic. The current design where you can get cluster network connectivity health out-of-the-box has nice ergonomics for users, as they don't need to do anything special to get additional assurance about the health of the network. But if we really want to solve this problem at scale, then there could be alternate designs in the solution space. I guess it depends how far we wish to deviate from what's already in place.

cilium / cilium

CFP: Scalable node-to-node connectivity health-checking #32820

Problem statement

New node-to-node connectivity health checking proposal