DAG for policy reconciliation

Problem statement

Kuadrant's current policy reconciliation process is centered too heavily on the policy objects and has little, if any, awareness of the underlying topology, which it can only discover by repeatedly querying the cluster API.

This has resulted in:

Occasional cyclic triggering of the reconciliation loop
Many requests to the kube API server (see also Fanout status update problems)
- Slows down the overall reconciliation loop
- Risks occasional rate limiting by the kube API server
Limited visibility for policy controller implementers into the different kinds of resource events that need to be watched
Heavy reliance on annotations to track the back-refs
Each new policy kind requires a lot of work to be implemented

Example-driven explanation

                                   ┌───────────┐
                 ┌──EnvoyFilter-1  │ Limitador │    ┌──EnvoyFilter-2
     rlp-1────┐  │                 └───────────┘    │
              │  ├──WasmPlugin-1                    ├──WasmPlugin-2
              ▼  │                                  │
           ┌─────┴┐                           ┌─────┴┐
     ┌────►│ gw-1 │◄────┬────────────┐  ┌────►│ gw-2 │◄────┐
     │     └──────┘     │            │  │     └──────┘     │
     │                  │            │  │                  │
     │                  │            │  │                  │
┌────┴────┐       ┌─────┴───┐      ┌─┴──┴────┐       ┌─────┴───┐
│ route-1 │       │ route-2 │      │ route-3 │       │ route-4 │
└─────────┘       └─────────┘      └─────────┘       └─────────┘
     ▲                                  ▲                  ▲
     │                                  │                  │
     │                                  │                  │
   rlp-2                              rlp-3              rlp-4

Reconciliation of rlp-2 (created after rlp-1) requires triggering the reconciliation of rlp-1 again, to recalculate the scope of rlp-1, i.e. to update WasmPlugin-1 and Limitador, which in turn have just been updated because of rlp-2 itself
Similarly, rlp-3 requires recalculating WasmPlugin-1 and Limitador, in addition to creating EnvoyFilter-2 and WasmPlugin-2
Getting to the affected gateways involves: a. inspecting the specs of the targeted routes for parentRefs; b. listing all RLPs for gateway-targeting ones; c. trusting the state of the back-ref annotations.
Reconciliation of any policy event involves trying to detect what kind of event triggered it – i.e. policy created/updated/deleted, route created/updated/deleted, gateway created/updated/deleted
Other events need to be watched so that derived resources can be reconciled back to the source of truth (policies + network topology), e.g. wasmplugin/envoyfilter/limitador modified/deleted
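
To make the example concrete, the diagram can be written down as plain in-memory maps and the affected gateways looked up without any call to the API server or any back-ref annotation. This is only a minimal sketch: the `Topology` type, its fields and the `AffectedGateways` / `PoliciesInScope` helpers are hypothetical names chosen for illustration, not existing Kuadrant code.

```go
package main

import "sort"

// Topology is a hypothetical in-memory view of the diagram above.
// parents maps a route to the gateways it attaches to (its parentRefs);
// targets maps a policy to the resource it targets.
type Topology struct {
	parents map[string][]string
	targets map[string]string
}

// exampleTopology encodes the diagram: route-1..3 attach to gw-1,
// route-3..4 attach to gw-2, rlp-1 targets gw-1, rlp-2..4 target routes.
func exampleTopology() Topology {
	return Topology{
		parents: map[string][]string{
			"route-1": {"gw-1"},
			"route-2": {"gw-1"},
			"route-3": {"gw-1", "gw-2"},
			"route-4": {"gw-2"},
		},
		targets: map[string]string{
			"rlp-1": "gw-1",
			"rlp-2": "route-1",
			"rlp-3": "route-3",
			"rlp-4": "route-4",
		},
	}
}

// AffectedGateways returns the gateways whose configuration depends on the
// given policy: the target itself when it is a gateway, otherwise the
// parents of the targeted route. The result is sorted by name.
func (t Topology) AffectedGateways(policy string) []string {
	target, ok := t.targets[policy]
	if !ok {
		return nil
	}
	gateways, isRoute := t.parents[target]
	if !isRoute {
		return []string{target}
	}
	out := append([]string(nil), gateways...)
	sort.Strings(out)
	return out
}

// PoliciesInScope returns, sorted by name, every policy that affects the
// given gateway, i.e. the policies that must be considered together when
// the gateway's WasmPlugin is recalculated.
func (t Topology) PoliciesInScope(gateway string) []string {
	var out []string
	for policy := range t.targets {
		for _, gw := range t.AffectedGateways(policy) {
			if gw == gateway {
				out = append(out, policy)
				break
			}
		}
	}
	sort.Strings(out)
	return out
}
```

With this encoding, `exampleTopology().AffectedGateways("rlp-3")` yields both gw-1 and gw-2, and `PoliciesInScope("gw-1")` yields rlp-1, rlp-2 and rlp-3, which matches the recalculations described above.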

Possible solution

Keep a version of the topology in-memory as a DAG (Directed Acyclic Graph)
Rely more on the informers pattern, to replace/complement controller-runtime, possibly replacing the “traditional” reconciliation loops as we know them today
Recompute the effective policies top-down, from affected gateways and downwards to the leaves
Distinguish between events that affect the topology, events that just require recomputing and reapplying effective policies, and events that just require reapplying previously computed states.
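
The last two points could look roughly like the sketch below: one function that sorts events into the three categories, and one that walks the graph downwards from an affected gateway. All names (`Action`, `Event`, `Classify`, `Graph`, `TopDown`) are hypothetical, and the mapping from resource kinds to categories is an assumption made for illustration; the real mapping would have to be defined per policy kind.

```go
package main

// Action is what the controller has to do in response to an event.
type Action int

const (
	RebuildTopology   Action = iota // the graph itself changed
	RecomputePolicies               // graph unchanged, effective policies are stale
	ReapplyState                    // a derived object drifted from the computed state
)

// Event is a simplified resource event.
type Event struct {
	Kind string // e.g. "Gateway", "HTTPRoute", "RateLimitPolicy", "WasmPlugin"
	Op   string // "create", "update" or "delete"
}

// Classify maps an event to an action. The mapping is an assumption:
// derived objects only need reapplying, a policy update only needs
// recomputing (assuming its target did not change), and everything
// else is conservatively treated as a topology change.
func Classify(e Event) Action {
	switch e.Kind {
	case "WasmPlugin", "EnvoyFilter", "Limitador":
		return ReapplyState
	case "RateLimitPolicy":
		if e.Op == "update" {
			return RecomputePolicies
		}
		return RebuildTopology
	default:
		return RebuildTopology
	}
}

// Graph holds parent-to-children edges, e.g. gateway -> routes.
type Graph map[string][]string

// TopDown returns root followed by every node reachable from it in
// breadth-first order, visiting each node once. For the two-level
// gateway -> route graph this lists the gateway before its routes.
func (g Graph) TopDown(root string) []string {
	seen := map[string]bool{root: true}
	order := []string{root}
	for i := 0; i < len(order); i++ {
		for _, child := range g[order[i]] {
			if !seen[child] {
				seen[child] = true
				order = append(order, child)
			}
		}
	}
	return order
}
```

A deeper graph, where a node can be reached through paths of different lengths, would need a proper topological sort rather than a breadth-first walk to guarantee that every parent is processed before its children.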

Reasons to do it

Significantly reduce the number of requests to the kube API, and therefore also improve the performance (speed) of reconciliation
Move away from annotations as the way to track back-refs to the policies, by relying on the DAG to navigate the topology instead
Simplify reconciliation loop regarding detection of the kind of resource event
Improve clarity regarding the different kinds of events that trigger reconciliation (by having to define each kind of event and corresponding callback function) → improve coverage of scenarios (kinds of resource events)
Possibility to react quicker and more efficiently, by sometimes not having to trigger “full” reconciliation but acting more directly according to each kind of event

Reasons NOT to do it

Involves rewriting the operators
Possibly more resources (CPU, Mem) required by the policy controller

Challenges

Bootstrapping the tree of pre-existing resources in-memory may take a non-negligible amount of time, so the impact on the readiness state of the controller needs to be considered
Achieve a sufficient level of abstraction so that it works for all policy implementers (i.e. not only for Kuadrant)
Avoid re-inventing the wheel – watch out for weird combinations of the informers pattern and straightforward reconcilers
Reeducate devs on the new pattern – no longer “textbook” controller-runtime
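
On the bootstrapping challenge, one option is to report the controller as not ready until the initial topology has been built. The sketch below is a stdlib-only illustration with hypothetical names (`TopologyReadiness`, `MarkBuilt`, `Check`). The `Check` method deliberately has the shape `func(*http.Request) error`, which is the shape controller-runtime uses for its health checkers, so it could presumably be registered as a readiness check there. That wiring is an assumption and is not shown.

```go
package main

import (
	"errors"
	"net/http"
	"sync/atomic"
)

// TopologyReadiness tracks whether the initial in-memory topology has
// been built from the resources that already exist in the cluster.
type TopologyReadiness struct {
	built atomic.Bool
}

// MarkBuilt is meant to be called once the bootstrap pass has finished.
func (r *TopologyReadiness) MarkBuilt() {
	r.built.Store(true)
}

// Check returns an error until MarkBuilt has been called, so a readiness
// probe backed by it fails while the topology is still being loaded.
func (r *TopologyReadiness) Check(_ *http.Request) error {
	if !r.built.Load() {
		return errors.New("topology not yet bootstrapped")
	}
	return nil
}
```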
