How the kube-rs library works

marshtompsxd commented 1 year ago

We plan to build the controller implementation on top of the kube-rs library: https://github.com/kube-rs/kube.

To implement a controller using kube-rs, the developer needs to provide three things.

A reconcile function: async fn reconcile(Arc<CR>, Arc<Ctx>) -> Result<Action, Error>
Triggering conditions
Error policy

Reconcile function

Each controller program can have multiple reconcile functions, and each reconcile function only manages one CR (custom resource) type. The controller monitors all the instances (objects) of the CR type and calls the reconcile function to make sure the actual cluster state matches the desired state described in the CR object. The reconcile function is only called when the cluster state related to some CR object is changed (see the triggering conditions below).

The first argument of the reconcile function points to the related CR object, and the second usually points to some context data provided by the developer (e.g., the client to talk to Kubernetes API). Note that the reconcile function does not know exactly what event triggers it. Instead, it only knows that the cluster state related to the CR object just got changed so it needs to take some action to reconcile the cluster state again. This pattern is also called level-triggering.

The reconcile function usually scans the objects related to (or, owned by) the triggering CR object, checks whether they match the desired state, and issues corrective updates if not.
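The scan-compare-correct loop described above can be sketched as a pure function. This is only an illustrative model using the standard library, not kube-rs code: the Correction enum, the plan_corrections function, and the idea of describing each owned object by a name and a replica count are all assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical corrective update that a reconcile call would issue.
#[derive(Debug, Clone, PartialEq)]
enum Correction {
    Create { name: String, replicas: u32 },
    Update { name: String, replicas: u32 },
    Delete { name: String },
}

/// Compares the desired state (derived from the CR object) with the actual
/// state of the owned objects and returns the updates needed to converge.
/// Nothing here depends on which event triggered the call (level-triggering).
fn plan_corrections(
    desired: &BTreeMap<String, u32>,
    actual: &BTreeMap<String, u32>,
) -> Vec<Correction> {
    let mut plan = Vec::new();
    // Create missing objects and fix objects whose state has drifted.
    for (name, &want) in desired {
        match actual.get(name) {
            None => plan.push(Correction::Create { name: name.clone(), replicas: want }),
            Some(&have) if have != want => {
                plan.push(Correction::Update { name: name.clone(), replicas: want })
            }
            Some(_) => {}
        }
    }
    // Delete objects that exist but are no longer desired.
    for name in actual.keys() {
        if !desired.contains_key(name) {
            plan.push(Correction::Delete { name: name.clone() });
        }
    }
    plan
}
```

Because the function only looks at the current desired and actual states, calling it again after a missed or duplicated event still produces the right corrections.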

The reconcile function returns a Result, which is either an Action (on success) or an Error. There are two types of Action: (1) requeue the reconcile for the same object, which means reconcile will be called again with the same argument after X seconds, or (2) wait for the next triggering event. When an error is returned, the developer-specified error policy decides the next action, which is usually to requeue the reconcile.
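A minimal sketch of how the return value and the error policy fit together. The types below are stand-ins written for this example (they are not imported from kube-rs), and the 5-second retry delay is an arbitrary assumption.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the action a reconcile call asks for.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    /// Call reconcile again for the same object after the given delay.
    Requeue(Duration),
    /// Do nothing until the next triggering event arrives.
    AwaitChange,
}

/// Hypothetical error type returned by a failed reconcile.
#[derive(Debug, Clone, PartialEq)]
struct ReconcileError(String);

/// A simple error policy: always retry after a fixed (assumed) delay.
fn error_policy(_err: &ReconcileError) -> Action {
    Action::Requeue(Duration::from_secs(5))
}

/// Decides what happens next given the outcome of one reconcile call:
/// a successful call chooses its own action, a failed one defers to the policy.
fn next_action(outcome: Result<Action, ReconcileError>) -> Action {
    match outcome {
        Ok(action) => action,
        Err(err) => error_policy(&err),
    }
}
```

In this model, a reconcile that returns Ok(Action::AwaitChange) is not called again until a triggering condition fires, while any error leads to a requeue.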

Triggering conditions

The triggering conditions decide when the reconcile function is called. There are four types of triggering conditions:

If any CR (custom resource) object o is created/modified, trigger with o
If any T object is created/modified/deleted, trigger with the owner CR object o
If any T object is created/modified/deleted, trigger with the relevant CR object o
If any event is sent from a channel, trigger with every CR object stored in the cache

The first condition is compulsory and the others are optional. Most controllers will choose condition 2 for all the objects created by the controller.
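The four conditions can be read as a mapping from an incoming event to the set of CR objects to reconcile. The sketch below models that mapping with plain standard-library types. Trigger, CrKey, requests_for, and label_mapper are hypothetical names, and using a developer-supplied mapper function for condition 3 is an assumption about how "relevant" is decided.

```rust
/// Hypothetical key identifying one CR object.
#[derive(Debug, Clone, PartialEq)]
struct CrKey(String);

/// Hypothetical events, one variant per triggering condition.
enum Trigger {
    /// Condition 1: the CR object itself was created or modified.
    CrChanged(CrKey),
    /// Condition 2: an owned object changed; `owner` is read from its owner reference.
    OwnedChanged { owner: Option<CrKey> },
    /// Condition 3: some other object changed; a mapper picks the relevant CRs.
    WatchedChanged { labels: Vec<String> },
    /// Condition 4: a message arrived on the channel.
    ReconcileAll,
}

/// Example mapper for condition 3 (an assumption): each label names a CR object.
fn label_mapper(labels: &[String]) -> Vec<CrKey> {
    labels.iter().map(|l| CrKey(l.clone())).collect()
}

/// Returns the CR objects that should be reconciled for a given trigger.
/// `cache` plays the role of the reflector's store of CR objects.
fn requests_for(
    trigger: &Trigger,
    cache: &[CrKey],
    mapper: impl Fn(&[String]) -> Vec<CrKey>,
) -> Vec<CrKey> {
    match trigger {
        Trigger::CrChanged(key) => vec![key.clone()],
        // An object without an owner reference triggers nothing.
        Trigger::OwnedChanged { owner } => owner.iter().cloned().collect(),
        Trigger::WatchedChanged { labels } => mapper(labels.as_slice()),
        Trigger::ReconcileAll => cache.to_vec(),
    }
}
```

Note that only condition 4 reads the cache, which matches the role of the reflector described below.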

Error policy

The error policy is usually simple: just requeue the reconcile.

Important components of the kube-rs framework

The kube-rs framework does a lot of work in the background to make it easy to write controller programs. The figure below shows the interaction between kube-rs, the reconcile function, and the Kubernetes API:

At a high level, kube-rs (1) sets up watchers that keep monitoring changes to different types of objects in the Kubernetes API, (2) filters out irrelevant events according to the triggering conditions, and (3) invokes the reconcile function with the related CR object. kube-rs has several important components:

Watcher, which watches the cluster state and receives a stream of notifications from the Kubernetes API. Note that each watcher only watches one type of object.
Reflector, which maintains a cache that contains the CR objects only. The cache is updated by the stream from the watcher. Note that the reconcile function does NOT read the cache maintained by the reflector. The cache is only used by triggering condition No. 4 documented above.
Scheduler (not shown in the above figure), which maintains a queue of all the reconcile requests. The scheduler deduplicates reconcile requests: if a reconcile request with CR object o comes while another reconcile request with the same CR object already exists in the queue, the request that is scheduled later will be discarded and the other one remains in (or gets added to) the queue.
Runner (not shown in the above figure), which polls the scheduler to get the scheduled reconcile request and invokes the reconcile function with the corresponding CR object. Note that the runner makes sure it never runs reconcile concurrently for the same CR object. Concurrent reconcile for different CR objects is allowed.
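The scheduler's deduplication and the runner's no-concurrent-reconcile guarantee can be modeled in a few lines. This is a simplified, synchronous sketch with hypothetical Scheduler and Runner structs, not the kube-rs implementation: it identifies each CR object by a string key and ignores the scheduled run time, so an incoming request is simply dropped when one for the same key is already queued.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical scheduler: a FIFO queue with at most one request per key.
#[derive(Default)]
struct Scheduler {
    queue: VecDeque<String>,
    pending: HashSet<String>,
}

impl Scheduler {
    /// Queues a request unless one for the same key is already pending.
    /// Returns true if the request was added, false if it was deduplicated.
    fn schedule(&mut self, key: &str) -> bool {
        if self.pending.insert(key.to_string()) {
            self.queue.push_back(key.to_string());
            true
        } else {
            false
        }
    }
}

/// Hypothetical runner: tracks which keys are currently being reconciled.
#[derive(Default)]
struct Runner {
    in_flight: HashSet<String>,
}

impl Runner {
    /// Takes the first queued request whose key is not already running.
    /// Requests for in-flight keys stay queued until `finish` is called.
    fn start_next(&mut self, scheduler: &mut Scheduler) -> Option<String> {
        let pos = scheduler
            .queue
            .iter()
            .position(|k| !self.in_flight.contains(k))?;
        let key = scheduler.queue.remove(pos)?;
        scheduler.pending.remove(&key);
        self.in_flight.insert(key.clone());
        Some(key)
    }

    /// Marks the reconcile for `key` as completed.
    fn finish(&mut self, key: &str) {
        self.in_flight.remove(key);
    }
}
```

In this model, a change that arrives while a reconcile for the same object is running is queued again rather than lost, but it only starts after the running reconcile finishes.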

Watcher

Watcher mainly does two things: (1) issues an initial list to the Kubernetes API for a particular type of resource and (2) continuously watches all subsequent changes to that resource type from the Kubernetes API. Each controller sets up a watcher for each resource type it needs to watch (including the custom resource type).

More concretely, each watcher starts a future stream by running a state machine in step_trampolined: it starts from state empty and issues a list for a particular resource type to the Kubernetes API; after the list succeeds, it starts to watch for any changes to objects of this resource type. Every object read by the list or watch is sent to the future stream.
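The list-then-watch state machine can be illustrated with a small synchronous model. Everything below is hypothetical and much simpler than the real step_trampolined: the Source trait, the FakeApi in-memory backend, and the step function are names invented for this sketch, objects are plain strings, and a failed list simply leaves the watcher in the empty state so that the next step retries.

```rust
use std::collections::VecDeque;

/// The two states described above.
#[derive(Debug, Clone, PartialEq)]
enum State {
    Empty,
    Watching,
}

/// Hypothetical view of the API for one resource type.
trait Source {
    /// Returns all existing objects of the resource type.
    fn list(&mut self) -> Result<Vec<String>, String>;
    /// Returns the next changed object, if one is available.
    fn next_change(&mut self) -> Option<String>;
}

/// In-memory fake used to exercise the state machine.
struct FakeApi {
    existing: Vec<String>,
    changes: VecDeque<String>,
}

impl Source for FakeApi {
    fn list(&mut self) -> Result<Vec<String>, String> {
        Ok(self.existing.clone())
    }
    fn next_change(&mut self) -> Option<String> {
        self.changes.pop_front()
    }
}

/// Runs one step: returns the objects to emit on the stream and the next state.
fn step(source: &mut impl Source, state: State) -> (Vec<String>, State) {
    match state {
        // Start with a list; only move on to watching once it succeeds.
        State::Empty => match source.list() {
            Ok(objects) => (objects, State::Watching),
            Err(_) => (Vec::new(), State::Empty),
        },
        // Afterwards, forward every change that the watch reports.
        State::Watching => match source.next_change() {
            Some(object) => (vec![object], State::Watching),
            None => (Vec::new(), State::Watching),
        },
    }
}
```

Calling step repeatedly, feeding each returned state back in, emits the listed objects first and then one object per observed change.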

Each time new is called to construct a controller, or owns is called to set up more triggers, a watcher is set up for the corresponding type of resource objects. All the watchers of a controller are grouped into one trigger_selector. Later, when run is called, an applier is constructed and the trigger_selector (as a queue) is used to feed the stream of objects to the scheduler and runner, which then trigger the reconcile function with the corresponding resource object.

lalithsuresh commented 1 year ago

Scheduler (not shown in the above figure), which maintains a queue of all the reconcile requests. The scheduler deduplicates reconcile requests: if a reconcile request with CR object o comes while another reconcile request with the same CR object already exists in the queue, the request that is scheduled later will be discarded and the other one remains in (or gets added to) the queue.

Interesting. So if two notifications for an object o happen for versions o1 followed by o2 while the reconcile for o1 has not yet happened, then reconcile() will only be called with the argument o1 and not o2?

marshtompsxd commented 1 year ago

Yes. But the best practice is to always first call k8s_client.Get(o.key) when entering reconcile() so that you don't make decisions based on the older version of o (tho we know this doesn't completely prevent staleness issues...)

lalithsuresh commented 1 year ago

Indeed. Even calling Get(o.key) still leaves open the possibility for time-of-check/time-of-use bugs.
