Building a resource assignment policy using NRI, including policies for native/compute resources.

klihub commented 4 years ago

We'd like to try and retrofit the current functionality for pod/container resource assignment in CRI Resource Manager/CRI-RM as an NRI plugin. Our goals are

primary: implement a reasonable default hardware-topology aware assignment policy, and
secondary: provide a way for plugging in special application-/vertical-specific policies

The resources of interest are

the tangible HW resources the kernel lets us arbitrate
- native/compute resources (CPU, memory, huge pages)
- devices
- LLC cache and memory bandwidth
a few other things the kernel lets us control, for instance
- block I/O throttling
- RT-scheduling/arbitration of time slices set aside for RT processes

To get our full policy scope working we'd need a way to

track information about running pods and containers
tap into container creation/deletion/update requests and let the runtime know about resource allocation decisions
have an assortment of pre- and post-variants/hooks related to container life-cycle events (create/delete/start/stop)
make changes to existing/running containers, not just the ones being created

These bits of functionality are necessary for the following reasons.

1) We need to track which resources are allocated to containers and which are free. Likewise, we need to be able to figure out the relationship between pods and containers, so that things like intra-pod affinity can be implemented, where containers within a pod are placed close to each other in the hardware topology sense.

2) Resources are assigned to containers in connection with certain lifecycle events. A resource allocation policy plugin needs to tap into these events, make decisions, and let the rest of the container runtime know about and apply those decisions.

3) Enforcing some of the policy decisions requires more accurate alignment with the lifecycle of the container than the CRI requests can provide. For instance, assigning a container to an LLC cache clos/class happens by writing the container's process pids to a special pseudo-filesystem entry. Therefore, enforcing a container's LLC class is best done in connection with the CRI start request, once a process to run the container command has been forked but before it has actually exec'd the eventual container command. Tapping into the related basic CRI requests does not provide the necessary resolution for achieving this.

4) Sometimes even simple resource policy decisions related to processing a CRI request have further resource-related consequences for containers other than the one directly involved in the CRI request. For instance, when a container running on a set of exclusively allocated CPUs is deleted, all containers without exclusive CPU allocations that run in the same 'HW topology CPU pool' should be updated to allow them to run on the newly freed cpuset.
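To illustrate point 3: on Linux, the resctrl pseudo-filesystem (commonly mounted at /sys/fs/resctrl) assigns a task to a class when its PID is written to that class's `tasks` file. Below is a minimal sketch of that step, assuming the class directory has already been created. The function name `assignToClass` and its `root` parameter are hypothetical and are not taken from CRI-RM or NRI.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// assignToClass moves the task with the given pid into a resctrl class by
// writing the pid to <root>/<class>/tasks. root is the resctrl mount point,
// typically /sys/fs/resctrl. Hypothetical helper, for illustration only.
func assignToClass(root, class string, pid int) error {
	if pid <= 0 {
		return fmt.Errorf("invalid pid %d", pid)
	}
	tasks := filepath.Join(root, class, "tasks")

	// Open without O_CREATE: a missing class directory should be an error,
	// not something we silently paper over.
	f, err := os.OpenFile(tasks, os.O_WRONLY, 0)
	if err != nil {
		return fmt.Errorf("open %s: %w", tasks, err)
	}
	defer f.Close()

	// Write a single pid per call.
	if _, err := f.WriteString(strconv.Itoa(pid)); err != nil {
		return fmt.Errorf("assign pid %d to class %q: %w", pid, class, err)
	}
	return nil
}
```

The timing is the important part: a call like this would have to happen after the container's initial process has been forked but before it execs the container command, which is why a hook finer-grained than the CRI start request is needed.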

In its current incarnation, CRI-RM sits as a CRI proxy between clients (really just the kubelet) and the runtime. It is non-transparent in nature: according to policies, it might modify key CRI requests related to the container lifecycle (creation, update) before forwarding them, and it might also generate unsolicited requests to update otherwise unrelated containers that policy decisions had a (resource-related) effect on. After an initial pod and container discovery, CRI-RM keeps its internal cache up to date according to the intercepted/modified CRI requests and responses, so its 'NRI-like plugins' (the active policy running inside CRI-RM) have access to nearly all information about pods and containers.
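The "unsolicited update" case can be sketched roughly as follows. When a container holding exclusive CPUs is removed, its CPUs go back to the shared set of its pool, and every remaining container in that pool without an exclusive allocation needs a cpuset update. All types and names here (`Cache`, `Pool`, `Update`, `Remove`) are hypothetical and heavily simplified. This is not CRI-RM's actual data model.

```go
package main

import "sort"

// Container is a simplified, hypothetical view of a tracked container.
type Container struct {
	ID        string
	Pool      string // name of the HW topology CPU pool it runs in
	Exclusive []int  // exclusively allocated CPUs, empty if none
}

// Pool holds the CPUs shared by containers without exclusive allocations.
type Pool struct {
	Shared map[int]bool
}

// Cache tracks all known containers and pools.
type Cache struct {
	Containers map[string]*Container
	Pools      map[string]*Pool
}

// Update is a pending cpuset change for one container.
type Update struct {
	ContainerID string
	CPUs        []int
}

// Remove drops container id from the cache. If it held exclusive CPUs, they
// are returned to the shared set of its pool, and an Update is produced for
// every remaining container in that pool that has no exclusive allocation.
func (c *Cache) Remove(id string) []Update {
	ctr, ok := c.Containers[id]
	if !ok {
		return nil
	}
	delete(c.Containers, id)

	pool, ok := c.Pools[ctr.Pool]
	if !ok || len(ctr.Exclusive) == 0 {
		return nil // nothing was freed, nobody else is affected
	}
	for _, cpu := range ctr.Exclusive {
		pool.Shared[cpu] = true
	}

	shared := make([]int, 0, len(pool.Shared))
	for cpu := range pool.Shared {
		shared = append(shared, cpu)
	}
	sort.Ints(shared)

	var updates []Update
	for _, other := range c.Containers {
		if other.Pool == ctr.Pool && len(other.Exclusive) == 0 {
			// Copy the slice so updates do not share a backing array.
			cpus := append([]int(nil), shared...)
			updates = append(updates, Update{ContainerID: other.ID, CPUs: cpus})
		}
	}
	// Map iteration order is random; sort for a deterministic result.
	sort.Slice(updates, func(i, j int) bool {
		return updates[i].ContainerID < updates[j].ContainerID
	})
	return updates
}
```

In a proxy setup, each returned `Update` would turn into an update request for a container that the original CRI request never mentioned.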

Although the current proxy-based implementation can modify virtually any aspect of a container, the things we actually do, and therefore currently think should be possible for NRI plugins (for our purposes), are the following:

alter container resources (cpuset.{cpus,mems}, CFS {shares,quota,period}, memory limit, hugepage limits)
alter devices
alter environment variables
add extra mounts (used for exposing/updating extra, resource-related information to containers)
probably/maybe alter annotations and labels

All of these except container resources can be altered only during container creation.
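As a rough sketch of that restriction: a plugin-side check could accept any alteration at creation time but only resource changes afterwards. The `Adjustment` struct, the `Event` values and the `Validate` function below are hypothetical illustrations and not NRI's API. Hugepage limits are left out for brevity.

```go
package main

import "errors"

// Event is the lifecycle event an adjustment is requested for (hypothetical).
type Event int

const (
	CreateContainer Event = iota
	UpdateContainer
)

// Resources mirrors the resource knobs listed above (hugepage limits omitted).
type Resources struct {
	CpusetCpus  string
	CpusetMems  string
	CPUShares   uint64
	CPUQuota    int64
	CPUPeriod   uint64
	MemoryLimit int64
}

// Adjustment is a hypothetical set of changes a policy wants to apply.
type Adjustment struct {
	Resources   *Resources
	Devices     []string
	Env         map[string]string
	Mounts      []string
	Annotations map[string]string
	Labels      map[string]string
}

// ErrCreateOnly is returned when a non-resource change is requested
// for a container that already exists.
var ErrCreateOnly = errors.New("only resources can be altered after container creation")

// Validate allows everything at creation time and only resource
// changes for any later event.
func Validate(ev Event, adj Adjustment) error {
	if ev == CreateContainer {
		return nil
	}
	if len(adj.Devices) > 0 || len(adj.Env) > 0 || len(adj.Mounts) > 0 ||
		len(adj.Annotations) > 0 || len(adj.Labels) > 0 {
		return ErrCreateOnly
	}
	return nil
}
```

For example, `Validate(UpdateContainer, Adjustment{Resources: &Resources{CpusetCpus: "0-3"}})` passes, while the same call with a non-empty `Mounts` field returns `ErrCreateOnly`.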

Currently it is unclear to me how/if the following things could be achieved with the current NRI architecture/infra:

tracking/querying all pods/sandboxes and containers (other than just the one the current request is directly operating on)
some of the alterations mentioned above (environment variables, mounts)
altering resources of containers other than the one being directly operated on by the current request

RenaudWasTaken commented 3 years ago

cc @crosbymichael

fuweid commented 3 years ago

/cc

klihub commented 3 years ago

@fuweid We have an initial limited proto of this (https://github.com/containerd/containerd/pull/6019). We're addressing the first round of review comments by Mike (who also asked me to ping you about this). Any feedback or other comments/questions are welcome.

fuweid commented 3 years ago

@klihub Thanks! Change is big and I need time to review this. Will comment in the PRs.

klihub commented 1 year ago

The recently merged PR #16 adds enough plumbing that most of the goals outlined above are now doable.

containerd / nri

Building a resource assignment policy using NRI, including policies for native/compute resources. #3