Closed klihub closed 1 year ago
cc @crosbymichael
/cc
@fuweid We have an initial limited proto of this (https://github.com/containerd/containerd/pull/6019). We're addressing the first round of review comments by Mike (who also asked me to ping you about this). Any feedback or other comments/questions are welcome.
@klihub Thanks! Change is big and I need time to review this. Will comment in the PRs.
The recently merged PR #16 adds enough plumbing that most of the goals outlined above are now doable.
We'd like to try and retrofit the current functionality for pod/container resource assignment in CRI Resource Manager/CRI-RM as an NRI plugin. Our goals are
The resources of interest are
To get our full policy scope working we'd need a way to
These bits of functionality are necessary for the following reasons.

1. We need to track which resources are allocated to containers and which are free. Likewise, we need to be able to figure out the relationship between pods and containers, so that things like intra-pod affinity can be implemented, where containers within a pod are placed close to each other in the hardware topology sense.
2. Resources are assigned to containers in connection with certain lifecycle events. A resource allocation policy plugin needs to tap into these events, make decisions, and let the rest of the container runtime know about and apply those decisions.
3. Enforcing some of the policy decisions requires more accurate alignment with the lifecycle of the container than the CRI requests can provide. For instance, assigning a container to an LLC cache clos/class happens by writing the container's process pids to a special pseudo-filesystem entry. Therefore enforcing a container LLC class is best done in connection with the CRI start request, once a process to run the container command has been forked but before it has actually exec'd the eventual container command. Tapping into the related basic CRI requests does not provide the necessary resolution for achieving this.
4. Sometimes even simple resource policy decisions related to processing a CRI request have further resource-related consequences for containers other than the one directly involved in the request. For instance, when a container running on a set of exclusively allocated CPUs is deleted, all containers without exclusive CPU allocations running in the same 'HW topology CPU pool' should be updated to allow them to run on the newly freed CPUs.
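To make the last reason more concrete, here is a minimal Go sketch of the bookkeeping involved when a container with exclusive CPUs goes away. It is not CRI-RM or NRI code: the `Container` and `Update` types, the pool names, and `removeContainer` are all hypothetical and only illustrate why one delete request can fan out into updates for other containers.

```go
package main

import "sort"

// Container is a hypothetical, simplified view of a running container.
type Container struct {
	ID        string
	Pool      string // hypothetical name of the HW topology CPU pool
	Exclusive []int  // exclusively allocated CPUs; empty for shared containers
}

// Update is a hypothetical unsolicited cpuset update for one container.
type Update struct {
	ContainerID string
	CPUs        []int
}

// sharedCPUs returns the CPUs of a pool that no container in that pool
// holds exclusively.
func sharedCPUs(poolCPUs []int, pool string, containers map[string]Container) []int {
	taken := map[int]bool{}
	for _, c := range containers {
		if c.Pool != pool {
			continue
		}
		for _, cpu := range c.Exclusive {
			taken[cpu] = true
		}
	}
	shared := []int{}
	for _, cpu := range poolCPUs {
		if !taken[cpu] {
			shared = append(shared, cpu)
		}
	}
	sort.Ints(shared)
	return shared
}

// removeContainer drops a container from the bookkeeping and returns the
// updates needed for the shared containers of the same pool, so that they
// can start using any CPUs the removed container held exclusively.
func removeContainer(id string, poolCPUs map[string][]int, containers map[string]Container) []Update {
	gone, ok := containers[id]
	if !ok {
		return nil
	}
	delete(containers, id)
	if len(gone.Exclusive) == 0 {
		return nil // the shared set of the pool did not change
	}
	shared := sharedCPUs(poolCPUs[gone.Pool], gone.Pool, containers)
	var updates []Update
	for _, c := range containers {
		if c.Pool == gone.Pool && len(c.Exclusive) == 0 {
			cpus := append([]int(nil), shared...) // each update gets its own copy
			updates = append(updates, Update{ContainerID: c.ID, CPUs: cpus})
		}
	}
	sort.Slice(updates, func(i, j int) bool {
		return updates[i].ContainerID < updates[j].ContainerID
	})
	return updates
}
```

The point of the sketch is that the list of updates is produced while handling a request for a different container, which is exactly the kind of side effect a plugin would need a way to hand back to the runtime.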
In its current incarnation, CRI-RM sits as a CRI proxy between clients (the kubelet only, really) and the runtime. It is non-transparent in nature: according to policies, it might modify key CRI requests related to container lifecycle (creation, update) before forwarding them, and it might also generate unsolicited requests to update otherwise unrelated containers that policy decisions had a (resource-related) effect on. After an initial pod and container discovery, CRI-RM keeps its internal cache up to date according to the intercepted/modified CRI requests and responses, so its 'NRI-like plugins' (the active policy running inside CRI-RM) have access to nearly all information about pods and containers.
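As a rough illustration of the kind of cache described above, the following Go sketch tracks pods and containers and answers the "which containers share a pod" question needed for intra-pod affinity. It is an assumption-laden toy, not the CRI-RM cache: `Cache`, `CachedPod`, `CachedContainer` and all method names are made up for this example, and a real cache would be fed from intercepted CRI requests and responses.

```go
package main

import (
	"fmt"
	"sort"
)

// CachedPod and CachedContainer are hypothetical, heavily simplified records.
type CachedPod struct {
	ID   string
	Name string
}

type CachedContainer struct {
	ID    string
	PodID string
	Name  string
}

// Cache is a hypothetical in-memory view of pods and containers.
type Cache struct {
	pods       map[string]CachedPod
	containers map[string]CachedContainer
}

func NewCache() *Cache {
	return &Cache{
		pods:       map[string]CachedPod{},
		containers: map[string]CachedContainer{},
	}
}

// AddPod records a pod, for instance after a pod sandbox has been created.
func (c *Cache) AddPod(p CachedPod) { c.pods[p.ID] = p }

// AddContainer records a container and rejects it if its pod is unknown.
func (c *Cache) AddContainer(ctr CachedContainer) error {
	if _, ok := c.pods[ctr.PodID]; !ok {
		return fmt.Errorf("container %s refers to unknown pod %s", ctr.ID, ctr.PodID)
	}
	c.containers[ctr.ID] = ctr
	return nil
}

// RemoveContainer forgets a container, for instance after it has been removed.
func (c *Cache) RemoveContainer(id string) { delete(c.containers, id) }

// Siblings returns the sorted IDs of the other containers in the same pod.
func (c *Cache) Siblings(id string) []string {
	ctr, ok := c.containers[id]
	if !ok {
		return nil
	}
	var ids []string
	for _, other := range c.containers {
		if other.PodID == ctr.PodID && other.ID != id {
			ids = append(ids, other.ID)
		}
	}
	sort.Strings(ids)
	return ids
}
```

A policy could call something like `Siblings` at container creation time to decide whether the new container should be placed near already running containers of the same pod.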
Although the current proxy-based implementation can modify virtually any aspect of a container, the things we actually do, and therefore currently think should be possible for NRI plugins (for our purposes), are the following:
All of these except modifying container resources can be altered only during container creation.
Currently it is unclear to me how/if the following things could be achieved with the current NRI architecture/infra: