containerd / nri

Node Resource Interface
Apache License 2.0

Extend scope to enable common pluggable runtime extensions. #16

Closed: klihub closed this 1 year ago

klihub commented 2 years ago

Background

This PR aims to extend NRI beyond its current v1 capabilities. The primary goal is to allow NRI to act as a common infrastructure for plugging in extensions to runtimes. Among other things, these runtime extensions could then implement vendor- and domain-specific custom logic for improved container configuration.

This PR is related both to this KubeCon talk and to this NRI issue.

Description

Note: For a more detailed description please refer to the README.

This PR

Using the plugin- and runtime-integration APIs allows one to hook plugins into the lifecycle of containers managed by a runtime. These plugins can then adjust a limited set of container parameters in connection with some of the container's lifecycle state transitions.

The proposal changes the v1 hook-like incarnation of plugins to more daemon-like entities, where a single instance of a plugin is responsible for handling the full stream of events intended for that plugin. It also defines a ttRPC-based plugin protocol which allows plugins to subscribe to lifecycle events of pods and containers and to make changes to containers during a container's creation, update and stop events. Adjustment is possible to the following container parameters during a container's creation:

Apart from creation time, a container's assigned resources can also be updated by a plugin.
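
For illustration only, here is a minimal, self-contained Go sketch of the daemon-like plugin model described above. The `Plugin`, `Container`, and `ContainerAdjustment` types below are hypothetical stand-ins, not the actual ttRPC API proposed in this PR (see the README for the real protocol definitions).

```go
package main

import "fmt"

// Hypothetical stand-ins for the protocol types; the real API is defined by
// the ttRPC protocol in this PR (see the README), so treat these as sketches.
type Container struct {
	ID string
}

type ContainerAdjustment struct {
	Annotations map[string]string
	CPUSet      string
}

// A daemon-like plugin handles the full stream of events it subscribed to,
// instead of being invoked once per event like a v1 hook.
type Plugin interface {
	// CreateContainer may adjust the container being created.
	CreateContainer(c *Container) (*ContainerAdjustment, error)
	// StopContainer is notified when a container is stopped.
	StopContainer(c *Container) error
}

type examplePlugin struct{}

func (p *examplePlugin) CreateContainer(c *Container) (*ContainerAdjustment, error) {
	// A trivial creation-time adjustment: annotate the container and pin it
	// to a fixed cpuset.
	return &ContainerAdjustment{
		Annotations: map[string]string{"example.org/pinned": "true"},
		CPUSet:      "0-3",
	}, nil
}

func (p *examplePlugin) StopContainer(c *Container) error {
	fmt.Println("container stopped:", c.ID)
	return nil
}

func main() {
	// In the real proposal the plugin registers over a ttRPC connection and
	// the runtime drives these callbacks; here we simply call them directly.
	var p Plugin = &examplePlugin{}
	adj, _ := p.CreateContainer(&Container{ID: "c0"})
	fmt.Println("requested cpuset:", adj.CPUSet)
	_ = p.StopContainer(&Container{ID: "c0"})
}
```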

Related PRs

There are two more PRs closely related to this proposal:

Additionally there is a work-in-progress implementation of a more complex, custom resource assignment algorithm which exercises most of the capabilities offered by this proposal.

cpuguy83 commented 2 years ago

I'm not sure I like the idea of making this much more complex. Today one can do stateful plugins by having the plugin call out to a daemon using whatever protocol they like.

That said, I keep wanting to use NRI to do things, and it doesn't have the information available to do what I need.

klihub commented 2 years ago

I'm not sure I like the idea of making this much more complex. Today one can do stateful plugins by having the plugin call out to a daemon using whatever protocol they like.

There are undeniable pros to a hook-like setup; for instance, it does not matter how much memory you leak while processing a single request, because the leaks do not accumulate. Then there are cons, for instance the overhead you incur for processing a single request, and whether your request processing is inherently stateful (IOW, whether the result of processing one request affects subsequent requests). I guess whether the pros outweigh the cons of a hook-like approach versus a daemon-like one depends on the collective nature of the things you try to accomplish with NRI.

That said, I keep wanting to use NRI to do things, and it doesn't have the information available to do what I need.

If it is possible to share what you'd like to accomplish using NRI it would be interesting to hear it. We could check if our proposed changes could make those possible and if not what it would take to make them possible.

mikebrow commented 2 years ago

Note: with respect to moving to an RPC (ttrpc/grpc), see the CNI v2 discussion about migrating to RPC: https://github.com/containernetworking/cni/issues/785

codecov-commenter commented 2 years ago

Codecov Report

Base: 0.00% // Head: 68.97% // Increases project coverage by +68.97%

Coverage data is based on head (ad343a8) compared to base (c2b2513). Patch coverage: 68.97% of modified lines in pull request are covered.

Coverage diff:

```diff
@@            Coverage Diff            @@
##           main      #16       +/-   ##
==========================================
+ Coverage      0   68.97%   +68.97%
==========================================
  Files         0        8        +8
  Lines         0     1531     +1531
==========================================
+ Hits          0     1056     +1056
- Misses        0      344      +344
- Partials      0      131      +131
```

Impacted files:

| File | Coverage Δ |
|---|---|
| pkg/net/conn.go | 34.48% |
| pkg/adaptation/config.go | 35.48% |
| pkg/adaptation/plugin.go | 47.80% |
| pkg/adaptation/plugin_linux.go | 50.00% |
| pkg/adaptation/adaptation.go | 68.27% |
| pkg/net/multiplex/mux.go | 69.07% |
| pkg/adaptation/result.go | 79.14% |
| pkg/net/socketpair_unix.go | 81.25% |


klihub commented 2 years ago

@AkihiroSuda @mikebrow @haircommander If you have some time, PTAL. I tried to address a bunch of the first-round review comments: simplified the ttRPC protocol a bit, added a number of test cases and fixed the bugs they uncovered, and cleaned up some of the most offending bits. Also updated the related containerd PR against this. I haven't gotten around to giving the corresponding CRI-O PR the same treatment yet...

dmcgowan commented 2 years ago

Some follow-ups from the community meeting:

klihub commented 1 year ago

Some follow-ups from the community meeting:

  • We need to make a decision about the plugin model to support in containerd, and whether we can generalize it for this use case using the shim model (@fuweid, @mikebrow)

NRI itself does not need much for plugins so I think anything should be fine for us, as long as plugins can

  • Whether this can be integrated in containerd directly with the sandbox model along with the CRI updates to break it into components (@mxpv)

fengwang666 commented 1 year ago

@klihub I'm interested in using NRI to allocate node resources with Kata Containers. Is there any estimate of when this will be merged into containerd?

klihub commented 1 year ago

@klihub I'm interested in using NRI to allocate node resources with Kata Containers. Is there any estimate of when this will be merged into containerd?

@fengwang666 We have been making very little practical progress on this front (in terms of reviews/feedback), and slowly at that. In the previous containerd community meeting this was briefly discussed (wrt. some other pending PRs and this one). One of the reasons is, of course, that this is really a huge piece to chew on and review. But another reason seems to be that the key upstream maintainers and contributors are still very busy working on their own stuff/turf for this release cycle, so they can't spare time for reviewing big PRs. I hope this situation will gradually ease during the fall/autumn as/if those key folks get their own stuff merged, freeing up more capacity for reviews.

That said, I really don't have any date estimates. My hope is that upstream involvement in reviews will gradually pick up, so we can start making real practical progress and hopefully get an initial version in by the end of the year. Meanwhile, one of the maintainers offered to help shepherd me through this, and I'm working on the next iterations of this and the containerd integration PRs, trying to clean up and improve anything I can.

If you are interested in this, then any and all feedback is more than welcome. Also, if you are interested in writing an experimental plugin of your own to test and see whether anything is broken or missing in the current version from your point of view, I can help you get started with that.

haosdent commented 1 year ago

@klihub Awesome work! A naive question: how do we make the k8s-scheduler aware of the changes from NRI? Via labels/annotations?

klihub commented 1 year ago

@klihub Awesome work! A naive question: how do we make the k8s-scheduler aware of the changes from NRI? Via labels/annotations?

@haosdent, @kad, @askervin I am not sure what exactly you mean by that question, so I can only guess. But in general, regardless of whether you experiment with new resource assignment algorithms externally to kubelet or as new kubelet *Manager policies, the algorithm you implement should not violate the general assumptions, and therefore should not break the accounting/book-keeping done by/in the scheduler.

I know that adhering to that principle presents certain limitations on what your algorithms can do. Those limitations are already a source of problems for some of the existing policies/algorithms in the current *Managers in kubelet. But this is IMO a related yet entirely independent problem in the current K8s infrastructure, one which needs to be solved regardless of what we are doing here. We do have some initial ideas about how to lift some of those restrictions with pretty simple changes.

Anyway, if you keep that restriction in mind, you don't need to make the scheduler aware of any changes, because the scheduler pre-calculates and administers the changes in resource availability when it picks and assigns a node for the workload.
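
To make that concrete, here is a toy, self-contained Go sketch of the kind of decision such a plugin could make: it only chooses *which* CPUs to pin a container to, never *how many*, so the scheduler's book-keeping stays valid. The types and numbers are hypothetical and not part of NRI.

```go
package main

import "fmt"

// Hypothetical, simplified view of a node's NUMA topology; not an NRI type.
type numaNode struct {
	id   int
	free []int // IDs of currently unallocated CPUs on this NUMA node
}

// pickCPUs selects 'want' CPUs from a single NUMA node if any node has enough
// free CPUs. The key point: it only decides *which* CPUs the container gets,
// never *how many*, so the quantity the scheduler accounted for is unchanged.
func pickCPUs(nodes []numaNode, want int) (int, []int, bool) {
	for _, n := range nodes {
		if len(n.free) >= want {
			return n.id, n.free[:want], true
		}
	}
	return -1, nil, false
}

func main() {
	nodes := []numaNode{
		{id: 0, free: []int{2, 3}},
		{id: 1, free: []int{8, 9, 10, 11}},
	}
	// The container was scheduled with a 4-CPU request; we only choose placement.
	if node, cpus, ok := pickCPUs(nodes, 4); ok {
		fmt.Printf("pin container to NUMA node %d, cpus %v\n", node, cpus)
	} else {
		fmt.Println("no single NUMA node can satisfy the request")
	}
}
```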

Or did you have something else, or something particular, in mind?

haosdent commented 1 year ago

@klihub Thanks for your reply

don't need to make the scheduler aware of any changes

I guess in some special cases we may need to make the k8s-scheduler aware of the changes made through NRI. For example, NRI allocates NUMA resources for containers and leaves the current node without sufficient CPU/memory on any single NUMA node, even though the total CPU/memory is enough.

So I was wondering whether the Kubernetes side could be made aware of NRI changes via some mechanism, e.g. annotations or some other fields in ContainerStatus.

klihub commented 1 year ago

@klihub Thanks for your reply

don't need to make the scheduler aware of any changes

I guess in some special cases we may need to make the k8s-scheduler aware of the changes made through NRI. For example, NRI allocates NUMA resources for containers and leaves the current node without sufficient CPU/memory on any single NUMA node, even though the total CPU/memory is enough.

So I was wondering whether the Kubernetes side could be made aware of NRI changes via some mechanism, e.g. annotations or some other fields in ContainerStatus.

Yes, this is an existing and independent problem. AFAIK it also hits the existing in-kubelet policies, for instance when they are configured for strict NUMA alignment. The core of the problem mostly comes from the scheduler's simplified view of a 'flat resource space', which assumes that a sufficient total amount guarantees the satisfiability of resource requests. This is not true if the resource space is not really flat, or when the resources are not orthogonal (IOW, allocation of one set of resources affects the amount/availability of other resources and vice versa).

The same problem arises if you try to implement strict full-physical-core allocation (vs. just allocating a number of hyperthreads' worth of full cores), and in many other cases where resources are hierarchical or not fully orthogonal. It will need to be solved regardless of what we do here. Sometimes you can work around it using extended resources, but often, when you have a non-orthogonal resource model, that is not enough.
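
As a toy illustration of the 'flat resource space' pitfall (hypothetical numbers, not tied to any real API): two NUMA nodes with 2 free CPUs each give a flat total of 4 free CPUs, so the scheduler admits a 4-CPU pod, yet no single NUMA node can satisfy it under strict alignment. A minimal Go sketch:

```go
package main

import "fmt"

func main() {
	freePerNUMA := []int{2, 2} // two NUMA nodes, 2 free CPUs each
	request := 4               // pod requests 4 CPUs with strict NUMA alignment

	totalFree := 0
	fitsOnOneNode := false
	for _, f := range freePerNUMA {
		totalFree += f
		if f >= request {
			fitsOnOneNode = true
		}
	}

	fmt.Println("flat view admits the pod:        ", totalFree >= request) // true
	fmt.Println("a single NUMA node satisfies it: ", fitsOnOneNode)        // false
}
```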

haosdent commented 1 year ago

when you have a non-orthogonal resource model, that is not enough.

@klihub True, thanks for the explanations. 👍

mikebrow commented 1 year ago

Looking very good; just a couple of nits on the README. Another manual run against the now-merged containerd/containerd main build with your new NRI service plugin, and it's good to go, I think.