Closed RenaudWasTaken closed 1 year ago
Cc @crosbymichael
thanks, checking it out!
This looks like you are understanding everything correctly. What is your preference in how you would like this to work?
I think both approaches are valid. For the first one, NRI is also invoked on all Updates, so that should resolve your issue about the upper layers not being aware of the new devices. For 2: are you saying that CDI users would add a mount to the spec with that path?
Hey, good news: I hit the issue where, if a container gets an Update call after the nvidia plugin runs, it clears out the cgroup device rules. I'm going to see how we could solve this either with NRI or with other updates to the low-level components.
In my mind, the better solution is to pass to containerd the OCI changes that are requested. Operations that change the container sandbox are typically operations I want the container runtime to be aware of, so that it doesn't override my change accidentally.
Ya, so ideally, we need a "pre-create" type step in the lifecycle so that a set of NRI plugins have a chance to make modifications to the runtime spec before a container is created. Is this correct?
Sorry for the lag, I missed your message. A pre-create step is where I suspect we should be going for this.
/cc
containerd and CRI-O have both merged PRs for native/built-in CDI support which obsolete this issue.
Hello there!
For some context, this issue is part of the effort to implement the Container Device Interface (CDI) using NRI :)!
For context on CDI: if a vendor wants to add support for its device type, it creates a file (e.g.: /etc/cdi/vendor.json) that specifies the device type, the devices available on the machine, and the actions a vendor must perform. Quick example:
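A minimal sketch of what such a file might look like (the `vendor.com` kind, device name, and paths below are made up for illustration, and the field names follow a later revision of the CDI spec, so the exact schema may differ):

```json
{
  "cdiVersion": "0.3.0",
  "kind": "vendor.com/device",
  "devices": [
    {
      "name": "myDevice",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/vendor0", "type": "c", "major": 240, "minor": 0 }
        ],
        "mounts": [
          {
            "hostPath": "/usr/lib/vendor",
            "containerPath": "/usr/lib/vendor",
            "options": ["ro", "bind"]
          }
        ]
      }
    }
  ]
}
```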
Below are a few approaches for how I envision CDI working with NRI; let me know if there are different approaches or if my understanding is incorrect.
CDI on top of NRI today.
TLDR: When CRI-containerd calls StartContainer, the CDI plugin gets invoked, reads the devices requested by the container (in the `spec.resources` field), and applies the operations it reads from `/etc/cdi/vendor.json` (e.g.: `mount`, `mknod`, cgroups, ...). There are a few challenges with this approach:
- `Update`: if a container is using a device, it will lose access to it after an update (at least until the CDI plugin fixes it).
- CDI users would rely on the `spec.resources.Linux.Devices` field, which would contain an entry whose path would be `vendor.com/device=myDevice`. What would happen today is that containerd would blow up either before or after the NRI call, because that device is invalid.

OCI specification is passed up and down NRI plugins.
TLDR: When CRI-containerd calls CreateContainer, the CDI plugin gets invoked, reads the devices requested by the container, reads the operations from `/etc/cdi/vendor.json`, and applies changes to the OCI spec (including removing the CDI devices from the OCI spec). There are a few challenges with this approach:
Thanks for reading this far! Let me know what you think :) !
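As an aside, the fully-qualified device references used above (e.g. `vendor.com/device=myDevice`) are simple to split mechanically. A minimal Go sketch, assuming the `vendor/class=name` shape described in this issue (`parseQualifiedName` is a hypothetical helper, not part of NRI or CDI, and real CDI tooling performs stricter validation):

```go
package main

import (
	"fmt"
	"strings"
)

// parseQualifiedName splits a device reference of the form
// "vendor.com/class=name" into its vendor, class, and device name parts.
func parseQualifiedName(ref string) (vendor, class, name string, err error) {
	kind, name, ok := strings.Cut(ref, "=")
	if !ok || name == "" {
		return "", "", "", fmt.Errorf("missing '=' in device reference %q", ref)
	}
	vendor, class, ok = strings.Cut(kind, "/")
	if !ok || vendor == "" || class == "" {
		return "", "", "", fmt.Errorf("malformed kind %q in %q", kind, ref)
	}
	return vendor, class, name, nil
}

func main() {
	vendor, class, name, err := parseQualifiedName("vendor.com/device=myDevice")
	if err != nil {
		panic(err)
	}
	// prints: vendor=vendor.com class=device name=myDevice
	fmt.Printf("vendor=%s class=%s name=%s\n", vendor, class, name)
}
```

With a helper like this, a plugin could recognize such entries in the device list, strip them from the spec, and apply the vendor file's edits instead of letting containerd reject the invalid path.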