Open rhatdan opened 1 year ago
@umohnani8 PTAL
A friendly reminder that this issue had no activity for 30 days.
just want to note that this support would be welcomed, I ran into this issue as noted here: https://github.com/containers/podman/issues/18266
Hello,
I would like to voice support in favor of this feature as well.
(damn, I wish I knew enough about Go and podman's codebase to implement this :cry:)
@ygalblum @umohnani8 PTAL. Anyone else (@fpoirotte?) interested in opening a PR?
I'm trying to understand this issue to see if I would be up to the task, and I have a few questions; I also want to make sure I understand what is going on here. Basically, we want to make `podman kube play ...` sensitive to the `spec.containers[*].resources.limits` entry in a PodSpec and automatically apply the devices listed there, the same way running `podman run --device ...` would?
One point of confusion I have is that in the original comment it says:
One way is to map devices into containers via volumes but this requires privileged containers.
Is this accurate, or does it actually depend on the device itself and whether or not elevated permissions are required to interact with it? When tracing through the code, I wasn't really seeing that the `--device` option was handled much differently than a normal `--volume`. In fact, unless I'm mistaken, `--device` is really just a prebaked `--volume` with some default options specified for the user (based on `pkg/specgen/generate/config_linux.go`).
`--device` works much differently in rootless and rootful containers. In rootful containers, it actually creates the device within the container's mount space with the correct major/minor number and labels the new device with the correct SELinux label. This requires CAP_MKNOD, so it is only available to rootful containers.
In rootless containers we just bind mount the device from the host into the container, which is all we can do. In some cases the device has an SELinux label that would prevent it from being used in the container, so you need to disable SELinux enforcement for the device. So in rootless mode there is little difference between `--device /dev/foobar` and `--volume /dev/foobar:/dev/foobar`.
In certain cases, UID/GID access to the device is not available to processes inside the container. For example, if you have group access based on being in the `eng` group: a container will not be in the `eng` group, since in rootless mode there is a user namespace that does not include `eng`. In this case you can leak the `eng` group access into the container using the `--group-add keep-groups` flag.
Okay I see. Here is how I understand this issue so far and I'm still tracing through the code so let me know if I am missing something as I explain my thoughts.
It seems to me that for this change I wouldn't actually need to worry too much about how the devices are set up by the underlying runtimes; instead, the task mostly just consists of:
1) Update how Podman understands the K8s `PodSpec.Containers[*].Resources.Limits` so that a K8s device-plugin-like name can be recognized
2) When handling the various K8s play objects (Pods, Deployments, DaemonSets), make sure the device resources in the updated PodSpec get passed into the SpecGen code before it is passed to the underlying runtime / `generate.MakeContainer`
As long as the information correctly makes it into the SpecGen, I think all of the privileged/unprivileged device-mapping magic is already taken care of for me.
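To make that idea concrete, here is a minimal, self-contained sketch (hypothetical names, not podman's actual specgen API) of picking device-like entries out of a limits map before handing them to a spec generator:

```go
package main

import (
	"fmt"
	"strings"
)

// specDevices stands in (hypothetically) for the part of podman's
// SpecGenerator that receives --device entries; the real type lives
// elsewhere in pkg/specgen.
type specDevices struct {
	Devices []string
}

// extractDeviceLimits scans a PodSpec's resources.limits map and returns
// the keys that look like extended (device-plugin style) resources, i.e.
// anything that is not a standard compute resource like cpu or memory.
func extractDeviceLimits(limits map[string]string) []string {
	standard := map[string]bool{"cpu": true, "memory": true, "ephemeral-storage": true}
	var devs []string
	for name := range limits {
		if standard[name] || !strings.Contains(name, "/") {
			continue // plain compute resources are handled elsewhere
		}
		devs = append(devs, name)
	}
	return devs
}

func main() {
	limits := map[string]string{
		"cpu":              "2",
		"memory":           "1Gi",
		"nvidia.com/gpu=0": "1",
	}
	sg := specDevices{Devices: extractDeviceLimits(limits)}
	fmt.Println(sg.Devices) // only the extended resource key survives
}
```

The heuristic here (a `/` in the resource name) mirrors how Kubernetes distinguishes extended resources from built-in ones, but the exact filtering podman would want is an open question.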
My next question would be: since the K8s device plugins work by registering a name with kubelet, and there is the whole notion of kubelet plus the RPC plugin provider that podman does not have:
I started trying to implement this, but I do think I need some clarification on my last question. Without that information, I feel it will be hard to modify the ResourceLimits in a way that makes sense.
In order to hit the

> For podman kube play it would be useful to also support this resources.limits to map devices. As this would allow to deploy the same pod spec on both Kubernetes and podman and it would be transparent to the user.

part of this issue, the resource limit type (specifically the resource name part) will need to carry extra meaning, so that something like `hardware-vendor.nvidia/gpus: 2` gives the podman runtime enough information, when generating the container spec, to resolve the name to devices on the system. My thought would be to allow the podman user to define a device map of some sort in the system podman configuration, so that `hardware-vendor.nvidia/gpus -> /dev/nvidia*` (maybe video cards aren't a great example but...)
Also, from what I understand, podman's configuration files are parsed with code from other repositories (https://github.com/containers/podman/blob/9c954739e9555c0940238f71ba3cc205deaa0e5e/docs/tutorials/rootless_tutorial.md?plain=1#L139C58-L139C94), so to me that means that in order to add the extra device-mapping information to podman, the configuration file structure would have to be modified in those other repos first.
Maybe I'm way off here but thats why I'm seeking clarification :)
@bblenard I think the CDI configuration would contain the necessary info to map the name to the device.
So for example, if we use CDI support from Nvidia:

```yaml
resources:
  limits:
    nvidia.com/gpu=0: 1
```
What's not yet entirely clear to me is how the CDI devices are reflected in the `limits:` section, as there seems to be an indirection via the device plugin. At least this is how I understand KEP-4009: Add CDI devices to device plugin API.
So I think the CDI name is not given directly; instead, the kubelet makes a request for a device type, e.g. `nvidia.com/gpu: 2`, and the device plugin would then return two unique CDI devices (e.g. `nvidia.com/gpu=0` and `nvidia.com/gpu=1`).
If we say that we allow directly specifying the unique name, as in the example above, we can map it 1:1 to the `--device` parameter. I'm not sure if this leads to compatibility problems with Kubernetes though.
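For reference, that 1:1 mapping would mostly come down to splitting the fully qualified CDI device name into its parts; a small sketch of that parsing (the helper name is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// parseCDIName splits a fully qualified CDI device name of the form
// "vendor/class=device" (e.g. "nvidia.com/gpu=0") into its parts.
// This follows the CDI naming convention; the function itself is a sketch.
func parseCDIName(s string) (vendor, class, device string, err error) {
	kind, dev, ok := strings.Cut(s, "=")
	if !ok {
		return "", "", "", fmt.Errorf("%q: missing '=' separator", s)
	}
	v, c, ok := strings.Cut(kind, "/")
	if !ok {
		return "", "", "", fmt.Errorf("%q: missing '/' in kind", s)
	}
	return v, c, dev, nil
}

func main() {
	v, c, d, err := parseCDIName("nvidia.com/gpu=0")
	fmt.Println(v, c, d, err)
}
```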
@bachp -- I need to familiarize myself with the things you referenced, just letting you know that I saw your message (finally ;))
Okay so update with what I currently understand about how these things relate. As @bachp pointed out the CDI seemingly contains all the information the runtime would need to take a CDI kind(?) and map it to a device path on the system. I also think I see what @bachp is saying with it not quite fitting into the limits section.
I could be wrong, but it looks like podman's type for `PodSpec.Containers[*].Resources` has diverged from Kubernetes' `Container.Resources` type. Kubernetes' `ResourceClaim` has an additional field that seems to contain the "name" described in the KEP-4009 referenced above.
I've tried to confirm my understanding by tracing how I believe Kubernetes handles the `CDIDevice` message that eventually gets passed to the container runtime.
So currently I would tentatively propose re-syncing Podman's `container.resources` type to match Kubernetes' and then passing that down to the podspec, similar to how I mentioned it earlier. Hopefully that makes sense :)
I'm not sure if additional support needs to be added so podman can handle CDI things (parsing the configs under `/etc/cdi/` and `/var/run/cdi`), but the Nvidia documentation @bachp linked indicates that podman can handle it.
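For the curious, resolving a CDI device name against one of those spec files is mostly JSON plumbing. A minimal sketch modeling only the CDI spec fields needed here (the example spec content is made up):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cdiSpec models a tiny slice of a CDI spec file (normally found under
// /etc/cdi or /var/run/cdi): just enough to map a device name to the
// host device nodes it injects.
type cdiSpec struct {
	Kind    string `json:"kind"`
	Devices []struct {
		Name           string `json:"name"`
		ContainerEdits struct {
			DeviceNodes []struct {
				Path string `json:"path"`
			} `json:"deviceNodes"`
		} `json:"containerEdits"`
	} `json:"devices"`
}

// parseCDISpec decodes a raw CDI spec document.
func parseCDISpec(raw []byte) (*cdiSpec, error) {
	var s cdiSpec
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

// hostPaths returns the device node paths for the named device, or nil.
func (s *cdiSpec) hostPaths(name string) []string {
	for _, d := range s.Devices {
		if d.Name != name {
			continue
		}
		var paths []string
		for _, n := range d.ContainerEdits.DeviceNodes {
			paths = append(paths, n.Path)
		}
		return paths
	}
	return nil
}

func main() {
	raw := `{"kind":"example.com/tty","devices":[{"name":"tty0",
	  "containerEdits":{"deviceNodes":[{"path":"/dev/ttyS0"}]}}]}`
	spec, err := parseCDISpec([]byte(raw))
	if err != nil {
		panic(err)
	}
	fmt.Println(spec.hostPaths("tty0")) // [/dev/ttyS0]
}
```

In practice podman delegates this to the CDI library in container-device-interface rather than hand-rolling it, so this is only meant to show the shape of the data.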
phew... with all that being said I'd love some input on all that
@bblenard Podman already supports CDI devices. You can pass a CDI device name via the `--device` command line flag. I tested this with both the nvidia one and my own CDI spec, passing in a TTY device.
It also works in podman-compose via the `devices` section:

```yaml
devices:
  - nvidia.com/gpu=all
```
I'm gonna @ @rhatdan b/c he is one of the Containers folks that had some activity on this issue previously. Before I dig in and start on this issue properly, I just want to make sure someone official is able to add their 2 cents.
Now that we've sorted out the podman support side, the major question (as I see it) is just how to implement this into the Podman resource spec. Assuming I understand everything correctly, I figure we can either:
1) Grab just the fields we need from the K8s `ResourceClaim` data structure and mirror them in the Podman version (probably the easiest, unless we have good test coverage in that part of the codebase)
2) Update the entire `Resources` data structure to match up with the current K8s version of it (more "unrelated" work, but potentially the right way to do it... I'm not sure)
This SGTM.
Okay I have a way forward I think. I'll give this feature my best shot :)
Update:
Still working on this just have been busy lately :)
I have code that I think works in a way that makes sense; I want to clean it up a bit though. Once it's in a halfway decent state I'll probably push it up to my fork of podman.
@bblenard If you'd like to push something early, I would love to review it and give some feedback.
@bachp -- I added some code here: https://github.com/bblenard/podman/tree/issues-17833-draft
The bulk of the code I wrote is here: pkg/k8s.io/api/resource/dynamic_resources_podman_types.go
My idea was to make `podman kube play` feel like what I understand Kubernetes will be doing with the dynamic device stuff, based on KEP-4009 and KEP-3063. Obviously the difference here is that Kubernetes has a whole controller ecosystem that would typically handle the details when K8s ingests those types, whereas Podman does not.
My "workaround" for this was to add an additional argument to the `playKube*` functions (`dynamicDeviceManager`); this type is defined in `pkg/k8s.io/api/resource/dynamic_resources_podman_types.go` and is basically a bare-bones struct that ties together these K8s types that would normally be tracked by a resource controller. Other than that, once the ResourceClaims are resolved to either a simple device path (`/dev/...`) or a CDI device (`nvidia.com/gpu=1`), those devices are just injected into the `specGen.Devices` field, where podman handles them the way it does any other `--device` argument (unless I made some mistake ;))
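As a rough illustration of that last resolution step (names are illustrative, not the draft branch's actual API), classifying a resolved claim into a plain path versus a CDI name could look like:

```go
package main

import (
	"fmt"
	"strings"
)

// resolveClaim classifies a resolved resource claim as either a plain
// device path (/dev/...) or a CDI name (vendor.com/class=dev); both
// would end up in the same Devices list that --device populates.
func resolveClaim(ref string) (device string, isCDI bool, err error) {
	switch {
	case strings.HasPrefix(ref, "/dev/"):
		return ref, false, nil
	case strings.Contains(ref, "="):
		return ref, true, nil
	default:
		return "", false, fmt.Errorf("unrecognized device reference %q", ref)
	}
}

func main() {
	for _, ref := range []string{"/dev/kmsg", "nvidia.com/gpu=1"} {
		dev, cdi, _ := resolveClaim(ref)
		fmt.Println(dev, cdi)
	}
}
```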
Let me know what you think. I was finding it hard to keep things straight in my mind while reviewing KEP-3063, especially following the YAML examples.
PS: One big thing I have to do still is write tests but I have to figure that out still
Any planned movement on this? I recently started down the path of setting up a Jellyfin server with HW acceleration (via nvidia and CDI) and prefer the descriptiveness of the kube files for my deployments. It's interesting that `--device` is easily accessible via run commands, but the functionality is missing in kube. Anywhere extra hands can be used? Or are we just waiting to hear on some form of direction?
@bblenard i just took a look at your code and rebased it onto main feel free to just take the commits from here: https://github.com/robertgzr/podman/tree/issues-17833-draft--main
if i understand correctly, you implemented support for DRA as documented here: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/, which looks like this when expressed in YAML:
```yaml
apiVersion: simpledevice.resource.podman.io/v1
kind: ResourceClaimParameters
metadata:
  name: kmsg-parameters
spec:
  hostpath: /dev/kmsg
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: kmsg-template
spec:
  resourceClassName: PodmanResourceClass
  parametersRef: kmsg-parameters
---
apiVersion: v1
kind: Pod
[...]
      resources:
        claims:
          - name: kmsg
  resourceClaims:
    - name: kmsg
      resourceClaimTemplateName: kmsg-template
```
That is quite verbose compared to what I imagined (using a podman-specific CDI string):

```yaml
resources:
  limits:
    io.podman.device/kmsg: 1
```
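A quick sketch of how that compact form could be interpreted; the `io.podman.device/` prefix handling and the count-means-repetition semantics are assumptions about the proposal, not an implemented podman feature:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandPodmanDeviceLimits interprets limits keyed as
// "io.podman.device/<name>: <count>", returning <name> repeated
// <count> times so each request maps to one device attachment.
func expandPodmanDeviceLimits(limits map[string]string) ([]string, error) {
	const prefix = "io.podman.device/"
	var out []string
	for key, val := range limits {
		if !strings.HasPrefix(key, prefix) {
			continue // ordinary resources (cpu, memory, ...) untouched
		}
		n, err := strconv.Atoi(val)
		if err != nil {
			return nil, fmt.Errorf("%s: %w", key, err)
		}
		name := strings.TrimPrefix(key, prefix)
		for i := 0; i < n; i++ {
			out = append(out, name)
		}
	}
	return out, nil
}

func main() {
	devs, _ := expandPodmanDeviceLimits(map[string]string{
		"io.podman.device/kmsg": "1",
		"memory":                "1Gi",
	})
	fmt.Println(devs) // [kmsg]
}
```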
the bits that support the configuration as shared by @bachp https://github.com/containers/podman/issues/17833#issuecomment-1866325061 are also still missing, correct?
Discussed in https://github.com/containers/podman/discussions/14934