containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Support device mapping in kube play similar to Kubernetes device plugins #17833

Open rhatdan opened 1 year ago

rhatdan commented 1 year ago

Discussed in https://github.com/containers/podman/discussions/14934

Originally posted by **bachp** July 14, 2022

Kubernetes provides ways to map devices into pods and containers. One way is to map devices into containers via [volumes](https://stackoverflow.com/questions/42697985/kubernetes-node-device-port-usb-mapping-to-pod-or-swarm-service-device-mapp), but this requires privileged containers. The more flexible way is to add devices via [Kubernetes Device Plugins](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/), which abstract the provisioning and mapping of devices to containers. When specifying a Pod, you only need to add an abstract name for the hardware, e.g. `hardware-vendor.example/foo`, and the amount of it. The device plugin then takes care of making the device available to containers. In YAML this looks like:

```yaml
resources:
  limits:
    hardware-vendor.example/foo: 2
```

For `podman kube play` it would be useful to also support `resources.limits` for mapping devices, as this would allow deploying the same pod spec on both Kubernetes and Podman, transparently to the user. However, I don't think it makes sense to implement the full device plugin interface, since it assumes a daemon (kubelet) is running. For Podman, I think a better option is a kind of hook mechanism that receives `resources.limits` and outputs the required `--device` parameters, applied before the pod is run.
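The hook mechanism proposed above could be sketched roughly as follows. Everything here is illustrative: the `ResolveDeviceFlags` helper and the mapping table are invented for this sketch, not part of Podman.

```go
package main

import (
	"fmt"
	"sort"
)

// ResolveDeviceFlags translates device-plugin-style resource limits
// (e.g. "hardware-vendor.example/foo": 2) into --device arguments,
// using a user-supplied mapping from resource name to host device paths.
// Hypothetical helper; the mapping format is an assumption.
func ResolveDeviceFlags(limits map[string]int, mapping map[string][]string) ([]string, error) {
	names := make([]string, 0, len(limits))
	for name := range limits {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic flag order

	var flags []string
	for _, name := range names {
		count := limits[name]
		devices, ok := mapping[name]
		if !ok {
			return nil, fmt.Errorf("no device mapping for resource %q", name)
		}
		if count > len(devices) {
			return nil, fmt.Errorf("requested %d of %q but only %d available", count, name, len(devices))
		}
		for _, dev := range devices[:count] {
			flags = append(flags, "--device="+dev)
		}
	}
	return flags, nil
}

func main() {
	flags, err := ResolveDeviceFlags(
		map[string]int{"hardware-vendor.example/foo": 2},
		map[string][]string{"hardware-vendor.example/foo": {"/dev/foo0", "/dev/foo1", "/dev/foo2"}},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(flags)
}
```

The idea is that such a hook would run once per container before the pod starts, turning the abstract resource names into concrete `--device` arguments.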
rhatdan commented 1 year ago

@umohnani8 PTAL

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

andrew-kennedy commented 1 year ago

Just want to note that this support would be welcomed; I ran into this issue as noted here: https://github.com/containers/podman/issues/18266

fpoirotte commented 1 year ago

Hello,

I would like to voice support in favor of this feature as well.

(damn, I wish I knew enough about Go and podman's codebase to implement this :cry:)

rhatdan commented 1 year ago

@ygalblum @umohnani8 Anyone else PTAL? @fpoirotte, interested in opening a PR?

bblenard commented 1 year ago

I'm trying to understand this issue to see if I would be up to the task, and I have a few questions to make sure I understand what is going on here. Basically, we want to make `podman kube play ...` sensitive to the `spec.containers[*].resources.limits` entry in a PodSpec, and automatically apply the devices listed there the same way that running `podman run --device ...` would?

One point of confusion I have is that in the original comment it says:

One way is to map devices into containers via volumes but this requires privileged containers.

Is this accurate, or does it actually depend on the device itself and whether or not elevated permissions are required to interact with it? When tracing through the code I wasn't really seeing that the `--device` option was handled much differently than a normal `--volume`. In fact, unless I'm mistaken, `--device` is really just a prebaked `--volume` with some default options specified for the user (based on `pkg/specgen/generate/config_linux.go`).

rhatdan commented 1 year ago

`--device` works much differently in rootless and rootful containers. In rootful containers, it actually creates the device within the container's mount namespace with the correct major/minor numbers and labels the new device with the correct SELinux label. This requires CAP_MKNOD, so it is only available to rootful containers.

In rootless containers we just bind mount the device from the host into the container, which is all we can do. In some cases the device has an SELinux label that would prevent it from being used in the container, so you need to disable SELinux enforcement for the device. So in rootless mode there is little difference between `--device /dev/foobar` and `--volume /dev/foobar:/dev/foobar`.

In certain cases, UID/GID access to the device is not available to processes inside the container. For example, if you have group access based on membership in the `eng` group, a container will not be in the `eng` group, since in rootless mode the user namespace does not include `eng`. In this case you can leak the `eng` group access into the container using the `--group-add keep-groups` flag.

bblenard commented 1 year ago

Okay, I see. Here is how I understand this issue so far; I'm still tracing through the code, so let me know if I am missing something as I explain my thoughts.

It seems to me that for this change I wouldn't actually need to worry too much about how the devices are set up by the underlying runtimes; instead the task mostly consists of:

1) Update how Podman understands the K8s `PodSpec.Containers[*].Resources.Limits` so that a K8s device-plugin-like name can be recognized.
2) When handling the various K8s play objects (Pods, Deployments, DaemonSets), make sure the device resources in the updated PodSpec get passed into the SpecGen code before it is passed to the underlying runtime / `generate.MakeContainer`.

As long as the information correctly makes it into the SpecGen, I think all of the privileged/unprivileged device-mapping magic is already taken care of for me.
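The recognition part of step (1) could amount to splitting extended-resource names out of the limits map. A minimal sketch, assuming the simple rule that device-plugin resource names always contain a `/` domain prefix (the helper name is made up):

```go
package main

import (
	"fmt"
	"strings"
)

// splitDeviceResources separates device-plugin-style extended resources
// (keys with a "/" domain prefix, e.g. "hardware-vendor.example/foo")
// from standard resources such as cpu and memory. Hypothetical helper.
func splitDeviceResources(limits map[string]string) (standard, devices map[string]string) {
	standard = map[string]string{}
	devices = map[string]string{}
	for name, qty := range limits {
		if strings.Contains(name, "/") {
			devices[name] = qty
		} else {
			standard[name] = qty
		}
	}
	return standard, devices
}

func main() {
	std, dev := splitDeviceResources(map[string]string{
		"cpu":                         "500m",
		"memory":                      "128Mi",
		"hardware-vendor.example/foo": "2",
	})
	fmt.Println(len(std), len(dev))
}
```

The `devices` map would then be the only part that needs new handling on the way into the SpecGen; the `standard` map keeps flowing through the existing resource-limit code.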

My next question would be: since the K8s device plugins work by registering a name with kubelet, and Podman does not have the whole notion of kubelet + the RPC plugin provider:

bblenard commented 1 year ago

I started trying to implement this, but I do think I need some clarification on my last question. Without that information, I feel it will be hard to modify the ResourceLimits in a way that makes sense.

In order to hit the

For podman kube play it would be useful to also support this resources.limits to map devices. As this would allow to deploy the same pod spec on both Kubernetes and podman and it would be transparent to the user.

part of this issue, the resource limit type (specifically the resource name part) will need to carry extra meaning, so that something like

`hardware-vendor.nvidia/gpus: 2`

gives the Podman runtime enough information when generating the container spec to resolve the name to devices on the system. My thought would be to allow the Podman user to define a device map of some sort in the system Podman configuration, so that `hardware-vendor.nvidia/gpus` maps to `/dev/nvidia*` (maybe video cards aren't a great example, but...).
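To make the idea concrete, a device map like the one described might look something like the following. This stanza is entirely hypothetical: no such table exists in Podman's configuration today, and both the table name and key format are invented for illustration.

```toml
# Hypothetical [kube.device_map] table (not a real containers.conf section):
# resource name as used in resources.limits -> host device paths to map in.
[kube.device_map]
"hardware-vendor.nvidia/gpus" = ["/dev/nvidia0", "/dev/nvidia1"]
```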

Also, from what I understand, Podman's configuration files are parsed with code from other repositories (https://github.com/containers/podman/blob/9c954739e9555c0940238f71ba3cc205deaa0e5e/docs/tutorials/rootless_tutorial.md?plain=1#L139C58-L139C94), so to me that means that in order to add the extra device-mapping information to Podman, the configuration file structure would have to be modified in those other repos first.

Maybe I'm way off here, but that's why I'm seeking clarification :)

bachp commented 11 months ago

@bblenard I think the CDI configuration would contain the necessary info to map the name to the device.

So for example, if we use CDI support from Nvidia

```yaml
resources:
  limits:
    nvidia.com/gpu=0: 1
```

What's not yet entirely clear to me is how the CDI devices are reflected in the `limits:` section, as there seems to be an indirection via the device plugin. At least this is how I understand KEP-4009: Add CDI devices to device plugin API. So I think the CDI name is not given directly; rather, the kubelet makes a request for a device type, e.g. `nvidia.com/gpu: 2`, and the device plugin then returns two unique CDI devices (e.g. `nvidia.com/gpu=0` and `nvidia.com/gpu=1`).

If we say that we allow directly specifying the unique name, as in the example above, we can map it 1:1 to the `--device` parameter. I'm not sure if this leads to compatibility problems with Kubernetes though.
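Mapping a fully-qualified CDI device name 1:1 onto `--device` would mostly be a parsing exercise, since per the CDI specification a qualified name has the form `vendor/class=name`. A rough sketch of that split (the helper name is invented; a real implementation would presumably reuse the container-device-interface library's parser rather than hand-rolling this):

```go
package main

import (
	"fmt"
	"strings"
)

// parseQualifiedCDIName splits a fully-qualified CDI device name of the
// form "vendor.com/class=name" (e.g. "nvidia.com/gpu=0") into its parts.
// Illustrative only; error handling is deliberately minimal.
func parseQualifiedCDIName(qualified string) (vendor, class, name string, err error) {
	kind, name, ok := strings.Cut(qualified, "=")
	if !ok || name == "" {
		return "", "", "", fmt.Errorf("%q is not a qualified CDI device name", qualified)
	}
	vendor, class, ok = strings.Cut(kind, "/")
	if !ok || vendor == "" || class == "" {
		return "", "", "", fmt.Errorf("%q has no vendor/class kind", qualified)
	}
	return vendor, class, name, nil
}

func main() {
	v, c, n, err := parseQualifiedCDIName("nvidia.com/gpu=0")
	if err != nil {
		panic(err)
	}
	fmt.Println(v, c, n)
}
```

If the name parses as a qualified CDI device, it could be handed to the existing `--device` path unchanged; otherwise it would fall back to whatever name-to-device mapping Podman defines.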

bblenard commented 11 months ago

@bachp -- I need to familiarize myself with the things you referenced, just letting you know that I saw your message ( finally ;) )

bblenard commented 10 months ago

Okay, an update on what I currently understand about how these things relate. As @bachp pointed out, the CDI spec seemingly contains all the information the runtime would need to take a CDI kind and map it to a device path on the system. I also think I see what @bachp is saying about it not quite fitting into the limits section.

I could be wrong, but it looks like Podman's type for `PodSpec.Containers[*].Resources` has diverged from Kubernetes' `Container.Resources` type. Kubernetes' `ResourceClaim` has an additional field that seems to contain the "name" described in KEP-4009, referenced above.

I've tried to confirm my understanding by tracing how I believe Kubernetes handles the CDIDevice message that eventually gets passed to the container runtime.

So currently I would tentatively propose re-syncing Podman's `container.resources` type to match Kubernetes and then passing that down to the podspec, similar to how I mentioned earlier. Hopefully that makes sense :)

I'm not sure if additional support needs to be added so Podman can handle CDI (parsing the configs under `/etc/cdi/` and `/var/run/cdi`), but the Nvidia documentation @bachp linked indicates that Podman can already handle it.

Phew... with all that being said, I'd love some input on all of it.

bachp commented 10 months ago

@bblenard Podman already supports CDI devices. You can pass a CDI device name via the `--device` command line flag. I tested this with both the Nvidia CDI config and my own CDI config passing in a TTY device.

It also works in podman-compose via the devices section:

```yaml
devices:
  - nvidia.com/gpu=all
```

bblenard commented 10 months ago

I'm gonna tag @rhatdan b/c he is one of the Containers folks who had some activity on this issue previously. Before I dig in and start on this issue properly, I just want to make sure someone official is able to add their 2 cents.

Now that we've sorted out the Podman support side, the major question (as I see it) is just how to implement this in the Podman resource spec. Assuming I understand everything correctly, I figure we can either:

1) Grab just the fields we need from the K8s `ResourceClaim` data structure and mirror them in the Podman version (probably the easiest, unless we have good test coverage in that part of the codebase).
2) Update the entire `Resources` data structure to match up with the current K8s version (more "unrelated" work, but potentially the right way to do it... I'm not sure).

rhatdan commented 10 months ago

This SGTM.

bblenard commented 9 months ago

Okay I have a way forward I think. I'll give this feature my best shot :)

bblenard commented 8 months ago

Update:

Still working on this just have been busy lately :)

I have code that I think works in a way that makes sense, but I want to clean it up a bit. Once it's in a halfway decent state I'll probably push it up to my fork of podman.

bachp commented 8 months ago

@bblenard If you'd like to push something early, I would love to review it and give some feedback.

bblenard commented 8 months ago

@bachp -- I added some code here: https://github.com/bblenard/podman/tree/issues-17833-draft

The bulk of the code I wrote is here: `pkg/k8s.io/api/resource/dynamic_resources_podman_types.go`

My idea was to make `podman kube play` feel like what I understand Kubernetes will be doing with dynamic device stuff, based on KEP-4009 and KEP-3063. Obviously the difference here is that Kubernetes has a whole controller ecosystem that would typically handle the details when K8s ingests those types, whereas Podman does not.

My workaround was to add an additional argument to the `playKube*` functions (`dynamicDeviceManager`). This type is defined in `pkg/k8s.io/api/resource/dynamic_resources_podman_types.go` and is basically a bare-bones struct that ties together the K8s types that would normally be tracked by a resource controller. Other than that, once the `ResourceClaims` are resolved to either a simple device path (`/dev/....`) or a CDI device (`nvidia.com/gpu=1`), those devices are just injected into the `specGen.Devices` field, where Podman handles them as it does any other `--device` argument (unless I made some mistake ;) )
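The flow described above could be pictured with a toy model like this. All type and field names here are simplified stand-ins invented for illustration; they are not Podman's actual API or the draft's exact code.

```go
package main

import "fmt"

// Simplified stand-ins for the draft's types; not Podman's real API.
type ResourceClaim struct {
	Name     string
	HostPath string // resolved device path or CDI device name
}

type SpecGenerator struct {
	Devices []string
}

// dynamicDeviceManager mimics the draft's helper: it tracks resolved
// claims and injects the results into the spec's Devices list, where
// they are handled like any other --device argument.
type dynamicDeviceManager struct {
	claims map[string]ResourceClaim
}

func (m *dynamicDeviceManager) apply(spec *SpecGenerator, claimNames []string) error {
	for _, name := range claimNames {
		claim, ok := m.claims[name]
		if !ok {
			return fmt.Errorf("unresolved resource claim %q", name)
		}
		spec.Devices = append(spec.Devices, claim.HostPath)
	}
	return nil
}

func main() {
	mgr := &dynamicDeviceManager{claims: map[string]ResourceClaim{
		"kmsg": {Name: "kmsg", HostPath: "/dev/kmsg"},
	}}
	spec := &SpecGenerator{}
	if err := mgr.apply(spec, []string{"kmsg"}); err != nil {
		panic(err)
	}
	fmt.Println(spec.Devices)
}
```

The point of the model is the hand-off: everything upstream of `specGen.Devices` is new, and everything downstream is Podman's existing device handling.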

Let me know what you think. I found it hard to keep things straight in my mind while reviewing KEP-3063, especially following the YAML examples.

PS: One big thing I still have to do is write tests, but I have to figure that out first.

CircuitCipher commented 2 weeks ago

Any planned movement on this? I recently started down the path of setting up a Jellyfin server with HW acceleration (via Nvidia and CDI) and prefer the descriptiveness of the kube files for my deployments. It's interesting that `--device` is easily accessible via run commands, but the functionality is missing in kube. Is there anywhere extra hands can be used? Or are we just waiting to hear on some form of direction?

robertgzr commented 1 week ago

@bblenard I just took a look at your code and rebased it onto main. Feel free to just take the commits from here: https://github.com/robertgzr/podman/tree/issues-17833-draft--main

If I understand correctly, you implemented support for DRA as documented here: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/, which looks like this when expressed in YAML:

```yaml
apiVersion: simpledevice.resource.podman.io/v1
kind: ResourceClaimParameters
metadata:
  name: kmsg-parameters
spec:
  hostpath: /dev/kmsg
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: kmsg-template
spec:
  resourceClassName: PodmanResourceClass
  parametersRef: kmsg-parameters
---
apiVersion: v1
kind: Pod
[...]
resources:
  claims:
  - name: kmsg
resourceClaims:
- name: kmsg
  resourceClaimTemplateName: kmsg-template
```

That is quite verbose compared to what I imagined (using a podman-specific CDI string):

```yaml
resources:
  limits:
    io.podman.device/kmsg: 1
```

The bits that support the configuration shared by @bachp in https://github.com/containers/podman/issues/17833#issuecomment-1866325061 are also still missing, correct?