NVIDIA / kubevirt-gpu-device-plugin

NVIDIA k8s device plugin for Kubevirt
BSD 3-Clause "New" or "Revised" License

Add GPU name alias feature #96

Open jjacobelli opened 4 months ago

jjacobelli commented 4 months ago

This PR adds a feature for creating GPU name aliases. With it, a user can change the device name that is announced to the kubelet. The feature is configured through a config file introduced by this PR. The default location of the file is /etc/kubevirt-gpu-device-plugin/config.yaml, which can be overridden with the -config option. The configuration file uses the following format:

GPUAliases:
  - GPUName: <oldDeviceName>
    alias: <newDeviceName>

For example, if a user wants to announce TU104_GEFORCE_RTX_2080 GPUs as rtx2080, the configuration file should look like this:

GPUAliases:
  - GPUName: "TU104_GEFORCE_RTX_2080"
    alias: "rtx2080"

The device plugin will then announce devices that match TU104_GEFORCE_RTX_2080 as nvidia.com/rtx2080.
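
The lookup this config implies can be sketched in Go as follows. This is a minimal illustration, not the PR's code: the `GPUAlias` type, the `resourceName` helper, the exact-match comparison, and the fall-back to the original device name when no alias matches are all assumptions. Only the `nvidia.com/` prefix and the RTX 2080 example come from the description above; `SOME_OTHER_GPU` is a placeholder name.

```go
package main

import "fmt"

// GPUAlias mirrors one entry of the GPUAliases list in the config file.
// The Go type and field names are hypothetical; only the YAML keys
// (GPUName, alias) come from the PR description.
type GPUAlias struct {
	GPUName string
	Alias   string
}

// resourceName returns the resource name to announce for a device.
// Assumptions: names are compared exactly, and a device without a
// matching alias keeps its original name.
func resourceName(deviceName string, aliases []GPUAlias) string {
	for _, a := range aliases {
		if a.GPUName == deviceName {
			return "nvidia.com/" + a.Alias
		}
	}
	return "nvidia.com/" + deviceName
}

func main() {
	aliases := []GPUAlias{{GPUName: "TU104_GEFORCE_RTX_2080", Alias: "rtx2080"}}
	fmt.Println(resourceName("TU104_GEFORCE_RTX_2080", aliases)) // nvidia.com/rtx2080
	fmt.Println(resourceName("SOME_OTHER_GPU", aliases))         // nvidia.com/SOME_OTHER_GPU
}
```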

cdesiniotis commented 4 months ago

@jjacobelli thanks for contributing this PR! I think resource renaming could greatly improve the user experience for pods / VMs which request these resources. Having to know the exact GPU model string ahead of time is a bit cumbersome.

I would argue, however, that generic resource names (that are not tied to a particular GPU model) would provide the most value. For example, nvidia.com/passthrough-gpu for passthrough GPUs and nvidia.com/vgpu for vGPU devices. The exact model name / vGPU profile should be added as a label to the node. If a user wants a particular GPU model, they can configure their nodeSelector accordingly. E.g.

nodeSelector:
  "nvidia.com/passthrough-gpu.model": "TU104_GEFORCE_RTX_2080" 
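
As a rough Go sketch of this generic-name mode: the device class alone picks the resource name, and the model or profile becomes a node label. The `Device` type, the `advertise` helper, and the `nvidia.com/vgpu.profile` label key are hypothetical; only the two resource names and the `nvidia.com/passthrough-gpu.model` key are taken from this comment.

```go
package main

import "fmt"

// GPUKind distinguishes the two device classes mentioned above.
type GPUKind int

const (
	Passthrough GPUKind = iota
	VGPU
)

// Device is a hypothetical description of one discovered GPU.
type Device struct {
	Kind  GPUKind
	Model string // GPU model name, or vGPU profile for VGPU devices
}

// advertise returns the generic resource name for a device and the node
// label that would carry its model or profile.
// Assumption: the vGPU label key is invented for this sketch.
func advertise(d Device) (string, map[string]string) {
	switch d.Kind {
	case VGPU:
		return "nvidia.com/vgpu", map[string]string{"nvidia.com/vgpu.profile": d.Model}
	default:
		return "nvidia.com/passthrough-gpu", map[string]string{"nvidia.com/passthrough-gpu.model": d.Model}
	}
}

func main() {
	resource, labels := advertise(Device{Kind: Passthrough, Model: "TU104_GEFORCE_RTX_2080"})
	fmt.Println(resource)
	fmt.Println(labels["nvidia.com/passthrough-gpu.model"])
}
```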

@jjacobelli @rthallisey I am interested in hearing your thoughts on this. Even if having generic resource names is not the default behavior of the plugin, it would be great if we could configure the plugin to operate in this mode.

@jjacobelli Just as an aside, a similar resource renaming feature was implemented in our k8s-device-plugin. You can read about it here: https://docs.google.com/document/d/1dL67t9IqKC2-xqonMi6DV7W2YNZdkmfX7ibB6Jb-qmk/edit. The feature was implemented, but it is actually disabled entirely in the code as our product team wasn't happy with putting arbitrary resource names in the hands of users. Regardless, I think we could align the API and implementation between the two plugins for resource renaming (if we do decide to add resource renaming to this plugin).

cc @tariq1890 @zvonkok @elezar

rthallisey commented 4 months ago

@cdesiniotis yes, we should align.

@jjacobelli, I don't think your config leaves us much room for expansion. I was thinking something like:

[
  {
    gpu: "1e82",
    gpuProfile: {
      alias: "rtx2080",
    },
  },
  {
    gpu: "1e87",
    gpuProfile: {
      alias: "rtx2080",
    },
  },
]
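
One way to read this layout, sketched in Go: each entry is keyed by a GPU identifier (the values look like PCI device IDs, though the comment does not say so), and several identifiers can share one alias. The `GPUEntry` and `GPUProfile` types and the `aliasIndex` helper are hypothetical names for illustration; the nested `gpuProfile` struct is where further per-GPU settings could be added later.

```go
package main

import "fmt"

// GPUProfile holds per-GPU settings. Only alias appears in the proposal;
// nesting it leaves room for more fields.
type GPUProfile struct {
	Alias string
}

// GPUEntry mirrors one element of the proposed list (hypothetical Go names).
type GPUEntry struct {
	GPU        string // identifier as written in the config, e.g. "1e82"
	GPUProfile GPUProfile
}

// aliasIndex builds an identifier-to-alias map so that lookups during
// device discovery do not have to scan the list.
func aliasIndex(entries []GPUEntry) map[string]string {
	idx := make(map[string]string, len(entries))
	for _, e := range entries {
		idx[e.GPU] = e.GPUProfile.Alias
	}
	return idx
}

func main() {
	idx := aliasIndex([]GPUEntry{
		{GPU: "1e82", GPUProfile: GPUProfile{Alias: "rtx2080"}},
		{GPU: "1e87", GPUProfile: GPUProfile{Alias: "rtx2080"}},
	})
	// Both identifiers resolve to the same alias.
	fmt.Println(idx["1e82"], idx["1e87"])
}
```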