NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0
1.74k stars 281 forks

Enhancement validateGPUResource method #829

Open lengrongfu opened 1 month ago

lengrongfu commented 1 month ago

This method checks the node for only two resource prefixes, nvidia.com/mig- and nvidia.com/gpu. However, you can customize the resourceName when deploying the device-plugin component, such as when using MPS. With a custom name like the one in the config below, the check won't work as expected.

version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/mps-gpu
      replicas: 10

https://github.com/NVIDIA/gpu-operator/blob/d4316a415bbd684ce8416a88042305fc1a093aa4/validator/main.go#L1240C18-L1240C37
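
To illustrate the mismatch, here is a minimal sketch that assumes the check is a plain prefix match against the two prefixes named above. The constant and function names are hypothetical and are not taken from the validator source.

```go
package main

import (
	"fmt"
	"strings"
)

// The two prefixes described in this issue. The constant names are
// hypothetical, chosen only for this illustration.
const (
	migResourcePrefix = "nvidia.com/mig-"
	gpuResourcePrefix = "nvidia.com/gpu"
)

// matchesKnownPrefix reports whether name starts with either prefix.
func matchesKnownPrefix(name string) bool {
	return strings.HasPrefix(name, migResourcePrefix) ||
		strings.HasPrefix(name, gpuResourcePrefix)
}

func main() {
	names := []string{
		"nvidia.com/gpu",
		"nvidia.com/mig-1g.5gb",
		"nvidia.com/mps-gpu", // custom name from the config above
	}
	for _, name := range names {
		fmt.Printf("%s -> %t\n", name, matchesKnownPrefix(name))
	}
}
```

Under that assumption, nvidia.com/mps-gpu matches neither prefix, so a node that advertises only this resource would fail validation.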

We would like to add an availablePrefixResourceName method that checks only for the nvidia.com prefix, because the device-plugin uses a fixed prefix: https://github.com/NVIDIA/k8s-device-plugin/blob/5144aa02c97b4aba8c0d118ccee9748834458d21/api/config/v1/consts.go#L25

func (p *Plugin) availablePrefixResourceName(resources v1.ResourceList) v1.ResourceName {
    for resourceName, quantity := range resources {
        if strings.HasPrefix(string(resourceName), "nvidia.com") && quantity.Value() >= 1 {
            log.Debugf("Found GPU resource name %s quantity %d", resourceName, quantity.Value())
            return resourceName
        }
    }
    return ""
}
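
A self-contained variant of the same idea can be sketched as follows. It is an illustration only: a map[string]int64 stands in for v1.ResourceList, and the names firstAvailableNvidiaResource and nvidiaResourcePrefix are hypothetical. It differs from the proposal in two ways. The prefix includes the trailing slash, so a different domain that merely begins with nvidia.com does not match. And because Go does not specify map iteration order, the matching names are sorted so the result is the same on every run.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// nvidiaResourcePrefix includes the trailing slash so that an unrelated
// domain such as "nvidia.community/x" is not treated as a match.
const nvidiaResourcePrefix = "nvidia.com/"

// firstAvailableNvidiaResource returns the lexicographically smallest
// resource name that has the prefix and a quantity of at least 1,
// or "" if no resource qualifies. The map is a stand-in for
// v1.ResourceList (an assumption made to keep the sketch runnable).
func firstAvailableNvidiaResource(resources map[string]int64) string {
	names := make([]string, 0, len(resources))
	for name, qty := range resources {
		if strings.HasPrefix(name, nvidiaResourcePrefix) && qty >= 1 {
			names = append(names, name)
		}
	}
	if len(names) == 0 {
		return ""
	}
	sort.Strings(names) // map iteration order is unspecified in Go
	return names[0]
}

func main() {
	capacity := map[string]int64{
		"cpu":                8,
		"nvidia.com/gpu":     0, // present but no capacity
		"nvidia.com/mps-gpu": 10,
		"nvidia.community/x": 1, // different domain, must not match
	}
	fmt.Println(firstAvailableNvidiaResource(capacity)) // prints "nvidia.com/mps-gpu"
}
```

If the validator only needs to know whether any such resource exists, which name is returned matters little, but a deterministic result keeps the debug log stable.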

lengrongfu commented 1 month ago

If you agree, I will submit the code to the GitLab repository.

cdesiniotis commented 1 month ago

> However, you can customize the resourceName when deploying the device-plugin component, such as when using MPS.

Can you clarify how you are changing the resource name? Resource renaming is currently not supported in the device-plugin. Additionally, when enabling MPS, the resource name used is still nvidia.com/gpu.

lengrongfu commented 1 month ago

> > However, you can customize the resourceName when deploying the device-plugin component, such as when using MPS.
>
> Can you clarify how you are changing the resource name? Resource renaming is currently not supported in the device-plugin. Additionally, when enabling MPS, the resource name used is still nvidia.com/gpu.

Oh, I have looked at the code. We forked the device-plugin a long time ago for our own development, so in our fork the resource name can still be modified.

lengrongfu commented 1 month ago

@cdesiniotis For historical reasons, our resource names can be modified, but they are still prefixed with nvidia.com. Could gpu-operator be made compatible with this situation?