Question: how to select a specific GPU type? Can I use a different name for the resources, other than `nvidia.com/gpu`?

abravalheri commented 3 months ago

Hello, I was trying to follow the documentation at https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html#applying-multiple-node-specific-configurations and figure out how to deploy workloads on specific GPUs.

For example, let's assume that I have followed the docs using the following settings:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-fine
data:
  a100-40gb: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 8
  tesla-t4: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

How do I write my pod specification so that it gets scheduled on the tesla-t4 instead of the a100-40gb? If I simply write resources: {limits: {nvidia.com/gpu: 1}}, the pod will get scheduled on whichever GPU is available, right? How can I specify which one I want to use?

As a wild guess, I tried using a different name for the resource (e.g. nvidia.com/t4-ts4), but it did not seem to work. So I imagine there is a different mechanism for that...

Is there any documentation that explains how to achieve that?

cdesiniotis commented 3 months ago

GPU Feature Discovery, a daemonset deployed by GPU Operator, will label your GPU nodes with GPU-related information. Describe your GPU nodes and you will see a number of labels with the nvidia.com/ prefix. You can use these labels as node selectors in your pod spec to better control where your pod gets scheduled. I would recommend using the nvidia.com/gpu.product label for your use case.
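
For illustration, a pod spec along these lines might look like the sketch below. It is not taken from the thread: the pod name, container image, and command are placeholders, and the label value `Tesla-T4-SHARED` is an assumption (with time-slicing enabled, the product label typically carries a `-SHARED` suffix). Check the exact value on your own nodes before relying on it.

```yaml
# Minimal sketch, assuming the T4 nodes carry the label value shown below.
# List the actual values first with:
#   kubectl get nodes -L nvidia.com/gpu.product
apiVersion: v1
kind: Pod
metadata:
  name: t4-workload                  # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    # Assumed label value; replace with what your T4 nodes actually report.
    nvidia.com/gpu.product: Tesla-T4-SHARED
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example image; substitute your own
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1            # one time-sliced replica of the selected GPU type
```

The resource name stays `nvidia.com/gpu`; the node selector is what restricts the pod to nodes with the chosen GPU model.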

abravalheri commented 3 months ago

Thank you! That seems to work.

NVIDIA / gpu-operator

Question: how to select a specific GPU type? Can I use a different name for the resources, other than `nvidia.com/gpu`? #897