NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

How to use specific NVIDIA GPU type(model) in pod yaml #143

Closed estherxyz closed 4 years ago

estherxyz commented 5 years ago

1. Issue or feature description

There are different NVIDIA GPU models installed in a k8s cluster (or on a single machine). How can I specify a particular NVIDIA GPU type in the k8s YAML?

Different GPU types may use different CUDA versions, and the Docker image needs to support the same CUDA version as the GPU device.

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

Common error checking:

==============NVSMI LOG==============

Timestamp                           : Wed Oct 23 13:55:31 2019
Driver Version                      : 435.21
CUDA Version                        : 10.1

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Product Name                    : GeForce GTX 1080 Ti
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-c93652fd-383f-3e3b-dd65-ce5b718eee3e
    Minor Number                    : 0
    VBIOS Version                   : 86.02.39.40.19
    MultiGPU Board                  : No
    Board ID                        : 0x100
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.01.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B0610DE
        Bus Id                      : 00000000:01:00.0
        Sub System Id               : 0x376B1458
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 8x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 21 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 11176 MiB
        Used                        : 0 MiB
        Free                        : 11176 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending Page Blacklist      : N/A
    Temperature
        GPU Current Temp            : 31 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 19.61 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 300.00 W
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 1911 MHz
        SM                          : 1911 MHz
        Memory                      : 5505 MHz
        Video                       : 1620 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Additional information that might help better understand your environment and reproduce the bug:

Client:
 Version:           18.09.7
 API version:       1.39
 Go version:        go1.10.1
 Git commit:        2d0083d
 Built:             Fri Aug 16 14:20:06 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.09.7
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.1
  Git commit:       2d0083d
  Built:            Wed Aug 14 19:41:23 2019
  OS/Arch:          linux/amd64
  Experimental:     false
Linux 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
version: 1.0.5
build date: 2019-09-06T16:59+00:00
build revision: 13b836390888f7b7c7dca115d16d7e28ab15a836
build compiler: x86_64-linux-gnu-gcc-7 7.4.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
RenaudWasTaken commented 4 years ago

Hello!

There are different NVIDIA GPU models installed in a k8s cluster (or on a single machine). How can I specify a particular NVIDIA GPU type in the k8s YAML?

You will need to tag the nodes with a label (e.g. nvidia.com/gpu.family); you can then use a nodeSelector in your pod spec to target specific nodes.

Take a look at GPU feature discovery, which can label nodes for you automatically: https://github.com/NVIDIA/gpu-feature-discovery
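
For illustration, a minimal sketch of what such a pod spec might look like, assuming gpu-feature-discovery is running and has set the nvidia.com/gpu.product label on the node. The pod name, image tag, and label value below are assumptions; the exact label value depends on the driver and GFD version, so check what your nodes actually report before copying it.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-on-1080ti              # hypothetical pod name
spec:
  restartPolicy: Never
  nodeSelector:
    # Label set by gpu-feature-discovery. The value shown is an assumption;
    # read the real one from your node's labels.
    nvidia.com/gpu.product: GeForce-GTX-1080-Ti
  containers:
    - name: cuda
      # Placeholder image; pick one whose CUDA version the node's driver supports.
      image: nvidia/cuda:10.1-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1         # resource advertised by the device plugin
```

Running `kubectl get nodes --show-labels` lists the labels present on each node. Without GFD, a label can be added by hand with `kubectl label nodes <node-name> nvidia.com/gpu.family=pascal` (the value here is only an example). Note that this selects a node, not an individual GPU, so it only helps when every GPU on a node is the same model.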

pbxqdown commented 2 years ago

I have a single-node cluster with multiple types of GPUs installed (one 3070, one 3080). In this case the nodeSelector solution would not work. Is it possible for the k8s device plugin to distinguish between different GPU resource types?

klueska commented 2 years ago

@pbxqdown At the moment this is not possible. Only a single GPU type per node is supported. However, we are planning to add support for this in the coming months. Stay tuned.

pbxqdown commented 2 years ago

@klueska Thanks, this is awesome! Let me know if I can be of any help with testing or something.

dasantonym commented 2 years ago

@klueska Is there any news on this feature, or is there another issue to watch for it? We have one node with mixed GPU types and it would be great to have this granularity when requesting resources.

jasine commented 1 year ago

Any progress?

klueska commented 1 year ago

We added support about 6 months ago to allow such setups to be detected and to let users assign a different resource name to each GPU type (e.g. nvidia.com/rtx-2080 vs nvidia.com/rtx-3090), but it got reverted because our product team wasn’t happy putting arbitrary resource naming in the hands of users.

This is how it would have worked: https://docs.google.com/document/d/1dL67t9IqKC2-xqonMi6DV7W2YNZdkmfX7ibB6Jb-qmk/edit
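
Going only by the resource names mentioned in the comment above, a request under that reverted design would presumably have looked something like the sketch below. This is purely illustrative: the released device plugin does not advertise nvidia.com/rtx-3090, so this pod would stay Pending on a real cluster, and the pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: needs-rtx-3090              # hypothetical pod name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.0-base  # placeholder image
      command: ["nvidia-smi"]
      resources:
        limits:
          # Per-model resource name taken from the comment above.
          # Not available in the released plugin (the feature was reverted).
          nvidia.com/rtx-3090: 1
```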

jasine commented 1 year ago

We added support about 6 months ago to allow such setups to be detected and to let users assign a different resource name to each GPU type (e.g. nvidia.com/rtx-2080 vs nvidia.com/rtx-3090), but it got reverted because our product team wasn’t happy putting arbitrary resource naming in the hands of users.

This is how it would have worked: https://docs.google.com/document/d/1dL67t9IqKC2-xqonMi6DV7W2YNZdkmfX7ibB6Jb-qmk/edit

So, will this feature not be supported in the future, or is a new plan in the works?