NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Resources are not split when using “time slicing” with the NVIDIA device plugin for Kubernetes #990

Open y-shida-tg opened 1 day ago

y-shida-tg commented 1 day ago

Following the README at "GitHub - NVIDIA/k8s-device-plugin: NVIDIA device plugin for Kubernetes", we deployed the NVIDIA device plugin for Kubernetes and are trying out time slicing, but we have run into a problem. The node reports a GPU capacity of only "1" instead of "4" (we expected 4 because the YAML sets replicas: 4), as shown below. Why is "Capacity" not increasing?

# kubectl describe node test-server
Capacity:
  nvidia.com/gpu: 1
  nvidia.com/gpu: 1

times.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
data:
  time-sliced: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

Hardware Information:
Server: PowerEdge R750 (SKU=090E, ModelName=PowerEdge R750)
CPU: Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz

GPGPU Information:
GPGPU: A100 80GB
CUDA Version: 12.2
Driver Version: 535.54.03
nvidia-container-runtime: runc version 1.0.2, spec: 1.0.2-dev, go: go1.16.7, libseccomp: 2.5.1

Linux Information:
OS: CentOS Linux release 8.5.2111

k8s environment:
kubectl version:
Client Version: version.Info{Major: "1", Minor: "23", GitVersion: "v1.23.6", GitCommit: "ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState: "clean", BuildDate: "2022-04-14T08:49:13Z", GoVersion: "go1.17.9", Compiler: "gc", Platform: "linux/amd64"}
Server Version: version.Info{Major: "1", Minor: "23", GitVersion: "v1.23.17", GitCommit: "953be8927218ec8067e1af2641e540238ffd7576", GitTreeState: "clean", BuildDate: "2023-02-22T13:27:46Z", GoVersion: "go1.19.6", Compiler: "gc", Platform: "linux/amd64"}
crio version: 1.23.5

NVIDIA device plugin for Kubernetes version used: v0.16.1

klueska commented 1 day ago

The only reason this would happen is if your plugin on the node isn't actually pointing to this config. Did you launch the plugin pointing to this config map and then update the label on the node to point to the particular time-slicing config within that config map?

https://github.com/NVIDIA/k8s-device-plugin/tree/main?tab=readme-ov-file#multiple-config-file-example
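For reference, the two steps asked about above (launching the plugin against the ConfigMap, then labeling the node with the config name) might look roughly like the sketch below. It is a sketch under assumptions, not a confirmed fix for this cluster. The node name, ConfigMap name, and config key come from the issue. The namespace `nvidia-device-plugin`, the release name `nvdp`, and the `nvdp` Helm repo alias are assumptions, and the sketch assumes the plugin was installed with Helm. The script only prints the commands unless `APPLY=1` is set.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints each command; set APPLY=1 to actually execute them.
set -euo pipefail

NODE="test-server"                 # node name taken from the issue
CONFIG_MAP="device-plugin-config"  # ConfigMap name from times.yaml
CONFIG_KEY="time-sliced"           # key under data: in the ConfigMap
NAMESPACE="nvidia-device-plugin"   # ASSUMPTION: namespace the plugin runs in
RELEASE="nvdp"                     # ASSUMPTION: Helm release name

# Print the command, or run it when APPLY=1.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# 1. Create the ConfigMap in the same namespace as the plugin
#    (times.yaml in the issue sets no namespace of its own).
run kubectl apply -n "$NAMESPACE" -f times.yaml

# 2. Launch (or upgrade) the plugin so that it reads that ConfigMap.
#    ASSUMPTION: the chart repo was added under the alias "nvdp".
run helm upgrade -i "$RELEASE" nvdp/nvidia-device-plugin \
  --version 0.16.1 \
  --namespace "$NAMESPACE" --create-namespace \
  --set config.name="$CONFIG_MAP"

# 3. Label the node so it selects the "time-sliced" config.
run kubectl label node "$NODE" --overwrite \
  "nvidia.com/device-plugin.config=$CONFIG_KEY"

# 4. Re-check the advertised capacity after the plugin pod restarts.
run kubectl get node "$NODE" \
  -o 'jsonpath={.status.capacity.nvidia\.com/gpu}'
```

If the plugin picks up the config, step 4 should report 4 for a node with one physical GPU and replicas: 4. If it still reports 1, the plugin pod logs on that node are the next place to look.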