NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Allocatable gpu value not correct after configuring time slicing #684

Open shashiranjan84 opened 8 months ago

shashiranjan84 commented 8 months ago

Allocatable GPU value is not correct after configuring time slicing:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 12
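
For context (not part of the original report): a ConfigMap like this takes effect once it is referenced from the ClusterPolicy's devicePlugin.config section. A minimal sketch, assuming the default ClusterPolicy name cluster-policy and the gpu-operator namespace:

# Point the device plugin at the time-slicing ConfigMap and use the "any" entry by default.
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'

With replicas: 12 and a single physical GPU on the g4dn.2xlarge, the node advertises 12 nvidia.com/gpu replicas, which matches the capacity reported below.
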
Capacity:
  cpu:                8
  ephemeral-storage:  209702892Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32386524Ki
  nvidia.com/gpu:     12
  pods:               29
Allocatable:
  cpu:                7910m
  ephemeral-storage:  192188443124
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             31696348Ki
  nvidia.com/gpu:     12
  pods:               29

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                250m (3%)      0 (0%)
  memory             10310Mi (33%)  10410Mi (33%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  nvidia.com/gpu     4     

Relevant node labels

kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-172-17-12-22.eu-west-1.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=g4dn.2xlarge
nvidia.com/cuda.driver.major=535
nvidia.com/cuda.driver.minor=161
nvidia.com/cuda.driver.rev=07
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=2
nvidia.com/gfd.timestamp=1711077403
nvidia.com/gpu.compute.major=7
nvidia.com/gpu.compute.minor=5
nvidia.com/gpu.count=1
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=true
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=turing
nvidia.com/gpu.machine=g4dn.2xlarge
nvidia.com/gpu.memory=15360
nvidia.com/gpu.present=true
nvidia.com/gpu.product=Tesla-T4-SHARED
nvidia.com/gpu.replicas=12
nvidia.com/mig.capable=false
nvidia.com/mig.strategy=single

Pod status

NAME                                                              READY   STATUS      RESTARTS   AGE
eu-west-1-dd-datadog-9fjwr                                    3/3     Running     0          4h2m
eu-west-1-dd-datadog-cluster-agent-79dbdcdd75-mt97s           1/1     Running     0          4h2m
eu-west-1-dd-datadog-s7527                                    3/3     Running     0          4h1m
eu-west-1-dd-kube-state-metrics-5b7b7bb44-mz4zm               1/1     Running     0          7d5h
eu-west-1-prod-gpuo-node-feature-discovery-gc-5b8848fhhpfw   1/1     Running     0          5h7m
eu-west-1-prod-gpuo-node-feature-discovery-master-58dt88ct   1/1     Running     0          5h7m
eu-west-1-prod-gpuo-node-feature-discovery-worker-r5pb2      1/1     Running     0          5h9m
eu-west-1-prod-gpuo-node-feature-discovery-worker-xgtst      1/1     Running     0          5h9m
gpu-feature-discovery-j72bc                                       2/2     Running     0          21m
gpu-feature-discovery-sqlc6                                       2/2     Running     0          20m
gpu-operator-675d95bdb9-zdhgw                                     1/1     Running     0          5h9m
nvidia-cuda-validator-cxr72                                       0/1     Completed   0          5h7m
nvidia-cuda-validator-pfnvx                                       0/1     Completed   0          5h7m
nvidia-dcgm-exporter-842sq                                        1/1     Running     0          5h7m
nvidia-dcgm-exporter-nvzn8                                        1/1     Running     0          5h7m
nvidia-device-plugin-daemonset-dlj4s                              2/2     Running     0          59m
nvidia-device-plugin-daemonset-xkpqp                              2/2     Running     0          59m
nvidia-operator-validator-ncgkr                                   1/1     Running     0          5h7m
nvidia-operator-validator-zrqd9                                   1/1     Running     0          5h7m

1. Quick Debug Information

Kernel Version: 5.10.210-201.852.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.11
Kubelet Version: v1.29.0-eks-5e0fdde
Kube-Proxy Version: v1.29.0-eks-5e0fdde
GPU Operator Version: v23.9.2

2. Issue or feature description

Allocatable GPU should be 8, since 4 of the 12 replicas are already allocated.
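
As a cross-check (not part of the original report), the GPU requests of the pods scheduled on the node can be summed and compared with the node's allocatable count; kubectl describe only shows the aggregate. A minimal sketch using kubectl and jq (both assumed available; the node name is the one from the labels above):

NODE=ip-172-17-12-22.eu-west-1.compute.internal

# Sum the nvidia.com/gpu requests of every container scheduled on the node.
kubectl get pods --all-namespaces --field-selector spec.nodeName=$NODE -o json \
  | jq '[.items[].spec.containers[].resources.requests["nvidia.com/gpu"] // "0" | tonumber] | add'

# Compare with what the node advertises as allocatable.
kubectl get node $NODE -o json | jq -r '.status.allocatable["nvidia.com/gpu"]'

For reference, kubectl derives allocatable from capacity minus system reservations; the amount the scheduler still has to hand out is allocatable minus the summed requests.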

shivamerla commented 8 months ago

@shashiranjan84 in your time-slicing config you have set replicas: 12, hence that many GPUs are reported as allocatable. Not sure why you think it should be 8?

shashiranjan84 commented 8 months ago

@shashiranjan84 in your time-slicing config you have set replicas: 12, hence that many GPUs are reported as allocatable. Not sure why you think it should be 8?

Shouldn't it be showing how many GPUs are left? Then what is the difference between capacity and allocatable? Since 4 GPUs are already allocated, I thought allocatable should be 8, no?

shivamerla commented 8 months ago

@shashiranjan84 sorry, I missed that you are running pods using GPUs. Yes, it should have been reflected. @klueska any thoughts?

Assuming the pods below are the ones using GPUs?

eu-west-1-dd-datadog-9fjwr                                    3/3     Running     0          4h2m
eu-west-1-dd-datadog-cluster-agent-79dbdcdd75-mt97s           1/1     Running     0          4h2m
eu-west-1-dd-datadog-s7527                                    3/3     Running     0          4h1m
eu-west-1-dd-kube-state-metrics-5b7b7bb44-mz4zm               1/1     Running     0          7d5h

shashiranjan84 commented 8 months ago

Yes, there is one pod running with a GPU, which can be seen in the allocated resources:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                250m (3%)      0 (0%)
  memory             10310Mi (33%)  10410Mi (33%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  nvidia.com/gpu     4  
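
To see which workloads those replicas belong to (a sketch, not from the thread; jq assumed available), the pods that declare an nvidia.com/gpu request can be listed directly:

# List namespace/name of every pod that requests at least one nvidia.com/gpu replica.
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select([.spec.containers[].resources.requests["nvidia.com/gpu"] // "0" | tonumber] | add > 0)
      | .metadata.namespace + "/" + .metadata.name'
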
msherm2 commented 2 months ago

I am seeing this issue as well: the allocated resources are not being accounted for correctly on this version of the GPU Operator.

Kernel Version: 4.18.0-513.24.1.el8_9.x86_64
OS Image: RHEL 9.3
Operating System: linux
Architecture: x86_64
Container Runtime Version: containerd://1.7.11
Kubelet Version: v1.26.15+rke2r1
Kube-Proxy Version: v1.26.15+rke2r1
GPU Operator Version: v24.6.0
NVIDIA-SMI 560.28.03, Driver Version: 560.28.03, CUDA Version: 12.6

I have confirmed beyond a doubt that the pod is using the GPU, but the allocated resources count is not updating with the usage. This is the first time I have seen this issue; I did not see this problem with GPU Operator 23.9.1.

Capacity:
  cpu:                8
  ephemeral-storage:  209702892Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32166136Ki
  nvidia.com/gpu:     4
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  203998973178
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32166136Ki
  nvidia.com/gpu:     4
  pods:               110

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                3075m (38%)    1100m (13%)
  memory             11214Mi (35%)  28202Mi (89%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  nvidia.com/gpu     0              0

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:1E.0 Off |                    0 |
| N/A   34C    P0             27W /   70W |    1467MiB /  15360MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    1708723      C   python3                                     1464MiB |
+-----------------------------------------------------------------------------------------+
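
As a general Kubernetes note, the Allocated resources table only sums the nvidia.com/gpu requests and limits declared in pod specs, so a process can appear in nvidia-smi while the table still reads 0 if its pod never requested the resource. A minimal test pod that does declare the resource can show whether the node's accounting moves; the pod name and image tag below are illustrative and may need adjusting for your environment:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-accounting-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04
      command: ["sh", "-c", "nvidia-smi; sleep 300"]   # stay Running long enough to inspect the node
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

While this pod is Running, kubectl describe node should show the nvidia.com/gpu request and limit incremented by 1; once it terminates, the count drops again.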