NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

DCGM-exporter pods stuck in Running state, not getting Ready without GPU allocation. #385

Open rohitreddy1698 opened 2 months ago

rohitreddy1698 commented 2 months ago

Ask your question

Hi Team,

I am using the dcgm-exporter, installed as a Helm Chart. I am using the default values.

I have other Milvus component pods (query node and index node) successfully scheduled, READY, and running on GPU nodes. Logging onto a pod and running the nvidia-smi command succeeds.

But the dcgm-exporter DaemonSet pods are stuck Running without ever becoming Ready:

➜ VectorDBBench git:(main) ✗ kubectl get pods
NAME                                       READY   STATUS    RESTARTS   AGE
dcgm-exporter-btsln                        0/1     Running   0          46s
dcgm-exporter-c8gpg                        0/1     Running   0          46s
dcgm-exporter-f9jd7                        0/1     Running   0          46s
dcgm-exporter-xhs2v                        0/1     Running   0          46s
dcgm-exporter-z4pz4                        0/1     Running   0          46s
dcgm-exporter-zh854                        0/1     Running   0          46s
vectordbbench-deployment-cf7974db6-r5scd   1/1     Running   0          41h
➜ VectorDBBench git:(main) ✗ 

➜ VectorDBBench git:(main) ✗ kubectl logs dcgm-exporter-btsln
2024/09/03 10:31:11 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
time="2024-09-03T10:31:11Z" level=info msg="Starting dcgm-exporter"
time="2024-09-03T10:31:11Z" level=info msg="DCGM successfully initialized!"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-09-03T10:31:12Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 19 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-09-03T10:31:12Z" level=info msg="Initializing system entities of type: GPU"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
time="2024-09-03T10:31:12Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-09-03T10:31:12Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-09-03T10:31:12Z" level=info msg="Pipeline starting"
time="2024-09-03T10:31:12Z" level=info msg="Starting webserver"
time="2024-09-03T10:31:12Z" level=info msg="Listening on" address="[::]:9400"
time="2024-09-03T10:31:12Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false

But on assigning a GPU as a resource to it, the pods deploy and become Ready successfully:

➜ VectorDBBench git:(main) ✗ cat custom_values.yaml
resources:
  limits:
    nvidia.com/gpu: "1"

➜ VectorDBBench git:(main) ✗ 

➜ VectorDBBench git:(main) ✗ kubectl get pods
NAME                                       READY   STATUS    RESTARTS   AGE
dcgm-exporter-8ds87                        1/1     Running   0          4m16s
dcgm-exporter-8qd48                        1/1     Running   0          4m16s
dcgm-exporter-d9hq7                        1/1     Running   0          4m16s
dcgm-exporter-hsbbq                        1/1     Running   0          4m16s
dcgm-exporter-t49tt                        1/1     Running   0          4m16s
dcgm-exporter-xq57b                        1/1     Running   0          4m16s
vectordbbench-deployment-cf7974db6-r5scd   1/1     Running   0          41h
➜ VectorDBBench git:(main) ✗ 

➜ VectorDBBench git:(main) ✗ kubectl logs dcgm-exporter-8ds87
2024/09/03 10:03:35 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
time="2024-09-03T10:03:35Z" level=info msg="Starting dcgm-exporter"
time="2024-09-03T10:03:35Z" level=info msg="DCGM successfully initialized!"
time="2024-09-03T10:03:35Z" level=info msg="Collecting DCP Metrics"
time="2024-09-03T10:03:35Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-09-03T10:03:35Z" level=info msg="Initializing system entities of type: GPU"
time="2024-09-03T10:03:35Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-09-03T10:03:35Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-09-03T10:03:35Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-09-03T10:03:35Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-09-03T10:03:35Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-09-03T10:03:35Z" level=info msg="Pipeline starting"
time="2024-09-03T10:03:35Z" level=info msg="Starting webserver"
time="2024-09-03T10:03:35Z" level=info msg="Listening on" address="[::]:9400"
time="2024-09-03T10:03:35Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
➜ VectorDBBench git:(main) ✗ 

But this dedicates a GPU to the exporter and blocks other services from using it.

Thanks, Rohit Mothe

nvvfedorov commented 2 months ago

@rohitreddy1698, please enable debug mode in the dcgm-exporter by setting the environment variable DCGM_EXPORTER_DEBUG=true, then share the logs with us.
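For reference, a minimal sketch of the corresponding Helm values override, assuming the chart's extraEnv list:

extraEnv:
  - name: DCGM_EXPORTER_DEBUG
    value: "true"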

rohitreddy1698 commented 2 months ago

@nvvfedorov, sure. Here are the logs after setting the debug variable to true.

➜  VectorDBBench git:(main) ✗ cat default_values.yaml 
image:
  repository: <mirror-docker-repo>/nvidia/k8s/dcgm-exporter

arguments:
  - "-f"
  - /etc/dcgm-exporter/dcp-metrics-included.csv
  - "-d"
  - f

extraEnv:
- name: "DCGM_EXPORTER_DEBUG"
  value: "true"

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-accelerator
          operator: In
          values:
          - nvidia-tesla-t4

priorityClassName: dcgm-exporter # new priority class because system-node-critical was full

➜  VectorDBBench git:(main) ✗ helm install --generate-name gpu-helm-charts/dcgm-exporter -f default_values.yaml
NAME: dcgm-exporter-1725446363
LAST DEPLOYED: Wed Sep  4 16:09:25 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
  export POD_NAME=$(kubectl get pods -n default -l "app.kubernetes.io/name=dcgm-exporter,app.kubernetes.io/instance=dcgm-exporter-1725446363" -o jsonpath="{.items[0].metadata.name}")
  kubectl -n default port-forward $POD_NAME 8080:9400 &
  echo "Visit http://127.0.0.1:8080/metrics to use your application"
➜  VectorDBBench git:(main) ✗ 

➜  VectorDBBench git:(main) ✗ kubectl get pods -o wide                       
NAME                                        READY   STATUS             RESTARTS        AGE     IP             NODE                                                  NOMINATED NODE   READINESS GATES
dcgm-exporter-1725446363-6xmd2              0/1     CrashLoopBackOff   6 (2m4s ago)    10m     240.16.10.17   gke-isds-genai-milvu-isds-genai-milvu-b74de53a-4xzj   <none>           <none>
dcgm-exporter-1725446363-8nzfd              0/1     CrashLoopBackOff   6 (2m14s ago)   10m     240.16.5.17    gke-isds-genai-milvu-isds-genai-milvu-309eed12-wwn5   <none>           <none>
dcgm-exporter-1725446363-96r99              0/1     CrashLoopBackOff   6 (2m4s ago)    10m     240.16.1.7     gke-isds-genai-milvu-isds-genai-milvu-309eed12-v5gd   <none>           <none>
dcgm-exporter-1725446363-k7rf6              0/1     CrashLoopBackOff   6 (2m9s ago)    10m     240.16.12.7    gke-isds-genai-milvu-isds-genai-milvu-b74de53a-vmsl   <none>           <none>
dcgm-exporter-1725446363-kfh9n              0/1     CrashLoopBackOff   7 (79s ago)     10m     240.16.9.17    gke-isds-genai-milvu-isds-genai-milvu-b74de53a-ntsq   <none>           <none>
dcgm-exporter-1725446363-pnvf9              0/1     CrashLoopBackOff   6 (2m19s ago)   10m     240.16.4.16    gke-isds-genai-milvu-isds-genai-milvu-309eed12-mnzb   <none>           <none>
vectordbbench-deployment-86c7d569b4-pb5r7   1/1     Running            0               5h20m   240.16.3.21    gke-isds-genai-milvu-isds-genai-milvu-4f7d36a6-7768   <none>           <none>
➜  VectorDBBench git:(main) ✗ 

➜  VectorDBBench git:(main) ✗ kubectl get nodes -o jsonpath="{range .items[*]}{.metadata.name}: {.status.allocatable}{'\n'}{end}"
gke-isds-genai-milvu-isds-genai-milvu-309eed12-mnzb: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"48418288Ki","nvidia.com/gpu":"2","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-309eed12-v5gd: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"48418288Ki","nvidia.com/gpu":"2","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-309eed12-wwn5: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"48418288Ki","nvidia.com/gpu":"2","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-3fb32019-05h3: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"60048872Ki","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-3fb32019-05jh: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"60048872Ki","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-4f7d36a6-7768: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"60048872Ki","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-4f7d36a6-s9f9: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"60048872Ki","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-b74de53a-4xzj: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"48418288Ki","nvidia.com/gpu":"2","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-b74de53a-ntsq: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"48418288Ki","nvidia.com/gpu":"2","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-b74de53a-vmsl: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"48418288Ki","nvidia.com/gpu":"2","pods":"110"}

The logs from the dcgm-exporter pods:

➜  VectorDBBench git:(main) ✗ kubectl logs dcgm-exporter-1725446363-6xmd2
2024/09/04 10:43:17 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
time="2024-09-04T10:43:17Z" level=info msg="Starting dcgm-exporter"
time="2024-09-04T10:43:17Z" level=debug msg="Debug output is enabled"
time="2024-09-04T10:43:17Z" level=debug msg="Command line: /usr/bin/dcgm-exporter -f /etc/dcgm-exporter/dcp-metrics-included.csv -d f"
time="2024-09-04T10:43:17Z" level=debug msg="Loaded configuration" dump="&{CollectorsFile:/etc/dcgm-exporter/dcp-metrics-included.csv Address::9400 CollectInterval:30000 Kubernetes:true KubernetesGPUIdType:uid CollectDCP:true UseOldNamespace:false UseRemoteHE:false RemoteHEInfo:localhost:5555 GPUDevices:{Flex:true MajorRange:[] MinorRange:[]} SwitchDevices:{Flex:true MajorRange:[] MinorRange:[]} CPUDevices:{Flex:true MajorRange:[] MinorRange:[]} NoHostname:false UseFakeGPUs:false ConfigMapData:none MetricGroups:[] WebSystemdSocket:false WebConfigFile: XIDCountWindowSize:300000 ReplaceBlanksInModelName:false Debug:true ClockEventsCountWindowSize:300000 EnableDCGMLog:false DCGMLogLevel:NONE PodResourcesKubeletSocket:/var/lib/kubelet/pod-resources/kubelet.sock HPCJobMappingDir: NvidiaResourceNames:[]}"
time="2024-09-04T10:43:17Z" level=info msg="DCGM successfully initialized!"
time="2024-09-04T10:43:17Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-09-04T10:43:17Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-09-04T10:43:17Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-09-04T10:43:17Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-09-04T10:43:17Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-09-04T10:43:17Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-09-04T10:43:17Z" level=warning msg="Skipping line 24 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-09-04T10:43:17Z" level=info msg="Initializing system entities of type: GPU"
time="2024-09-04T10:43:17Z" level=info msg="Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
time="2024-09-04T10:43:17Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-09-04T10:43:17Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-09-04T10:43:17Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-09-04T10:43:17Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-09-04T10:43:17Z" level=debug msg="Counters are initialized" dump="[{FieldID:100 FieldName:DCGM_FI_DEV_SM_CLOCK PromType:gauge Help:SM clock frequency (in MHz).} {FieldID:101 FieldName:DCGM_FI_DEV_MEM_CLOCK PromType:gauge Help:Memory clock frequency (in MHz).} {FieldID:140 FieldName:DCGM_FI_DEV_MEMORY_TEMP PromType:gauge Help:Memory temperature (in C).} {FieldID:150 FieldName:DCGM_FI_DEV_GPU_TEMP PromType:gauge Help:GPU temperature (in C).} {FieldID:155 FieldName:DCGM_FI_DEV_POWER_USAGE PromType:gauge Help:Power draw (in W).} {FieldID:156 FieldName:DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION PromType:counter Help:Total energy consumption since boot (in mJ).} {FieldID:202 FieldName:DCGM_FI_DEV_PCIE_REPLAY_COUNTER PromType:counter Help:Total number of PCIe retries.} {FieldID:203 FieldName:DCGM_FI_DEV_GPU_UTIL PromType:gauge Help:GPU utilization (in %).} {FieldID:204 FieldName:DCGM_FI_DEV_MEM_COPY_UTIL PromType:gauge Help:Memory utilization (in %).} {FieldID:206 FieldName:DCGM_FI_DEV_ENC_UTIL PromType:gauge Help:Encoder utilization (in %).} {FieldID:207 FieldName:DCGM_FI_DEV_DEC_UTIL PromType:gauge Help:Decoder utilization (in %).} {FieldID:230 FieldName:DCGM_FI_DEV_XID_ERRORS PromType:gauge Help:Value of the last XID error encountered.} {FieldID:251 FieldName:DCGM_FI_DEV_FB_FREE PromType:gauge Help:Framebuffer memory free (in MiB).} {FieldID:252 FieldName:DCGM_FI_DEV_FB_USED PromType:gauge Help:Framebuffer memory used (in MiB).} {FieldID:449 FieldName:DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL PromType:counter Help:Total number of NVLink bandwidth counters for all lanes.} {FieldID:526 FieldName:DCGM_FI_DEV_VGPU_LICENSE_STATUS PromType:gauge Help:vGPU License status} {FieldID:393 FieldName:DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS PromType:counter Help:Number of remapped rows for uncorrectable errors} {FieldID:394 FieldName:DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS PromType:counter Help:Number of remapped rows for correctable errors} {FieldID:395 FieldName:DCGM_FI_DEV_ROW_REMAP_FAILURE PromType:gauge Help:Whether remapping of rows has failed} {FieldID:1 FieldName:DCGM_FI_DRIVER_VERSION PromType:label Help:Driver Version}]"
time="2024-09-04T10:43:17Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-09-04T10:43:17Z" level=info msg="Starting webserver"
time="2024-09-04T10:43:17Z" level=info msg="Pipeline starting"
time="2024-09-04T10:43:17Z" level=info msg="Listening on" address="[::]:9400"
time="2024-09-04T10:43:17Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
➜  VectorDBBench git:(main) ✗ 
➜  VectorDBBench git:(main) ✗ 
➜  VectorDBBench git:(main) ✗ kubectl describe /poddcgm-exporter-1725446363-6xmd2
error: arguments in resource/name form must have a single resource and name
➜  VectorDBBench git:(main) ✗ kubectl describe pod/dcgm-exporter-1725446363-6xmd2 
Name:                 dcgm-exporter-1725446363-6xmd2
Namespace:            default
Priority:             1000000
Priority Class Name:  dcgm-exporter
Service Account:      dcgm-exporter-1725446363
Node:                 gke-isds-genai-milvu-isds-genai-milvu-b74de53a-4xzj/172.18.63.202
Start Time:           Wed, 04 Sep 2024 16:09:31 +0530
Labels:               app.kubernetes.io/component=dcgm-exporter
                      app.kubernetes.io/instance=dcgm-exporter-1725446363
                      app.kubernetes.io/name=dcgm-exporter
                      controller-revision-hash=77dfdc9d74
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: e04a790705be8c32c8d0c0ef00912eac83a1f7846a58d3cb64c7a65828296e7a
                      cni.projectcalico.org/podIP: 240.16.10.17/32
                      cni.projectcalico.org/podIPs: 240.16.10.17/32
Status:               Running
IP:                   240.16.10.17
IPs:
  IP:           240.16.10.17
Controlled By:  DaemonSet/dcgm-exporter-1725446363
Containers:
  exporter:
    Container ID:  containerd://82ffb986ed8329ea6b65f5ff017699f4835a55256c1d630a322ccf32c0a50db9
    Image:         <internal-repo>/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04
    Image ID:      <internal-repo>/nvidia/k8s/dcgm-exporter@sha256:98781424e83e14e8855aa6881e5ca8e68c81fdc75c82dd1bb3fe924349aee9d4
    Port:          9400/TCP
    Host Port:     0/TCP
    Args:
      -f
      /etc/dcgm-exporter/dcp-metrics-included.csv
      -d
      f
    State:          Running
      Started:      Wed, 04 Sep 2024 16:13:16 +0530
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 04 Sep 2024 16:12:21 +0530
      Finished:     Wed, 04 Sep 2024 16:13:16 +0530
    Ready:          False
    Restart Count:  4
    Liveness:       http-get http://:9400/health delay=45s timeout=1s period=5s #success=1 #failure=3
    Readiness:      http-get http://:9400/health delay=45s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DCGM_EXPORTER_KUBERNETES:  true
      DCGM_EXPORTER_LISTEN:      :9400
      NODE_NAME:                  (v1:spec.nodeName)
      DCGM_EXPORTER_DEBUG:       true
    Mounts:
      /var/lib/kubelet/pod-resources from pod-gpu-resources (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lt9r6 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  pod-gpu-resources:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/pod-resources
    HostPathType:  
  kube-api-access-lt9r6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m31s                  default-scheduler  Successfully assigned default/dcgm-exporter-1725446363-6xmd2 to gke-isds-genai-milvu-isds-genai-milvu-b74de53a-4xzj
  Normal   Pulled     2m36s (x3 over 4m31s)  kubelet            Container image "<internal-repo>/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04" already present on machine
  Normal   Created    2m36s (x3 over 4m31s)  kubelet            Created container exporter
  Normal   Started    2m36s (x3 over 4m30s)  kubelet            Started container exporter
  Normal   Killing    2m36s (x2 over 3m31s)  kubelet            Container exporter failed liveness probe, will be restarted
  Warning  Unhealthy  2m35s                  kubelet            Readiness probe failed: Get "http://240.16.10.17:9400/health": dial tcp 240.16.10.17:9400: connect: connection refused
  Warning  Unhealthy  111s (x7 over 3m41s)   kubelet            Liveness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  111s (x6 over 3m41s)   kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
➜  VectorDBBench git:(main) ✗ 
nvvfedorov commented 2 months ago

The error "Cannot perform the requested operation because NVML doesn't exist on this system." tell us that is something wrong with your K8S node configuration. Did you installed (NVIDIA container toolkit)[https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html]?

rohitreddy1698 commented 2 months ago

Yes, I have installed the NVIDIA container toolkit.

I already have pods using the GPU, so NVML should be correctly installed.

➜  VectorDBBench git:(main) ✗ kubectl exec -it isds-milvus-milvus-querynode-0-54b54d6fbc-2pjmc -n isds-milvus -- /bin/bash
Defaulted container "querynode" out of: querynode, config (init)
root@isds-milvus-milvus-querynode-0-54b54d6fbc-2pjmc:/milvus# nvidia-smi
Thu Sep  5 05:36:55 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   77C    P0             36W /   70W |    7273MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
root@isds-milvus-milvus-querynode-0-54b54d6fbc-2pjmc:/milvus# 

One more confirmation: if I assign GPU resources to the DCGM exporter pods, they work fine.

rohitreddy1698 commented 2 months ago

@nvvfedorov, did you have a chance to take a look at this?

Thanks, Rohit

hanxiaop commented 1 month ago

You may need to specify runtimeClassName: nvidia in your dcgm pod spec.
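For illustration, a minimal sketch of what that could look like on a DaemonSet pod template; the names and image below are placeholders, and it assumes a RuntimeClass named nvidia exists on the cluster:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      # runtimeClassName selects the "nvidia" runtime handler so the container
      # runtime injects the driver libraries (including NVML) into the container
      runtimeClassName: nvidia
      containers:
        - name: exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04
          ports:
            - containerPort: 9400

Whether the Helm chart exposes runtimeClassName directly depends on the chart version, so it may need to be patched onto the rendered DaemonSet.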

petewall commented 1 month ago

I'm having the same issue: a GKE cluster with a V100 GPU, and DCGM deployed using this Helm chart with the following values:

extraEnv:
  - name: DCGM_EXPORTER_DEBUG
    value: "true"
  - name: DCGM_EXPORTER_INTERVAL
    value: "10000"

tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule

The exporter logs:

2024/10/09 16:47:56 maxprocs: Leaving GOMAXPROCS=1: CPU quota undefined
time="2024-10-09T16:47:56Z" level=info msg="Starting dcgm-exporter"
time="2024-10-09T16:47:56Z" level=debug msg="Debug output is enabled"
time="2024-10-09T16:47:56Z" level=debug msg="Command line: /usr/bin/dcgm-exporter -f /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2024-10-09T16:47:56Z" level=debug msg="Loaded configuration" dump="&{CollectorsFile:/etc/dcgm-exporter/dcp-metrics-included.csv Address::9400 CollectInterval:10000 Kubernetes:true KubernetesGPUIdType:uid CollectDCP:true UseOldNamespace:false UseRemoteHE:false RemoteHEInfo:localhost:5555 GPUDevices:{Flex:true MajorRange:] MinorRange:]} SwitchDevices:{Flex:true MajorRange:] MinorRange:]} CPUDevices:{Flex:true MajorRange:] MinorRange:]} NoHostname:false UseFakeGPUs:false ConfigMapData:none MetricGroups:] WebSystemdSocket:false WebConfigFile: XIDCountWindowSize:300000 ReplaceBlanksInModelName:false Debug:true ClockEventsCountWindowSize:300000 EnableDCGMLog:false DCGMLogLevel:NONE PodResourcesKubeletSocket:/var/lib/kubelet/pod-resources/kubelet.sock HPCJobMappingDir: NvidiaResourceNames:]}"
time="2024-10-09T16:47:56Z" level=info msg="DCGM successfully initialized!"
time="2024-10-09T16:47:56Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-10-09T16:47:56Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-10-09T16:47:56Z" level=warning msg="Skipping line 19 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-10-09T16:47:56Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-10-09T16:47:56Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-10-09T16:47:56Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-10-09T16:47:56Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-10-09T16:47:56Z" level=info msg="Initializing system entities of type: GPU"
time="2024-10-09T16:47:56Z" level=info msg="Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
time="2024-10-09T16:47:56Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-10-09T16:47:56Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-10-09T16:47:56Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-10-09T16:47:56Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-10-09T16:47:56Z" level=debug msg="Counters are initialized" dump="[{FieldID:100 FieldName:DCGM_FI_DEV_SM_CLOCK PromType:gauge Help:SM clock frequency (in MHz).} {FieldID:101 FieldName:DCGM_FI_DEV_MEM_CLOCK PromType:gauge Help:Memory clock frequency (in MHz).} {FieldID:140 FieldName:DCGM_FI_DEV_MEMORY_TEMP PromType:gauge Help:Memory temperature (in C).} {FieldID:150 FieldName:DCGM_FI_DEV_GPU_TEMP PromType:gauge Help:GPU temperature (in C).} {FieldID:155 FieldName:DCGM_FI_DEV_POWER_USAGE PromType:gauge Help:Power draw (in W).} {FieldID:156 FieldName:DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION PromType:counter Help:Total energy consumption since boot (in mJ).} {FieldID:202 FieldName:DCGM_FI_DEV_PCIE_REPLAY_COUNTER PromType:counter Help:Total number of PCIe retries.} {FieldID:203 FieldName:DCGM_FI_DEV_GPU_UTIL PromType:gauge Help:GPU utilization (in %).} {FieldID:204 FieldName:DCGM_FI_DEV_MEM_COPY_UTIL PromType:gauge Help:Memory utilization (in %).} {FieldID:206 FieldName:DCGM_FI_DEV_ENC_UTIL PromType:gauge Help:Encoder utilization (in %).} {FieldID:207 FieldName:DCGM_FI_DEV_DEC_UTIL PromType:gauge Help:Decoder utilization (in %).} {FieldID:230 FieldName:DCGM_FI_DEV_XID_ERRORS PromType:gauge Help:Value of the last XID error encountered.} {FieldID:251 FieldName:DCGM_FI_DEV_FB_FREE PromType:gauge Help:Framebuffer memory free (in MiB).} {FieldID:252 FieldName:DCGM_FI_DEV_FB_USED PromType:gauge Help:Framebuffer memory used (in MiB).} {FieldID:449 FieldName:DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL PromType:counter Help:Total number of NVLink bandwidth counters for all lanes.} {FieldID:526 FieldName:DCGM_FI_DEV_VGPU_LICENSE_STATUS PromType:gauge Help:vGPU License status} {FieldID:393 FieldName:DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS PromType:counter Help:Number of remapped rows for uncorrectable errors} {FieldID:394 FieldName:DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS PromType:counter Help:Number of remapped rows for correctable errors} {FieldID:395 FieldName:DCGM_FI_DEV_ROW_REMAP_FAILURE PromType:gauge Help:Whether remapping of rows has failed}]"
time="2024-10-09T16:47:56Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-10-09T16:47:56Z" level=info msg="Pipeline starting"
time="2024-10-09T16:47:56Z" level=info msg="Starting webserver"
time="2024-10-09T16:47:56Z" level=info msg="Listening on" address=":9400"
time="2024-10-09T16:47:56Z" level=info msg="TLS is disabled." address=":9400" http2=false
Stream closed EOF for kube-system/dcgm-exporter-lxrnc (exporter)

I ran nvidia-smi with a pod and got this output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:05.0 Off |                    0 |
| N/A   39C    P0              25W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I'm using Google's autoprovisioned driver daemonsets, which did work for this node. A few select log statements:

nvidia-driver-installer I1009 13:32:20.104433    2531 install.go:264] Install GPU driver for device type: NVIDIA_TESLA_V100
...
nvidia-driver-installer Waiting for GPU driver libraries to be available.
nvidia-driver-installer GPU driver is installed.
...
nvidia-gpu-device-plugin I1009 13:33:24.671119    3853 metrics.go:144] nvml initialized successfully. Driver version: 535.183.01
nvidia-gpu-device-plugin I1009 13:33:24.671132    3853 devices.go:113] Found 1 GPU devices
nvidia-gpu-device-plugin I1009 13:33:24.676683    3853 devices.go:125] Found device nvidia0 for metrics collection

Can you help me figure out where to go next?

There is no nvidia RuntimeClass on my cluster.
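For context, such a RuntimeClass would be a sketch along these lines; creating it only helps if the node's containerd actually has a matching nvidia runtime handler configured, which is an assumption here:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
# handler must match a runtime handler configured in containerd on the node;
# creating the RuntimeClass object alone does not install that handler
handler: nvidia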

andehpants commented 1 month ago

@rohitreddy1698 @petewall not sure if this will help with your issue, but the extra Helm values in this guide proved to be the solution.

securityContext:
  privileged: true

extraHostVolumes:
  - name: vulkan-icd-mount
    hostPath: /home/kubernetes/bin/nvidia/vulkan/icd.d
  - name: nvidia-install-dir-host
    hostPath: /home/kubernetes/bin/nvidia

extraVolumeMounts:
  - name: nvidia-install-dir-host
    mountPath: /usr/local/nvidia
    readOnly: true
  - name: vulkan-icd-mount
    mountPath: /etc/vulkan/icd.d
    readOnly: true

extraEnv:
- name: DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE
  value: device-name

My GPU node pool is running with COS and my drivers were installed manually by provisioning this DaemonSet here.

archae0pteryx commented 2 weeks ago

@andehpants Thank you for your search skills! That article is very good. I also had to add a new priority class because system-node-critical was full. Here it is, for completeness, since that's how I ended up in this thread myself.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dcgm-exporter
value: 1000
globalDefault: false
description: "Custom priority class for dcgm-exporter"

nvvfedorov commented 2 weeks ago

@andehpants, @archae0pteryx, thank you for your findings! What would you suggest adding to the README file to help other users?

archae0pteryx commented 2 weeks ago

@nvvfedorov Not sure, really. Adjacent to the initial impetus of this issue, I would certainly add a little note about the PriorityClass and an example of how/why you might need to create one. TBH, I had never worked with PriorityClasses nor knew of their existence... I have my CKA, even. 🙃 When I get mine working completely I may have more to add on the topic, as I still have some issues I'm trying to work through with the exporter. That said, I'm 99.9% sure it's just a misconfig/misunderstanding on my part at this point.

rohitreddy1698 commented 2 weeks ago

Hi @archae0pteryx, thank you for the information! I had already created the PriorityClass before raising this issue: without it the DCGM exporter pods could not be scheduled at all, and after creating it the pods were scheduled and started. But I was still seeing the same issue:

➜  VectorDBBench git:(main) ✗ cat dcgm_priority_class.yaml 
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dcgm-exporter
value: 1000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for dcgm exporter pods.  
➜  VectorDBBench git:(main) ✗ 
➜  VectorDBBench git:(main) ✗ cat dcgm_values.yaml 
image:
  repository: docker-upstream.apple.com/nvidia/k8s/dcgm-exporter

arguments:
  - "-f"
  - /etc/dcgm-exporter/dcp-metrics-included.csv
  - "-d"
  - f

securityContext:
  capabilities:
    add:
    - SYS_ADMIN

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-accelerator
          operator: In
          values:
          - nvidia-tesla-t4

extraConfigMapVolumes:
  - name: exporter-metrics-volume
    configMap:
      name: exporter-metrics-config-map
      items:
        - key: metrics
          path: dcp-metrics-included.csv

extraVolumeMounts:
  - name: exporter-metrics-volume
    mountPath: /etc/dcgm-exporter/dcp-metrics-included.csv
    subPath: dcp-metrics-included.csv

priorityClassName: dcgm-exporter
➜  VectorDBBench git:(main) ✗ 
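For reference, a sketch of the ConfigMap that the extraConfigMapVolumes stanza above expects; the name and key come from those values, while the CSV rows are abbreviated and purely illustrative:

apiVersion: v1
kind: ConfigMap
metadata:
  name: exporter-metrics-config-map
data:
  metrics: |
    # Format: DCGM field, Prometheus metric type, help string
    DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).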
time="2024-09-03T10:31:11Z" level=info msg="DCGM successfully initialized!"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-09-03T10:31:12Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 19 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-09-03T10:31:12Z" level=info msg="Initializing system entities of type: GPU"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."