rohitreddy1698 opened this issue 2 months ago
@rohitreddy1698, please enable debug mode in dcgm-exporter by setting the environment variable DCGM_EXPORTER_DEBUG=true, then share the logs with us.
@nvvfedorov, hi, sure. Here are the logs after setting the DEBUG variable to true.
➜ VectorDBBench git:(main) ✗ cat default_values.yaml
image:
  repository: <mirror-docker-repo>/nvidia/k8s/dcgm-exporter
arguments:
  - "-f"
  - /etc/dcgm-exporter/dcp-metrics-included.csv
  - "-d"
  - f
extraEnv:
  - name: "DCGM_EXPORTER_DEBUG"
    value: "true"
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cloud.google.com/gke-accelerator
              operator: In
              values:
                - nvidia-tesla-t4
priorityClassName: dcgm-exporter # new priority class because system-node-critical was full
➜ VectorDBBench git:(main) ✗ helm install --generate-name gpu-helm-charts/dcgm-exporter -f default_values.yaml
NAME: dcgm-exporter-1725446363
LAST DEPLOYED: Wed Sep 4 16:09:25 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
export POD_NAME=$(kubectl get pods -n default -l "app.kubernetes.io/name=dcgm-exporter,app.kubernetes.io/instance=dcgm-exporter-1725446363" -o jsonpath="{.items[0].metadata.name}")
kubectl -n default port-forward $POD_NAME 8080:9400 &
echo "Visit http://127.0.0.1:8080/metrics to use your application"
➜ VectorDBBench git:(main) ✗
➜ VectorDBBench git:(main) ✗ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
dcgm-exporter-1725446363-6xmd2 0/1 CrashLoopBackOff 6 (2m4s ago) 10m 240.16.10.17 gke-isds-genai-milvu-isds-genai-milvu-b74de53a-4xzj <none> <none>
dcgm-exporter-1725446363-8nzfd 0/1 CrashLoopBackOff 6 (2m14s ago) 10m 240.16.5.17 gke-isds-genai-milvu-isds-genai-milvu-309eed12-wwn5 <none> <none>
dcgm-exporter-1725446363-96r99 0/1 CrashLoopBackOff 6 (2m4s ago) 10m 240.16.1.7 gke-isds-genai-milvu-isds-genai-milvu-309eed12-v5gd <none> <none>
dcgm-exporter-1725446363-k7rf6 0/1 CrashLoopBackOff 6 (2m9s ago) 10m 240.16.12.7 gke-isds-genai-milvu-isds-genai-milvu-b74de53a-vmsl <none> <none>
dcgm-exporter-1725446363-kfh9n 0/1 CrashLoopBackOff 7 (79s ago) 10m 240.16.9.17 gke-isds-genai-milvu-isds-genai-milvu-b74de53a-ntsq <none> <none>
dcgm-exporter-1725446363-pnvf9 0/1 CrashLoopBackOff 6 (2m19s ago) 10m 240.16.4.16 gke-isds-genai-milvu-isds-genai-milvu-309eed12-mnzb <none> <none>
vectordbbench-deployment-86c7d569b4-pb5r7 1/1 Running 0 5h20m 240.16.3.21 gke-isds-genai-milvu-isds-genai-milvu-4f7d36a6-7768 <none> <none>
➜ VectorDBBench git:(main) ✗
➜ VectorDBBench git:(main) ✗ kubectl get nodes -o jsonpath="{range .items[*]}{.metadata.name}: {.status.allocatable}{'\n'}{end}"
gke-isds-genai-milvu-isds-genai-milvu-309eed12-mnzb: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"48418288Ki","nvidia.com/gpu":"2","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-309eed12-v5gd: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"48418288Ki","nvidia.com/gpu":"2","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-309eed12-wwn5: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"48418288Ki","nvidia.com/gpu":"2","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-3fb32019-05h3: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"60048872Ki","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-3fb32019-05jh: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"60048872Ki","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-4f7d36a6-7768: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"60048872Ki","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-4f7d36a6-s9f9: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"60048872Ki","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-b74de53a-4xzj: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"48418288Ki","nvidia.com/gpu":"2","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-b74de53a-ntsq: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"48418288Ki","nvidia.com/gpu":"2","pods":"110"}
gke-isds-genai-milvu-isds-genai-milvu-b74de53a-vmsl: {"cpu":"7910m","ephemeral-storage":"47060071478","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"48418288Ki","nvidia.com/gpu":"2","pods":"110"}
The logs from the dcgm-exporter pods:
➜ VectorDBBench git:(main) ✗ kubectl logs dcgm-exporter-1725446363-6xmd2
2024/09/04 10:43:17 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
time="2024-09-04T10:43:17Z" level=info msg="Starting dcgm-exporter"
time="2024-09-04T10:43:17Z" level=debug msg="Debug output is enabled"
time="2024-09-04T10:43:17Z" level=debug msg="Command line: /usr/bin/dcgm-exporter -f /etc/dcgm-exporter/dcp-metrics-included.csv -d f"
time="2024-09-04T10:43:17Z" level=debug msg="Loaded configuration" dump="&{CollectorsFile:/etc/dcgm-exporter/dcp-metrics-included.csv Address::9400 CollectInterval:30000 Kubernetes:true KubernetesGPUIdType:uid CollectDCP:true UseOldNamespace:false UseRemoteHE:false RemoteHEInfo:localhost:5555 GPUDevices:{Flex:true MajorRange:[] MinorRange:[]} SwitchDevices:{Flex:true MajorRange:[] MinorRange:[]} CPUDevices:{Flex:true MajorRange:[] MinorRange:[]} NoHostname:false UseFakeGPUs:false ConfigMapData:none MetricGroups:[] WebSystemdSocket:false WebConfigFile: XIDCountWindowSize:300000 ReplaceBlanksInModelName:false Debug:true ClockEventsCountWindowSize:300000 EnableDCGMLog:false DCGMLogLevel:NONE PodResourcesKubeletSocket:/var/lib/kubelet/pod-resources/kubelet.sock HPCJobMappingDir: NvidiaResourceNames:[]}"
time="2024-09-04T10:43:17Z" level=info msg="DCGM successfully initialized!"
time="2024-09-04T10:43:17Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-09-04T10:43:17Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-09-04T10:43:17Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-09-04T10:43:17Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-09-04T10:43:17Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-09-04T10:43:17Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-09-04T10:43:17Z" level=warning msg="Skipping line 24 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-09-04T10:43:17Z" level=info msg="Initializing system entities of type: GPU"
time="2024-09-04T10:43:17Z" level=info msg="Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
time="2024-09-04T10:43:17Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-09-04T10:43:17Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-09-04T10:43:17Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-09-04T10:43:17Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-09-04T10:43:17Z" level=debug msg="Counters are initialized" dump="[{FieldID:100 FieldName:DCGM_FI_DEV_SM_CLOCK PromType:gauge Help:SM clock frequency (in MHz).} {FieldID:101 FieldName:DCGM_FI_DEV_MEM_CLOCK PromType:gauge Help:Memory clock frequency (in MHz).} {FieldID:140 FieldName:DCGM_FI_DEV_MEMORY_TEMP PromType:gauge Help:Memory temperature (in C).} {FieldID:150 FieldName:DCGM_FI_DEV_GPU_TEMP PromType:gauge Help:GPU temperature (in C).} {FieldID:155 FieldName:DCGM_FI_DEV_POWER_USAGE PromType:gauge Help:Power draw (in W).} {FieldID:156 FieldName:DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION PromType:counter Help:Total energy consumption since boot (in mJ).} {FieldID:202 FieldName:DCGM_FI_DEV_PCIE_REPLAY_COUNTER PromType:counter Help:Total number of PCIe retries.} {FieldID:203 FieldName:DCGM_FI_DEV_GPU_UTIL PromType:gauge Help:GPU utilization (in %).} {FieldID:204 FieldName:DCGM_FI_DEV_MEM_COPY_UTIL PromType:gauge Help:Memory utilization (in %).} {FieldID:206 FieldName:DCGM_FI_DEV_ENC_UTIL PromType:gauge Help:Encoder utilization (in %).} {FieldID:207 FieldName:DCGM_FI_DEV_DEC_UTIL PromType:gauge Help:Decoder utilization (in %).} {FieldID:230 FieldName:DCGM_FI_DEV_XID_ERRORS PromType:gauge Help:Value of the last XID error encountered.} {FieldID:251 FieldName:DCGM_FI_DEV_FB_FREE PromType:gauge Help:Framebuffer memory free (in MiB).} {FieldID:252 FieldName:DCGM_FI_DEV_FB_USED PromType:gauge Help:Framebuffer memory used (in MiB).} {FieldID:449 FieldName:DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL PromType:counter Help:Total number of NVLink bandwidth counters for all lanes.} {FieldID:526 FieldName:DCGM_FI_DEV_VGPU_LICENSE_STATUS PromType:gauge Help:vGPU License status} {FieldID:393 FieldName:DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS PromType:counter Help:Number of remapped rows for uncorrectable errors} {FieldID:394 FieldName:DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS PromType:counter Help:Number of remapped rows for correctable errors} {FieldID:395 FieldName:DCGM_FI_DEV_ROW_REMAP_FAILURE PromType:gauge Help:Whether remapping of rows has failed} {FieldID:1 FieldName:DCGM_FI_DRIVER_VERSION PromType:label Help:Driver Version}]"
time="2024-09-04T10:43:17Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-09-04T10:43:17Z" level=info msg="Starting webserver"
time="2024-09-04T10:43:17Z" level=info msg="Pipeline starting"
time="2024-09-04T10:43:17Z" level=info msg="Listening on" address="[::]:9400"
time="2024-09-04T10:43:17Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
➜ VectorDBBench git:(main) ✗
➜ VectorDBBench git:(main) ✗
➜ VectorDBBench git:(main) ✗ kubectl describe /poddcgm-exporter-1725446363-6xmd2
error: arguments in resource/name form must have a single resource and name
➜ VectorDBBench git:(main) ✗ kubectl describe pod/dcgm-exporter-1725446363-6xmd2
Name:                 dcgm-exporter-1725446363-6xmd2
Namespace:            default
Priority:             1000000
Priority Class Name:  dcgm-exporter
Service Account:      dcgm-exporter-1725446363
Node:                 gke-isds-genai-milvu-isds-genai-milvu-b74de53a-4xzj/172.18.63.202
Start Time:           Wed, 04 Sep 2024 16:09:31 +0530
Labels:               app.kubernetes.io/component=dcgm-exporter
                      app.kubernetes.io/instance=dcgm-exporter-1725446363
                      app.kubernetes.io/name=dcgm-exporter
                      controller-revision-hash=77dfdc9d74
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: e04a790705be8c32c8d0c0ef00912eac83a1f7846a58d3cb64c7a65828296e7a
                      cni.projectcalico.org/podIP: 240.16.10.17/32
                      cni.projectcalico.org/podIPs: 240.16.10.17/32
Status:               Running
IP:                   240.16.10.17
IPs:
  IP:  240.16.10.17
Controlled By:  DaemonSet/dcgm-exporter-1725446363
Containers:
  exporter:
    Container ID:  containerd://82ffb986ed8329ea6b65f5ff017699f4835a55256c1d630a322ccf32c0a50db9
    Image:         <internal-repo>/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04
    Image ID:      <internal-repo>/nvidia/k8s/dcgm-exporter@sha256:98781424e83e14e8855aa6881e5ca8e68c81fdc75c82dd1bb3fe924349aee9d4
    Port:          9400/TCP
    Host Port:     0/TCP
    Args:
      -f
      /etc/dcgm-exporter/dcp-metrics-included.csv
      -d
      f
    State:          Running
      Started:      Wed, 04 Sep 2024 16:13:16 +0530
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 04 Sep 2024 16:12:21 +0530
      Finished:     Wed, 04 Sep 2024 16:13:16 +0530
    Ready:          False
    Restart Count:  4
    Liveness:       http-get http://:9400/health delay=45s timeout=1s period=5s #success=1 #failure=3
    Readiness:      http-get http://:9400/health delay=45s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DCGM_EXPORTER_KUBERNETES:  true
      DCGM_EXPORTER_LISTEN:      :9400
      NODE_NAME:                 (v1:spec.nodeName)
      DCGM_EXPORTER_DEBUG:       true
    Mounts:
      /var/lib/kubelet/pod-resources from pod-gpu-resources (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lt9r6 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  pod-gpu-resources:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/pod-resources
    HostPathType:
  kube-api-access-lt9r6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m31s                  default-scheduler  Successfully assigned default/dcgm-exporter-1725446363-6xmd2 to gke-isds-genai-milvu-isds-genai-milvu-b74de53a-4xzj
  Normal   Pulled     2m36s (x3 over 4m31s)  kubelet            Container image "<internal-repo>/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04" already present on machine
  Normal   Created    2m36s (x3 over 4m31s)  kubelet            Created container exporter
  Normal   Started    2m36s (x3 over 4m30s)  kubelet            Started container exporter
  Normal   Killing    2m36s (x2 over 3m31s)  kubelet            Container exporter failed liveness probe, will be restarted
  Warning  Unhealthy  2m35s                  kubelet            Readiness probe failed: Get "http://240.16.10.17:9400/health": dial tcp 240.16.10.17:9400: connect: connection refused
  Warning  Unhealthy  111s (x7 over 3m41s)   kubelet            Liveness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  111s (x6 over 3m41s)   kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
➜ VectorDBBench git:(main) ✗
The error "Cannot perform the requested operation because NVML doesn't exist on this system." tells us that something is wrong with your K8S node configuration. Did you install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)?
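If it helps, a minimal sketch of commands to verify the toolkit on a node (assuming SSH access to the node; note that GKE's COS images install drivers via Google's driver-installer DaemonSet rather than the toolkit packages, so these checks may not apply there):

# Toolkit CLI and container library present?
nvidia-ctk --version
nvidia-container-cli info

# Is an "nvidia" runtime registered with containerd?
grep -n nvidia /etc/containerd/config.toml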
Yes, I have installed the NVIDIA Container Toolkit.
I already have pods using the GPU, so NVML should be correctly installed.
➜ VectorDBBench git:(main) ✗ kubectl exec -it isds-milvus-milvus-querynode-0-54b54d6fbc-2pjmc -n isds-milvus -- /bin/bash
Defaulted container "querynode" out of: querynode, config (init)
root@isds-milvus-milvus-querynode-0-54b54d6fbc-2pjmc:/milvus# nvidia-smi
Thu Sep 5 05:36:55 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 77C P0 36W / 70W | 7273MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
root@isds-milvus-milvus-querynode-0-54b54d6fbc-2pjmc:/milvus#
One more confirmation: if I assign GPU resources to the DCGM exporter pods, they work fine.
@nvvfedorov, did you have a chance to take a look at this?
Thanks, Rohit
You may need to specify runtimeClassName: nvidia in your dcgm-exporter pod spec.
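A minimal sketch of where that field would go, assuming the cluster defines an nvidia RuntimeClass (whether the Helm chart exposes this as a value may depend on the chart version, so this shows a rendered pod spec for illustration):

apiVersion: v1
kind: Pod
metadata:
  name: dcgm-exporter-test  # hypothetical name, for illustration only
spec:
  runtimeClassName: nvidia  # run this pod with the nvidia runtime handler
  containers:
    - name: exporter
      image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04
      ports:
        - containerPort: 9400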
I'm having the same issue: GKE cluster with a V100 GPU, and DCGM exporter deployed using these Helm chart values:
extraEnv:
  - name: DCGM_EXPORTER_DEBUG
    value: "true"
  - name: DCGM_EXPORTER_INTERVAL
    value: "10000"
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
2024/10/09 16:47:56 maxprocs: Leaving GOMAXPROCS=1: CPU quota undefined
time="2024-10-09T16:47:56Z" level=info msg="Starting dcgm-exporter"
time="2024-10-09T16:47:56Z" level=debug msg="Debug output is enabled"
time="2024-10-09T16:47:56Z" level=debug msg="Command line: /usr/bin/dcgm-exporter -f /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2024-10-09T16:47:56Z" level=debug msg="Loaded configuration" dump="&{CollectorsFile:/etc/dcgm-exporter/dcp-metrics-included.csv Address::9400 CollectInterval:10000 Kubernetes:true KubernetesGPUIdType:uid CollectDCP:true UseOldNamespace:false UseRemoteHE:false RemoteHEInfo:localhost:5555 GPUDevices:{Flex:true MajorRange:[] MinorRange:[]} SwitchDevices:{Flex:true MajorRange:[] MinorRange:[]} CPUDevices:{Flex:true MajorRange:[] MinorRange:[]} NoHostname:false UseFakeGPUs:false ConfigMapData:none MetricGroups:[] WebSystemdSocket:false WebConfigFile: XIDCountWindowSize:300000 ReplaceBlanksInModelName:false Debug:true ClockEventsCountWindowSize:300000 EnableDCGMLog:false DCGMLogLevel:NONE PodResourcesKubeletSocket:/var/lib/kubelet/pod-resources/kubelet.sock HPCJobMappingDir: NvidiaResourceNames:[]}"
time="2024-10-09T16:47:56Z" level=info msg="DCGM successfully initialized!"
time="2024-10-09T16:47:56Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-10-09T16:47:56Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-10-09T16:47:56Z" level=warning msg="Skipping line 19 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-10-09T16:47:56Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-10-09T16:47:56Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-10-09T16:47:56Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-10-09T16:47:56Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-10-09T16:47:56Z" level=info msg="Initializing system entities of type: GPU"
time="2024-10-09T16:47:56Z" level=info msg="Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
time="2024-10-09T16:47:56Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-10-09T16:47:56Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-10-09T16:47:56Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-10-09T16:47:56Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-10-09T16:47:56Z" level=debug msg="Counters are initialized" dump="[{FieldID:100 FieldName:DCGM_FI_DEV_SM_CLOCK PromType:gauge Help:SM clock frequency (in MHz).} {FieldID:101 FieldName:DCGM_FI_DEV_MEM_CLOCK PromType:gauge Help:Memory clock frequency (in MHz).} {FieldID:140 FieldName:DCGM_FI_DEV_MEMORY_TEMP PromType:gauge Help:Memory temperature (in C).} {FieldID:150 FieldName:DCGM_FI_DEV_GPU_TEMP PromType:gauge Help:GPU temperature (in C).} {FieldID:155 FieldName:DCGM_FI_DEV_POWER_USAGE PromType:gauge Help:Power draw (in W).} {FieldID:156 FieldName:DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION PromType:counter Help:Total energy consumption since boot (in mJ).} {FieldID:202 FieldName:DCGM_FI_DEV_PCIE_REPLAY_COUNTER PromType:counter Help:Total number of PCIe retries.} {FieldID:203 FieldName:DCGM_FI_DEV_GPU_UTIL PromType:gauge Help:GPU utilization (in %).} {FieldID:204 FieldName:DCGM_FI_DEV_MEM_COPY_UTIL PromType:gauge Help:Memory utilization (in %).} {FieldID:206 FieldName:DCGM_FI_DEV_ENC_UTIL PromType:gauge Help:Encoder utilization (in %).} {FieldID:207 FieldName:DCGM_FI_DEV_DEC_UTIL PromType:gauge Help:Decoder utilization (in %).} {FieldID:230 FieldName:DCGM_FI_DEV_XID_ERRORS PromType:gauge Help:Value of the last XID error encountered.} {FieldID:251 FieldName:DCGM_FI_DEV_FB_FREE PromType:gauge Help:Framebuffer memory free (in MiB).} {FieldID:252 FieldName:DCGM_FI_DEV_FB_USED PromType:gauge Help:Framebuffer memory used (in MiB).} {FieldID:449 FieldName:DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL PromType:counter Help:Total number of NVLink bandwidth counters for all lanes.} {FieldID:526 FieldName:DCGM_FI_DEV_VGPU_LICENSE_STATUS PromType:gauge Help:vGPU License status} {FieldID:393 FieldName:DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS PromType:counter Help:Number of remapped rows for uncorrectable errors} {FieldID:394 FieldName:DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS PromType:counter Help:Number of remapped rows for correctable errors} {FieldID:395 FieldName:DCGM_FI_DEV_ROW_REMAP_FAILURE PromType:gauge Help:Whether remapping of rows has failed}]"
time="2024-10-09T16:47:56Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-10-09T16:47:56Z" level=info msg="Pipeline starting"
time="2024-10-09T16:47:56Z" level=info msg="Starting webserver"
time="2024-10-09T16:47:56Z" level=info msg="Listening on" address=":9400"
time="2024-10-09T16:47:56Z" level=info msg="TLS is disabled." address=":9400" http2=false
Stream closed EOF for kube-system/dcgm-exporter-lxrnc (exporter)
I ran nvidia-smi in a pod and got this output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-16GB Off | 00000000:00:05.0 Off | 0 |
| N/A 39C P0 25W / 300W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
I'm using Google's auto-provisioned driver DaemonSets, which did work for this node. A few select log statements:
nvidia-driver-installer I1009 13:32:20.104433 2531 install.go:264] Install GPU driver for device type: NVIDIA_TESLA_V100
...
nvidia-driver-installer Waiting for GPU driver libraries to be available.
nvidia-driver-installer GPU driver is installed.
...
nvidia-gpu-device-plugin I1009 13:33:24.671119 3853 metrics.go:144] nvml initialized successfully. Driver version: 535.183.01
nvidia-gpu-device-plugin I1009 13:33:24.671132 3853 devices.go:113] Found 1 GPU devices
nvidia-gpu-device-plugin I1009 13:33:24.676683 3853 devices.go:125] Found device nvidia0 for metrics collection
Can you help me figure out where to go next?
There is no nvidia RuntimeClass on my cluster.
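You can list what exists with kubectl get runtimeclass. For anyone who does need to create one, a minimal sketch (only useful if containerd on the node actually has an nvidia runtime handler registered, which the NVIDIA container toolkit normally sets up; Google's managed GPU images may not):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia  # must match a runtime handler configured in containerd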
@rohitreddy1698 @petewall not sure if this will help with your issue, but the extra Helm values in this guide proved to be the solution.
securityContext:
  privileged: true
extraHostVolumes:
  - name: vulkan-icd-mount
    hostPath: /home/kubernetes/bin/nvidia/vulkan/icd.d
  - name: nvidia-install-dir-host
    hostPath: /home/kubernetes/bin/nvidia
extraVolumeMounts:
  - name: nvidia-install-dir-host
    mountPath: /usr/local/nvidia
    readOnly: true
  - name: vulkan-icd-mount
    mountPath: /etc/vulkan/icd.d
    readOnly: true
extraEnv:
  - name: DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE
    value: device-name
My GPU node pool is running COS, and my drivers were installed manually by provisioning this DaemonSet here.
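The host mount is probably the key part: on GKE, the driver installer places the userspace driver libraries, including libnvidia-ml.so (NVML), under /home/kubernetes/bin/nvidia on the host, and mounting that at /usr/local/nvidia makes NVML visible to dcgm-exporter without requesting a GPU. A quick way to confirm from inside the exporter pod (the pod name is a placeholder, and the lib64 subdirectory is my assumption about the GKE layout):

kubectl exec -it <dcgm-exporter-pod> -- sh -c 'ls /usr/local/nvidia/lib64 | grep -i libnvidia-ml'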
@andehpants Thank you for your search skills! That article is very good. I also had to add a new priority class because system-node-critical was full. Here it is for completeness, since that's how I ended up in this thread myself.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dcgm-exporter
value: 1000
globalDefault: false
description: "Custom priority class for dcgm-exporter"
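For completeness, wiring it up would look something like this (the file name is a placeholder):

# Create the priority class, then reference it from the chart values:
kubectl apply -f dcgm-priorityclass.yaml

# values.yaml:
#   priorityClassName: dcgm-exporter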
@andehpants, @archae0pteryx Thank you for your findings! What would you suggest adding to the README file to help other users?
@nvvfedorov Not sure, really. Adjacent to the initial impetus of this issue, I would certainly add a little note about the PriorityClass and an example of how/why you might need to create one. TBH, I had never worked with priority classes nor knew of their existence... I have my CKA, even. 🙃 When I get mine working completely I may have more to add on the topic, as I still have some issues I'm trying to work through with the exporter. That said, I'm 99.9% sure it's just a misconfiguration/misunderstanding on my part at this point.
Hi @archae0pteryx, thank you for the information! I tried creating the PriorityClass before raising this issue; without it, the DCGM exporter pods could not be scheduled at all. After creating it, the pods were scheduled and started, but I was still seeing the same issue:
➜ VectorDBBench git:(main) ✗ cat dcgm_priority_class.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dcgm-exporter
value: 1000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for dcgm exporter pods.
➜ VectorDBBench git:(main) ✗
➜ VectorDBBench git:(main) ✗ cat dcgm_values.yaml
image:
  repository: docker-upstream.apple.com/nvidia/k8s/dcgm-exporter
arguments:
  - "-f"
  - /etc/dcgm-exporter/dcp-metrics-included.csv
  - "-d"
  - f
securityContext:
  capabilities:
    add:
      - SYS_ADMIN
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cloud.google.com/gke-accelerator
              operator: In
              values:
                - nvidia-tesla-t4
extraConfigMapVolumes:
  - name: exporter-metrics-volume
    configMap:
      name: exporter-metrics-config-map
      items:
        - key: metrics
          path: dcp-metrics-included.csv
extraVolumeMounts:
  - name: exporter-metrics-volume
    mountPath: /etc/dcgm-exporter/dcp-metrics-included.csv
    subPath: dcp-metrics-included.csv
priorityClassName: dcgm-exporter
➜ VectorDBBench git:(main) ✗
time="2024-09-03T10:31:11Z" level=info msg="DCGM successfully initialized!"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-09-03T10:31:12Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 19 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-09-03T10:31:12Z" level=info msg="Initializing system entities of type: GPU"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
Hi Team,
I am using dcgm-exporter, installed as a Helm chart with the default values.
I have other Milvus component pods (query node and index node) successfully scheduled, in READY state, and running on GPU nodes. Logging onto a pod and running the nvidia-smi command succeeds. But the dcgm-exporter DaemonSet pods are stuck in an error state:
➜ VectorDBBench git:(main) ✗ kubectl get pods
NAME                                       READY   STATUS    RESTARTS   AGE
dcgm-exporter-btsln                        0/1     Running   0          46s
dcgm-exporter-c8gpg                        0/1     Running   0          46s
dcgm-exporter-f9jd7                        0/1     Running   0          46s
dcgm-exporter-xhs2v                        0/1     Running   0          46s
dcgm-exporter-z4pz4                        0/1     Running   0          46s
dcgm-exporter-zh854                        0/1     Running   0          46s
vectordbbench-deployment-cf7974db6-r5scd   1/1     Running   0          41h
➜ VectorDBBench git:(main) ✗
➜ VectorDBBench git:(main) ✗ kubectl logs dcgm-exporter-btsln
2024/09/03 10:31:11 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
time="2024-09-03T10:31:11Z" level=info msg="Starting dcgm-exporter"
time="2024-09-03T10:31:11Z" level=info msg="DCGM successfully initialized!"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-09-03T10:31:12Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 19 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-09-03T10:31:12Z" level=info msg="Initializing system entities of type: GPU"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
time="2024-09-03T10:31:12Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-09-03T10:31:12Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-09-03T10:31:12Z" level=info msg="Pipeline starting"
time="2024-09-03T10:31:12Z" level=info msg="Starting webserver"
time="2024-09-03T10:31:12Z" level=info msg="Listening on" address="[::]:9400"
time="2024-09-03T10:31:12Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
But when I assign a GPU as a resource to it, the pods deploy successfully:
➜ VectorDBBench git:(main) ✗ cat custom_values.yaml
resources:
  limits:
    nvidia.com/gpu: "1"
➜ VectorDBBench git:(main) ✗
➜ VectorDBBench git:(main) ✗ kubectl get pods
NAME                                       READY   STATUS    RESTARTS   AGE
dcgm-exporter-8ds87                        1/1     Running   0          4m16s
dcgm-exporter-8qd48                        1/1     Running   0          4m16s
dcgm-exporter-d9hq7                        1/1     Running   0          4m16s
dcgm-exporter-hsbbq                        1/1     Running   0          4m16s
dcgm-exporter-t49tt                        1/1     Running   0          4m16s
dcgm-exporter-xq57b                        1/1     Running   0          4m16s
vectordbbench-deployment-cf7974db6-r5scd   1/1     Running   0          41h
➜ VectorDBBench git:(main) ✗
➜ VectorDBBench git:(main) ✗ kubectl logs dcgm-exporter-8ds87
2024/09/03 10:03:35 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
time="2024-09-03T10:03:35Z" level=info msg="Starting dcgm-exporter"
time="2024-09-03T10:03:35Z" level=info msg="DCGM successfully initialized!"
time="2024-09-03T10:03:35Z" level=info msg="Collecting DCP Metrics"
time="2024-09-03T10:03:35Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-09-03T10:03:35Z" level=info msg="Initializing system entities of type: GPU"
time="2024-09-03T10:03:35Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-09-03T10:03:35Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-09-03T10:03:35Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-09-03T10:03:35Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-09-03T10:03:35Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-09-03T10:03:35Z" level=info msg="Pipeline starting"
time="2024-09-03T10:03:35Z" level=info msg="Starting webserver"
time="2024-09-03T10:03:35Z" level=info msg="Listening on" address="[::]:9400"
time="2024-09-03T10:03:35Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
➜ VectorDBBench git:(main) ✗
But this dedicates a GPU to the exporter and blocks other services from using it.
Thanks, Rohit Mothe