NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0
923 stars, 159 forks

DCGM_EXPORTER_KUBERNETES is enabled and the PodResources API is available, but container and namespace labels are missing from metrics #349

Closed Kevinz857 closed 4 months ago

Kevinz857 commented 4 months ago

What is the version?

3.3.5-3.4.1-ubuntu22.04

What happened?

The dcgm-exporter DaemonSet YAML is:

[image]

The dcgm-exporter pod log is:

time="2024-07-01T11:53:44Z" level=info msg="Starting dcgm-exporter"
time="2024-07-01T11:53:44Z" level=info msg="DCGM successfully initialized!"
time="2024-07-01T11:53:44Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-07-01T11:53:44Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-07-01T11:53:44Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-07-01T11:53:44Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-07-01T11:53:44Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-07-01T11:53:44Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-07-01T11:53:44Z" level=warning msg="Skipping line 24 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-07-01T11:53:44Z" level=info msg="Initializing system entities of type: GPU"
time="2024-07-01T11:53:44Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-07-01T11:53:44Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-07-01T11:53:44Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-07-01T11:53:44Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-07-01T11:53:44Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-07-01T11:53:44Z" level=info msg="Pipeline starting"
time="2024-07-01T11:53:44Z" level=info msg="Starting webserver"
time="2024-07-01T11:53:44Z" level=info msg="Listening on" address="[::]:9400"
time="2024-07-01T11:53:44Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
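For anyone triaging a similar log, here is a minimal Python sketch that splits these `key=value` log lines into fields and reports two things: whether the "Kubernetes metrics collection enabled!" message appears, and which metrics were skipped. The function names and the `SAMPLE_LOG` excerpt are illustrative and not part of dcgm-exporter. The regular expression assumes values are either double-quoted strings or bare tokens, as in the log above.

```python
import re

# key=value pairs; the value is either a double-quoted string or a bare token.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')
SKIPPED = re.compile(r"Skipping line \d+ \('([^']+)'\): metric not enabled")

# Three lines copied from the log in this issue.
SAMPLE_LOG = [
    'time="2024-07-01T11:53:44Z" level=info msg="Starting dcgm-exporter"',
    'time="2024-07-01T11:53:44Z" level=warning msg="Skipping line 20 '
    "('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled\"",
    'time="2024-07-01T11:53:44Z" level=info msg="Kubernetes metrics collection enabled!"',
]


def parse_line(line):
    """Return the key=value fields of one log line as a dict."""
    fields = {}
    for key, raw in PAIR.findall(line):
        # Strip the surrounding quotes from quoted values.
        fields[key] = raw[1:-1] if raw.startswith('"') else raw
    return fields


def summarize(lines):
    """Report whether Kubernetes collection was enabled and which metrics were skipped."""
    kubernetes_enabled = False
    skipped = []
    for line in lines:
        msg = parse_line(line).get("msg", "")
        if msg == "Kubernetes metrics collection enabled!":
            kubernetes_enabled = True
        match = SKIPPED.match(msg)
        if match:
            skipped.append(match.group(1))
    return kubernetes_enabled, skipped


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

In this issue the log confirms that Kubernetes collection was switched on, so the missing labels are not explained by the flag being ignored.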

When I request the metrics URI: [image] There are no container and namespace labels.

I also used a PodResources API client tool to confirm that kubelet.sock is available:

./client | jq

{
  "pod_resources": [
    {
      "name": "kevin-nvidia-device-plugin-daemonset-vhbks",
      "namespace": "kube-system",
      "containers": [
        { "name": "nvidia-device-plugin-ctr" }
      ]
    },
    {
      "name": "pmf-wo-bevseg-wo-tbt-ft-gradacc-worker-9",
      "namespace": "usp",
      "containers": [
        {
          "name": "usp-train",
          "devices": [
            { "resource_name": "nvidia.com/gpu", "device_ids": [ "GPU-1394ddac-506f-2af4-fa65-7c12c283f5e2" ] },
            { "resource_name": "nvidia.com/gpu", "device_ids": [ "GPU-c66f5c94-dba7-25dd-c206-89d3378efec5" ] },
            { "resource_name": "nvidia.com/gpu", "device_ids": [ "GPU-f7a791ec-fade-58f3-b1e5-0d493527b7ae" ] },
            { "resource_name": "nvidia.com/gpu", "device_ids": [ "GPU-90bd7bad-c61d-c71a-a1b3-cd1ecd79abcc" ] },
            { "resource_name": "nvidia.com/gpu", "device_ids": [ "GPU-3c5b0531-d29f-e606-63cf-1563c68d0b9c" ] },
            { "resource_name": "nvidia.com/gpu", "device_ids": [ "GPU-dc15b270-91b1-1dea-a718-5e204075c7da" ] },
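The pod, namespace and container labels have to come from an association between GPU device IDs and the pods that hold them, which is the information this client output carries. As a rough illustration only (this is not dcgm-exporter's code), the Python sketch below builds that lookup from JSON shaped like the output above. `device_to_pod` and `SAMPLE` are hypothetical names, and because the output in this issue is truncated, `SAMPLE` is a shortened copy with only two devices, closed by hand.

```python
import json

# Shortened, hand-closed copy of the client output from this issue (two devices only).
SAMPLE = """
{
  "pod_resources": [
    {
      "name": "kevin-nvidia-device-plugin-daemonset-vhbks",
      "namespace": "kube-system",
      "containers": [{"name": "nvidia-device-plugin-ctr"}]
    },
    {
      "name": "pmf-wo-bevseg-wo-tbt-ft-gradacc-worker-9",
      "namespace": "usp",
      "containers": [
        {
          "name": "usp-train",
          "devices": [
            {"resource_name": "nvidia.com/gpu",
             "device_ids": ["GPU-1394ddac-506f-2af4-fa65-7c12c283f5e2"]},
            {"resource_name": "nvidia.com/gpu",
             "device_ids": ["GPU-c66f5c94-dba7-25dd-c206-89d3378efec5"]}
          ]
        }
      ]
    }
  ]
}
"""


def device_to_pod(pod_resources_json, resource_name="nvidia.com/gpu"):
    """Map each allocated device ID to the pod, namespace and container holding it."""
    data = json.loads(pod_resources_json)
    mapping = {}
    for pod in data.get("pod_resources", []):
        for container in pod.get("containers", []):
            # Containers without GPUs have no "devices" key at all.
            for device in container.get("devices", []):
                if device.get("resource_name") != resource_name:
                    continue
                for device_id in device.get("device_ids", []):
                    mapping[device_id] = {
                        "namespace": pod["namespace"],
                        "pod": pod["name"],
                        "container": container["name"],
                    }
    return mapping


if __name__ == "__main__":
    for device_id, owner in device_to_pod(SAMPLE).items():
        print(device_id, owner)
```

If a GPU UUID that appears in the metrics output (the `UUID` label) has an entry in this mapping, the kubelet side is reporting the allocation correctly, and the missing labels would have to be explained on the exporter side.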

What did you expect to happen?

I expect the metrics to include container and namespace labels.

Like this:

DCGM_FI_DEV_MEM_COPY_UTIL{gpu="2",UUID="GPU-9bef6ffc-7cc4-805a-864e-16a6d0f95bd2",device="nvidia2",modelName="NVIDIA GeForce RTX 4090",Hostname="bare-20240628171759249-10-22-2-7",DCGM_FI_DRIVER_VERSION="535.161.07", namespace="xxx", pod="xxx"} 33
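To check scraped output for the expected labels without reading it by eye, a small Python sketch like the following can parse one sample line and list whichever labels are absent. The helper names are made up for illustration. It assumes the line carries a `{...}` label set and has no trailing timestamp, which holds for the example above.

```python
import re

# One label pair inside the braces: name="value" (value may contain escaped characters).
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# The expected sample line quoted in this issue.
EXPECTED_LINE = (
    'DCGM_FI_DEV_MEM_COPY_UTIL{gpu="2",UUID="GPU-9bef6ffc-7cc4-805a-864e-16a6d0f95bd2",'
    'device="nvidia2",modelName="NVIDIA GeForce RTX 4090",'
    'Hostname="bare-20240628171759249-10-22-2-7",DCGM_FI_DRIVER_VERSION="535.161.07", '
    'namespace="xxx", pod="xxx"} 33'
)


def parse_sample(line):
    """Split a sample line into (metric name, labels dict, value)."""
    name, _, rest = line.partition("{")
    labels_text, _, value = rest.rpartition("}")
    return name.strip(), dict(LABEL.findall(labels_text)), float(value)


def missing_labels(line, required=("namespace", "pod", "container")):
    """Return the required label names that the sample line does not carry."""
    _, labels, _ = parse_sample(line)
    return [key for key in required if key not in labels]


if __name__ == "__main__":
    print(parse_sample(EXPECTED_LINE)[0])
    print(missing_labels(EXPECTED_LINE))
```

Run against the expected line above, this reports `container` as absent, since that example only shows `namespace` and `pod`.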

What is the GPU model?

NVIDIA GeForce RTX 4090

What is the environment?

Kubernetes: 1.25
nvidia/k8s-device-plugin: 1.11

How did you deploy the dcgm-exporter and what is the configuration?

Deployed dcgm-exporter as a DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/component: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/version: 3.4.2
    helm.sh/chart: dcgm-exporter-3.4.2
  name: kevin-dcgm-exporter
  namespace: cattle-monitoring-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: kevin-dcgm-exporter
      app.kubernetes.io/instance: kevin-dcgm-exporter
      app.kubernetes.io/name: kevin-dcgm-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/component: kevin-dcgm-exporter
        app.kubernetes.io/instance: kevin-dcgm-exporter
        app.kubernetes.io/name: kevin-dcgm-exporter
    spec:
      containers:

How to reproduce the issue?

No response

Anything else we need to know?

No response

Kevinz857 commented 4 months ago

@nvvfedorov please take a look when you are free. Thanks a lot.

Kevinz857 commented 4 months ago

Already resolved.

Goorzhel commented 3 months ago

How did you resolve it, @Kevinz857? It's always good form to explain when closing a bug report.

yyang4069 commented 1 month ago

@Kevinz857 How did you resolve it?