What is the version?
3.3.5-3.4.1-ubuntu22.04
What happened?
The dcgm-exporter DaemonSet YAML is included in the deployment section below.
The dcgm-exporter pod log is:
time="2024-07-01T11:53:44Z" level=info msg="Starting dcgm-exporter"
time="2024-07-01T11:53:44Z" level=info msg="DCGM successfully initialized!"
time="2024-07-01T11:53:44Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-07-01T11:53:44Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-07-01T11:53:44Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-07-01T11:53:44Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-07-01T11:53:44Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-07-01T11:53:44Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-07-01T11:53:44Z" level=warning msg="Skipping line 24 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-07-01T11:53:44Z" level=info msg="Initializing system entities of type: GPU"
time="2024-07-01T11:53:44Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-07-01T11:53:44Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-07-01T11:53:44Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-07-01T11:53:44Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-07-01T11:53:44Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-07-01T11:53:44Z" level=info msg="Pipeline starting"
time="2024-07-01T11:53:44Z" level=info msg="Starting webserver"
time="2024-07-01T11:53:44Z" level=info msg="Listening on" address="[::]:9400"
time="2024-07-01T11:53:44Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
When I request the metrics URI, the returned metrics have no container or namespace labels.
I used the podrequestapi client tool to confirm that kubelet.sock is available and returns the GPU assignments:
./client | jq
{
  "pod_resources": [
    {
      "name": "kevin-nvidia-device-plugin-daemonset-vhbks",
      "namespace": "kube-system",
      "containers": [
        { "name": "nvidia-device-plugin-ctr" }
      ]
    },
    {
      "name": "pmf-wo-bevseg-wo-tbt-ft-gradacc-worker-9",
      "namespace": "usp",
      "containers": [
        {
          "name": "usp-train",
          "devices": [
            { "resource_name": "nvidia.com/gpu", "device_ids": [ "GPU-1394ddac-506f-2af4-fa65-7c12c283f5e2" ] },
            { "resource_name": "nvidia.com/gpu", "device_ids": [ "GPU-c66f5c94-dba7-25dd-c206-89d3378efec5" ] },
            { "resource_name": "nvidia.com/gpu", "device_ids": [ "GPU-f7a791ec-fade-58f3-b1e5-0d493527b7ae" ] },
            { "resource_name": "nvidia.com/gpu", "device_ids": [ "GPU-90bd7bad-c61d-c71a-a1b3-cd1ecd79abcc" ] },
            { "resource_name": "nvidia.com/gpu", "device_ids": [ "GPU-3c5b0531-d29f-e606-63cf-1563c68d0b9c" ] },
            { "resource_name": "nvidia.com/gpu", "device_ids": [ "GPU-dc15b270-91b1-1dea-a718-5e204075c7da" ] },
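To make the expected association concrete, here is a minimal Python sketch of how a GPU UUID could be mapped to its namespace, pod, and container from output shaped like the listing above. It is only an illustration and not dcgm-exporter's actual implementation; the function name `map_gpu_to_pod` and the trimmed `sample` dictionary (two of the devices from the listing) are hypothetical.

```python
from typing import Dict, Tuple


def map_gpu_to_pod(listing: dict) -> Dict[str, Tuple[str, str, str]]:
    """Map each GPU UUID to (namespace, pod, container).

    Assumes `listing` has the same shape as the client output above.
    """
    mapping: Dict[str, Tuple[str, str, str]] = {}
    for pod in listing.get("pod_resources", []):
        for container in pod.get("containers", []):
            # Containers without GPUs have no "devices" key at all.
            for device in container.get("devices", []):
                if device.get("resource_name") != "nvidia.com/gpu":
                    continue
                for uuid in device.get("device_ids", []):
                    mapping[uuid] = (pod["namespace"], pod["name"], container["name"])
    return mapping


# Hypothetical sample, trimmed from the listing above.
sample = {
    "pod_resources": [
        {
            "name": "kevin-nvidia-device-plugin-daemonset-vhbks",
            "namespace": "kube-system",
            "containers": [{"name": "nvidia-device-plugin-ctr"}],
        },
        {
            "name": "pmf-wo-bevseg-wo-tbt-ft-gradacc-worker-9",
            "namespace": "usp",
            "containers": [
                {
                    "name": "usp-train",
                    "devices": [
                        {
                            "resource_name": "nvidia.com/gpu",
                            "device_ids": ["GPU-1394ddac-506f-2af4-fa65-7c12c283f5e2"],
                        },
                        {
                            "resource_name": "nvidia.com/gpu",
                            "device_ids": ["GPU-c66f5c94-dba7-25dd-c206-89d3378efec5"],
                        },
                    ],
                }
            ],
        },
    ]
}

mapping = map_gpu_to_pod(sample)
for uuid, (namespace, pod_name, container_name) in mapping.items():
    print(uuid, namespace, pod_name, container_name)
```

Since the kubelet already reports these assignments, the information needed for the labels appears to be available on the node.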
What did you expect to happen?
I expect the metrics to include the container and namespace labels, like this:
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="2",UUID="GPU-9bef6ffc-7cc4-805a-864e-16a6d0f95bd2",device="nvidia2",modelName="NVIDIA GeForce RTX 4090",Hostname="bare-20240628171759249-10-22-2-7",DCGM_FI_DRIVER_VERSION="535.161.07", namespace="xxx", pod="xxx"} 33
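A quick way to check scraped output for the missing labels is sketched below in Python. The helper names `parse_labels` and `missing_labels` are hypothetical, and the parsing is deliberately simple (it assumes label values contain no escaped quotes). The `observed` line is not real output; it is the expected line with the Kubernetes labels stripped, to mimic what is reported here.

```python
import re
from typing import Dict, List


def parse_labels(metric_line: str) -> Dict[str, str]:
    """Extract label name/value pairs from one Prometheus text-format line."""
    match = re.search(r"\{(.*)\}", metric_line)
    if match is None:
        return {}
    # Simplification: label values are assumed to contain no escaped quotes.
    return dict(re.findall(r'(\w+)="([^"]*)"', match.group(1)))


def missing_labels(metric_line: str, required: List[str]) -> List[str]:
    """Return the required label names that are absent from the line."""
    labels = parse_labels(metric_line)
    return [name for name in required if name not in labels]


expected = (
    'DCGM_FI_DEV_MEM_COPY_UTIL{gpu="2",UUID="GPU-9bef6ffc-7cc4-805a-864e-16a6d0f95bd2",'
    'device="nvidia2",modelName="NVIDIA GeForce RTX 4090",'
    'Hostname="bare-20240628171759249-10-22-2-7",DCGM_FI_DRIVER_VERSION="535.161.07", '
    'namespace="xxx", pod="xxx"} 33'
)
# Hypothetical line: the expected line without the Kubernetes labels.
observed = (
    'DCGM_FI_DEV_MEM_COPY_UTIL{gpu="2",UUID="GPU-9bef6ffc-7cc4-805a-864e-16a6d0f95bd2",'
    'device="nvidia2",modelName="NVIDIA GeForce RTX 4090",'
    'Hostname="bare-20240628171759249-10-22-2-7",DCGM_FI_DRIVER_VERSION="535.161.07"} 33'
)

print(missing_labels(expected, ["namespace", "pod"]))
print(missing_labels(observed, ["namespace", "pod"]))
```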
What is the GPU model?
NVIDIA GeForce RTX 4090
What is the environment?
Kubernetes: 1.25
nvidia/k8s-device-plugin: 1.11
How did you deploy the dcgm-exporter and what is the configuration?
I deployed dcgm-exporter as a DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/component: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/version: 3.4.2
    helm.sh/chart: dcgm-exporter-3.4.2
  name: kevin-dcgm-exporter
  namespace: cattle-monitoring-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: kevin-dcgm-exporter
      app.kubernetes.io/instance: kevin-dcgm-exporter
      app.kubernetes.io/name: kevin-dcgm-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/component: kevin-dcgm-exporter
        app.kubernetes.io/instance: kevin-dcgm-exporter
        app.kubernetes.io/name: kevin-dcgm-exporter
    spec:
      containers:
How to reproduce the issue?
No response
Anything else we need to know?
No response