Open homily707 opened 7 months ago
@homily707 , What do you mean when you say: "expose metric DCGM_EXP_XID_ERRORS_COUNT by override env DCGM_EXPORTER_COLLECTORS."?
Please provide steps to reproduce. We need to know your hardware.
I mean that I uncommented the line # DCGM_EXP_XID_ERRORS_COUNT, gauge... and then set DCGM_EXPORTER_COLLECTORS to point to that file, so the metric should be exposed.
My pod looks like this:
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-dcgm-exporter-5z8rg
  namespace: gpu-operator
spec:
  containers:
  - env:
    - name: DCGM_EXPORTER_LISTEN
      value: :9400
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_COLLECTORS
      value: /etc/dcgm-exporter/dcgm-metrics.csv
    image: <registry>/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
    imagePullPolicy: IfNotPresent
    name: nvidia-dcgm-exporter
    ports:
    - containerPort: 9400
      name: metrics
      protocol: TCP
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/kubelet/pod-resources
      name: pod-gpu-resources
      readOnly: true
    - mountPath: /etc/dcgm-exporter/dcgm-metrics.csv
      name: metrics-config
      readOnly: true
      subPath: dcgm-metrics.csv
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-j48cm
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - args:
    - until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia
      container stack to be setup; sleep 5; done
    command:
    - sh
    - -c
    image: <registry>/gpu-operator-validator:v23.3.2
    imagePullPolicy: IfNotPresent
    name: toolkit-validation
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /run/nvidia
      mountPropagation: HostToContainer
      name: run-nvidia
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-j48cm
      readOnly: true
  nodeSelector:
    nvidia.com/gpu.deploy.dcgm-exporter: "true"
  serviceAccount: nvidia-dcgm-exporter
  serviceAccountName: nvidia-dcgm-exporter
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/kubelet/pod-resources
      type: ""
    name: pod-gpu-resources
  - hostPath:
      path: /run/nvidia
      type: ""
    name: run-nvidia
  - configMap:
      defaultMode: 420
      items:
      - key: dcgm-metrics.csv
        path: dcgm-metrics.csv
      name: exporter-metrics-config-map
    name: metrics-config
configmap:
apiVersion: v1
data:
  dcgm-metrics.csv: |
    # Format
    # If line starts with a '#' it is considered a comment
    # DCGM FIELD, Prometheus metric type, help message
    # Clocks
    DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
    # Temperature
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
    # Power
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
    # PCIE
    # DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    # DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
    # Utilization (the sample period varies depending on the product)
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
    # Errors and violations
    DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
    DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
    DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
    DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
    DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
    DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
    DCGM_EXP_XID_ERRORS_COUNT, gauge, Count of XID Errors within user-specified time window (see xid-count-window-size param).
    # Memory usage
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    # ECC
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
    # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
    # Retired pages
    # DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
    # DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
    # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
    # NVLink
    # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
    # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
    # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
    # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
    # DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
    # VGPU License status
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
    # Remapped rows
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
    # DCP metrics
    DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
    # DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
    # DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
    DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
    # DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
    DCGM_FI_PROF_PCIE_TX_BYTES, counter, The number of bytes of active pcie tx data including both header and payload.
    DCGM_FI_PROF_PCIE_RX_BYTES, counter, The number of bytes of active pcie rx data including both header and payload.
kind: ConfigMap
metadata:
  name: exporter-metrics-config-map
  namespace: gpu-operator
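To double-check which fields a collectors file like the one above actually enables, a small Python sketch can list every non-comment entry. It follows only the format described in the file's own header comment (a line starting with '#' is a comment; otherwise: DCGM field, metric type, help message) and is not the exporter's real parser. The function name enabled_metrics and the shortened sample are made up for illustration.

```python
def enabled_metrics(csv_text: str) -> list[str]:
    """Return the field names on non-comment, non-blank lines of a collectors file."""
    names = []
    for line in csv_text.splitlines():
        stripped = line.strip()
        # Blank lines and lines starting with '#' are treated as disabled/comments.
        if not stripped or stripped.startswith("#"):
            continue
        # First comma-separated column is the DCGM field name.
        names.append(stripped.split(",", 1)[0].strip())
    return names


# Shortened excerpt of the ConfigMap data, for illustration only.
sample_csv = """\
# Errors and violations
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
DCGM_EXP_XID_ERRORS_COUNT, gauge, Count of XID Errors within user-specified time window (see xid-count-window-size param).
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
"""

names = enabled_metrics(sample_csv)
print("DCGM_EXP_XID_ERRORS_COUNT" in names)
```

If the name shows up in this list, the file itself is not what is hiding the metric.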
Pod log:
nvidia-dcgm-exporter 2024/04/08 10:37:52 maxprocs: Leaving GOMAXPROCS=96: CPU quota undefined
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=info msg="Starting dcgm-exporter"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=info msg="DCGM successfully initialized!"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-metrics.csv'"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=warning msg="Skipping line 26 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=warning msg="Skipping line 27 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=warning msg="Skipping line 28 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=warning msg="Skipping line 29 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=warning msg="Skipping line 30 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=info msg="Initializing system entities of type: GPU"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Kubernetes metrics collection enabled!"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Kubernetes metrics collection enabled!"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="DCGM_EXP_XID_ERRORS_COUNT collector initialized"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Pipeline starting"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Starting webserver"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Listening on" address=":9400"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="TLS is disabled." address=":9400" http2=false
Stream closed EOF for gpu-operator/nvidia-dcgm-exporter-5krwb (toolkit-validation)
GPU info: RTX 4090, CUDA 12.3, driver version 545.29.06. No other information.
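For anyone comparing their own startup log against this one, here is a rough Python sketch that pulls out which metrics were skipped and which collectors reported as initialized. The regular expressions match only the message wording seen in the log above; other exporter versions may word these lines differently, and the helper names are hypothetical.

```python
import re

# Patterns copied from the message text in the pasted log; not an official log schema.
SKIP_RE = re.compile(r'''msg="Skipping line (\d+) \('([A-Z0-9_]+)'\): metric not enabled"''')
INIT_RE = re.compile(r'msg="(\w+) collector initialized"')


def skipped_metrics(log_text: str) -> list[tuple[int, str]]:
    """Return (line number, metric name) pairs for 'metric not enabled' warnings."""
    return [(int(num), name) for num, name in SKIP_RE.findall(log_text)]


def initialized_collectors(log_text: str) -> list[str]:
    """Return names from '<name> collector initialized' messages."""
    return INIT_RE.findall(log_text)


# Two lines taken from the pod log above.
sample_log = """\
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=warning msg="Skipping line 26 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="DCGM_EXP_XID_ERRORS_COUNT collector initialized"
"""

skipped = skipped_metrics(sample_log)
initialized = initialized_collectors(sample_log)
print(skipped, initialized)
```

In the log above, DCGM_EXP_XID_ERRORS_COUNT appears among the initialized collectors and not among the skipped lines.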
@homily707 , Why do you expect to see XID errors? An XID error (https://docs.nvidia.com/deploy/xid-errors/index.html) means that the GPU has a problem.
@nvvfedorov I'm not expecting to see an XID error; I'm expecting to see the DCGM_EXP_XID_ERRORS_COUNT metric itself. My request is the same as #190, so I updated to the latest version, expecting to use DCGM_EXP_XID_ERRORS_COUNT. Even if I have no XID errors, I think DCGM_EXP_XID_ERRORS_COUNT should be 0 rather than missing. But right now I don't get the DCGM_EXP_XID_ERRORS_COUNT metric at all.
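The difference between "metric is 0" and "metric is absent" can be checked mechanically against the text returned by the exporter's /metrics endpoint. The Python sketch below assumes the usual Prometheus text format with no trailing timestamps, so the last whitespace-separated token on a sample line is the value. The function name metric_samples, the gpu="0" label, and both sample payloads are hypothetical, not real exporter output.

```python
import re


def metric_samples(exposition_text: str, metric_name: str) -> list[float]:
    """Return all sample values for metric_name; an empty list means the metric is absent."""
    values = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or whitespace.
        name = re.split(r"[{\s]", line, maxsplit=1)[0]
        if name == metric_name:
            # Assumes no timestamp, so the last token is the sample value.
            values.append(float(line.rsplit(None, 1)[1]))
    return values


# Hypothetical payloads for illustration only.
absent_payload = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41
"""
zero_payload = absent_payload + 'DCGM_EXP_XID_ERRORS_COUNT{gpu="0"} 0\n'

absent = metric_samples(absent_payload, "DCGM_EXP_XID_ERRORS_COUNT")
zero = metric_samples(zero_payload, "DCGM_EXP_XID_ERRORS_COUNT")
print(absent, zero)
```

An empty list corresponds to the behavior reported here (no series at all), while a list containing 0.0 is the behavior being requested. To run it against a live pod, the text could be fetched from port 9400 (the DCGM_EXPORTER_LISTEN value above), for example through a port-forward.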
@homily707 , Thank you for reporting the issue. I reproduced it and put the item into the backlog.
What is the version?
3.3.5-3.4.1-ubuntu22.04
What happened?
I'm using the latest dcgm-exporter, 3.3.5-3.4.1-ubuntu22.04, and exposing the metric DCGM_EXP_XID_ERRORS_COUNT by overriding the env DCGM_EXPORTER_COLLECTORS. I can see this in the log: level=info msg="DCGM_EXP_XID_ERRORS_COUNT collector initialized". But when I query metrics, I can't get the DCGM_EXP_XID_ERRORS_COUNT metric. What did I do wrong?
What did you expect to happen?
Get a metric named DCGM_EXP_XID_ERRORS_COUNT.
What is the GPU model?
No response
What is the environment?
No response
How did you deploy the dcgm-exporter and what is the configuration?
No response
How to reproduce the issue?
No response
Anything else we need to know?
No response