NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM

can't get DCGM_EXP_XID_ERRORS_COUNT metrics #310

Open homily707 opened 7 months ago

homily707 commented 7 months ago

What is the version?

3.3.5-3.4.1-ubuntu22.04

What happened?

I'm using the latest dcgm-exporter, 3.3.5-3.4.1-ubuntu22.04, and I expose the DCGM_EXP_XID_ERRORS_COUNT metric by overriding the DCGM_EXPORTER_COLLECTORS env variable. I can see level=info msg="DCGM_EXP_XID_ERRORS_COUNT collector initialized" in the log, but when I query the metrics endpoint I can't get the DCGM_EXP_XID_ERRORS_COUNT metric. What did I do wrong?
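
This is how I check for the metric (a minimal sketch, assuming the exporter is reachable on localhost:9400, e.g. via kubectl port-forward, matching the DCGM_EXPORTER_LISTEN value in the pod spec below):

# Minimal check: fetch the exporter's /metrics endpoint and look for the
# expected metric name. Assumes localhost:9400 is reachable (e.g. via
# kubectl port-forward), matching DCGM_EXPORTER_LISTEN=:9400.
import urllib.request

with urllib.request.urlopen("http://localhost:9400/metrics") as resp:
    body = resp.read().decode()

lines = [l for l in body.splitlines() if "DCGM_EXP_XID_ERRORS_COUNT" in l]
print("\n".join(lines) if lines else "DCGM_EXP_XID_ERRORS_COUNT not exposed")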

What did you expect to happen?

To get a metric named DCGM_EXP_XID_ERRORS_COUNT.

What is the GPU model?

No response

What is the environment?

No response

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response

nvvfedorov commented 7 months ago

@homily707, what do you mean when you say "expose metric DCGM_EXP_XID_ERRORS_COUNT by override env DCGM_EXPORTER_COLLECTORS"?

Please provide steps to reproduce. We need to know your hardware.

homily707 commented 7 months ago

> @homily707, what do you mean when you say "expose metric DCGM_EXP_XID_ERRORS_COUNT by override env DCGM_EXPORTER_COLLECTORS"?
>
> Please provide steps to reproduce. We need to know your hardware.

What I mean is that I uncommented the "# DCGM_EXP_XID_ERRORS_COUNT, gauge, ..." line in the metrics CSV, and then set DCGM_EXPORTER_COLLECTORS to point to that file, so the metric should be exposed.

My pod looks like this:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-dcgm-exporter-5z8rg
  namespace: gpu-operator
spec:
  containers:
  - env:
    - name: DCGM_EXPORTER_LISTEN
      value: :9400
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_COLLECTORS
      value: /etc/dcgm-exporter/dcgm-metrics.csv
    image: <registry>/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
    imagePullPolicy: IfNotPresent
    name: nvidia-dcgm-exporter
    ports:
    - containerPort: 9400
      name: metrics
      protocol: TCP
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/kubelet/pod-resources
      name: pod-gpu-resources
      readOnly: true
    - mountPath: /etc/dcgm-exporter/dcgm-metrics.csv
      name: metrics-config
      readOnly: true
      subPath: dcgm-metrics.csv
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-j48cm
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - args:
    - until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia
      container stack to be setup; sleep 5; done
    command:
    - sh
    - -c
    image: <registry>/gpu-operator-validator:v23.3.2
    imagePullPolicy: IfNotPresent
    name: toolkit-validation
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /run/nvidia
      mountPropagation: HostToContainer
      name: run-nvidia
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-j48cm
      readOnly: true
  nodeSelector:
    nvidia.com/gpu.deploy.dcgm-exporter: "true"
  serviceAccount: nvidia-dcgm-exporter
  serviceAccountName: nvidia-dcgm-exporter
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/kubelet/pod-resources
      type: ""
    name: pod-gpu-resources
  - hostPath:
      path: /run/nvidia
      type: ""
    name: run-nvidia
  - configMap:
      defaultMode: 420
      items:
      - key: dcgm-metrics.csv
        path: dcgm-metrics.csv
      name: exporter-metrics-config-map
    name: metrics-config

The ConfigMap:

apiVersion: v1
data:
  dcgm-metrics.csv: |
    # Format
    # If line starts with a '#' it is considered a comment
    # DCGM FIELD, Prometheus metric type, help message

    # Clocks
    DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

    # Temperature
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

    # Power
    DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

    # PCIE
    # DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    # DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

    # Utilization (the sample period varies depending on the product)
    DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

    # Errors and violations
    DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
    DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
    DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
    DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
    DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
    DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
    DCGM_EXP_XID_ERRORS_COUNT,         gauge,   Count of XID Errors within user-specified time window (see xid-count-window-size param).

    # Memory usage
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

    # ECC
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
    # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

    # Retired pages
    # DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
    # DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
    # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

    # NVLink
    # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
    # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
    # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
    # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
    # DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,               counter, The number of bytes of active NVLink rx or tx data including both header and payload.

    # VGPU License status
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

    # Remapped rows
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

    # DCP metrics
    DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
    # DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
    # DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
    # DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
    DCGM_FI_PROF_PCIE_TX_BYTES,      counter, The number of bytes of active pcie tx data including both header and payload.
    DCGM_FI_PROF_PCIE_RX_BYTES,      counter, The number of bytes of active pcie rx data including both header and payload.
kind: ConfigMap
metadata:
  name: exporter-metrics-config-map
  namespace: gpu-operator

Pod log:

nvidia-dcgm-exporter 2024/04/08 10:37:52 maxprocs: Leaving GOMAXPROCS=96: CPU quota undefined
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=info msg="Starting dcgm-exporter"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=info msg="DCGM successfully initialized!"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-metrics.csv'"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=warning msg="Skipping line 26 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=warning msg="Skipping line 27 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=warning msg="Skipping line 28 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=warning msg="Skipping line 29 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=warning msg="Skipping line 30 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
nvidia-dcgm-exporter time="2024-04-08T10:37:52Z" level=info msg="Initializing system entities of type: GPU"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Kubernetes metrics collection enabled!"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Kubernetes metrics collection enabled!"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="DCGM_EXP_XID_ERRORS_COUNT collector initialized"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Pipeline starting"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Starting webserver"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="Listening on" address=":9400"
nvidia-dcgm-exporter time="2024-04-08T10:37:53Z" level=info msg="TLS is disabled." address=":9400" http2=false
Stream closed EOF for gpu-operator/nvidia-dcgm-exporter-5krwb (toolkit-validation)

GPU info: RTX 4090, CUDA 12.3, driver version 545.29.06. No other information.

nvvfedorov commented 6 months ago

@homily707, why do you expect to see XID errors? An XID error (https://docs.nvidia.com/deploy/xid-errors/index.html) means that the GPU has a problem.

homily707 commented 6 months ago

> @homily707, why do you expect to see XID errors? An XID error (https://docs.nvidia.com/deploy/xid-errors/index.html) means that the GPU has a problem.

@nvvfedorov I'm not expecting to see an XID error; I'm expecting to see the DCGM_EXP_XID_ERRORS_COUNT metric itself. My request is the same as #190, so I updated to the latest version expecting to use DCGM_EXP_XID_ERRORS_COUNT. Even if there are no XID errors, I think DCGM_EXP_XID_ERRORS_COUNT should be 0 rather than missing. But right now I don't get the DCGM_EXP_XID_ERRORS_COUNT metric at all.
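
For comparison (an illustrative sketch only, not dcgm-exporter's code): with the Prometheus Python client, a gauge that is registered up front is scraped with value 0 even before anything increments it, which is the behavior I expect here.

# Illustrative only, not dcgm-exporter code: a gauge registered at
# startup is exported as 0 even when no event has been recorded,
# instead of being absent from /metrics.
from prometheus_client import Gauge, generate_latest

xid_errors = Gauge(
    "DCGM_EXP_XID_ERRORS_COUNT",
    "Count of XID errors within a time window",
)

# The scrape output already contains: DCGM_EXP_XID_ERRORS_COUNT 0.0
print(generate_latest().decode())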

nvvfedorov commented 6 months ago

@homily707, thank you for reporting the issue. I have reproduced it and added it to the backlog.