NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

dcgm-exporter doesn't show metrics from other namespaces and pods (k8s) #363

Open hive74 opened 1 month ago

hive74 commented 1 month ago

Ask your question

Hello. There is a prometheus-server in the monitoring namespace, and I've installed dcgm-exporter 3.4.2 in the gpu-exporter namespace on a GPU node. I had to add the annotations prometheus.io/scrape, prometheus.io/port: "9400", and prometheus.io/path: "metrics" to the dcgm-exporter pod and service. The prometheus CRDs were also installed in the stack-kube namespace. In the end I do get metrics in Prometheus, for example DCGM_FI_DEV_GPU_UTIL or DCGM_FI_DEV_FB_USED, but only for one pod/service (dcgm-exporter itself). I need the metrics of the other pods on this node (I have an nvidia-smi exporter pod/service in the same namespace and various pods in other namespaces). How can I get them?

I also tried installing kube-prometheus-stack in the stack-kube namespace with serviceMonitorSelectorNilUsesHelmValues: false. That way I didn't have to create annotations for the dcgm-exporter pod/service, but again I only got metrics for the dcgm-exporter pod/service:

DCGM_FI_DEV_FB_USED{DCGM_FI_DRIVER_VERSION="550.54.15", Hostname="k8s-gpu1", app_kubernetes_io_component="dcgm-exporter", app_kubernetes_io_instance="dcgm-gpu-exporter", app_kubernetes_io_managed_by="Helm", app_kubernetes_io_name="dcgm-exporter", app_kubernetes_io_version="3.4.2", device="nvidia0", gpu="0", helm_sh_chart="dcgm-exporter-3.4.2", instance="172.16.140.27:9400", job="kubernetes-service-endpoints", modelName="NVIDIA RTX A5000", namespace="gpu-exporter", node="k8s-gpu1", service="dcgm-gpu-exporter-dcgm-exporter"}
DCGM_FI_DEV_FB_USED{DCGM_FI_DRIVER_VERSION="550.54.15", Hostname="k8s-gpu1", app_kubernetes_io_component="dcgm-exporter", app_kubernetes_io_instance="dcgm-gpu-exporter", app_kubernetes_io_name="dcgm-exporter", controller_revision_hash="85db9c866c", device="nvidia0", gpu="0", instance="172.16.140.27:9400", job="kubernetes-pods", modelName="NVIDIA RTX A5000", namespace="gpu-exporter", node="k8s-gpu1", pod="dcgm-gpu-exporter-dcgm-exporter-g54m2", pod_template_generation="7"}

dcgm-exporter values.yaml:

# Labels to be added to dcgm-exporter pods
podLabels: {}

# Annotations to be added to dcgm-exporter pods
podAnnotations: #{}
# Using this annotation which is required for prometheus scraping
  prometheus.io/scrape: "true"
  prometheus.io/port: "9400"

# The SecurityContext for the dcgm-exporter pods
podSecurityContext: {}
  # fsGroup: 2000

# The SecurityContext for the dcgm-exporter containers
securityContext:
  privileged: true
  # readOnlyRootFilesystem: true

# Defines the dcgm-exporter service
service:
  # When enabled, the helm chart will create service
  enable: true
  type: ClusterIP
  port: 9400
  address: ":9400"
  # Annotations to add to the service
  annotations: #{}
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
    prometheus.io/path: "metrics"

serviceMonitor:
  enabled: true
  interval: 15s
  honorLabels: false
  additionalLabels: {}

nvvfedorov commented 1 month ago

@hive74, try using honorLabels: true. The DCGM-Exporter and Prometheus scraper assign the same labels (namespace, pod, and container) to each metric. The honorLabels: true configuration helps to resolve the conflict.
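
In the dcgm-exporter chart values that would look roughly like this (a sketch of the serviceMonitor block from the values.yaml above):

serviceMonitor:
  enabled: true
  interval: 15s
  # Keep the namespace/pod/container labels exposed by the exporter instead of
  # letting Prometheus overwrite them with its own target labels.
  honorLabels: true
  additionalLabels: {}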

hive74 commented 1 month ago

@nvvfedorov thanks for the reply. I've tried it, but there is no effect. With prometheus-server and the annotations: nothing has changed (Prometheus shows only the dcgm-exporter pod/service). With kube-prometheus-stack and no annotations: nothing has changed (Prometheus still shows only the dcgm-exporter pod/service).

DCGM_FI_DEV_FB_USED{DCGM_FI_DRIVER_VERSION="550.54.15", Hostname="k8s-gpu1", app_kubernetes_io_component="dcgm-exporter", app_kubernetes_io_instance="dcgm-gpu-exporter", app_kubernetes_io_managed_by="Helm", app_kubernetes_io_name="dcgm-exporter", app_kubernetes_io_version="3.4.2", device="nvidia0", gpu="0", helm_sh_chart="dcgm-exporter-3.4.2", instance="172.16.140.23:9400", job="kubernetes-service-endpoints", modelName="NVIDIA RTX A5000", namespace="gpu-exporter", node="k8s-gpu1", service="dcgm-gpu-exporter-dcgm-exporter"}
DCGM_FI_DEV_FB_USED{DCGM_FI_DRIVER_VERSION="550.54.15", Hostname="k8s-gpu1", app_kubernetes_io_component="dcgm-exporter", app_kubernetes_io_instance="dcgm-gpu-exporter", app_kubernetes_io_name="dcgm-exporter", controller_revision_hash="ddc84c56b", device="nvidia0", gpu="0", instance="172.16.140.23:9400", job="kubernetes-pods", modelName="NVIDIA RTX A5000", namespace="gpu-exporter", node="k8s-gpu1", pod="dcgm-gpu-exporter-dcgm-exporter-n2r46", pod_template_generation="17"}

I tried some kube-prometheus-stack configs:

    additionalScrapeConfigs: #[]
      - job_name: gpu-metrics
        scrape_interval: 1s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
        - role: endpoints
          namespaces:
            names:
            - gpu-operator
        relabel_configs:
        - source_labels: [__meta_kubernetes_endpoints_name]
          action: drop
          regex: .*-node-feature-discovery-master
        - source_labels: [__meta_kubernetes_pod_node_name]
          action: replace
          target_label: kubernetes_node

I also tried this in dcgm-exporter:

extraEnv: #[]
  - name: "DCGM_EXPORTER_KUBERNETES"
    value: "true"
  - name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
    value: "device-name"

How does dcgm-exporter work? Is there a command, for example via nvidia-smi, to get GPU utilization per pod across all namespaces? Or does dcgm-exporter not use nvidia-smi at all?

And is the trouble on the dcgm side or the Prometheus side (i.e. that dcgm-exporter shows metrics only for its own pod/service)? Can I at least get metrics manually for pods in other namespaces?

nvvfedorov commented 1 month ago

@hive74, I think the issue lies in the Prometheus configuration. To confirm whether my assumption is correct, please run the following command within the DCGM-exporter pod: curl -v http://localhost:9400/metrics. If you see metrics with "namespace", "pod", and "container" labels referring to pods other than the DCGM-exporter itself, the DCGM-exporter works as expected and you need to check your Prometheus configuration.
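
For example, something like this (a sketch; the pod name is a placeholder):

# Run curl inside the dcgm-exporter pod and look for per-pod labels
kubectl exec -n gpu-exporter <dcgm-exporter-pod> -- \
  curl -s http://localhost:9400/metrics | grep 'pod='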

hive74 commented 1 month ago

@nvvfedorov it's good that we can check the metrics without Prometheus. kubectl get endpoints -n gpu-exporter gives:

NAME                                        ENDPOINTS            AGE
endpoints/dcgm-gpu-exporter-dcgm-exporter   172.16.140.23:9400   5d17h
endpoints/nvidia-gpu-exporter               172.16.140.36:9835   7d17h

Trying curl 172.16.140.23:9400/metrics, I get:

# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-7ad09335-aadb-19c3-e3a8",device="nvidia0",modelName="NVIDIA RTX A5000",Hostname="k8s-gpu1",DCGM_FI_DRIVER_VERSION="550.54.15"} 0
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-7ad09335-aadb-19c3-e3a8",device="nvidia0",modelName="NVIDIA RTX A5000",Hostname="k8s-gpu1",DCGM_FI_DRIVER_VERSION="550.54.15"} 21383

and there are no namespace, pod, or container labels on any of the metrics.
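
(For comparison, when pod attribution works, each sample carries extra labels, roughly like the line below; the namespace/pod/container values here are invented purely for illustration.)

DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-7ad09335-aadb-19c3-e3a8",device="nvidia0",modelName="NVIDIA RTX A5000",Hostname="k8s-gpu1",DCGM_FI_DRIVER_VERSION="550.54.15",namespace="jupyterhub",pod="example-notebook-0",container="notebook"} 21383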

My current dcgm-exporter values.yaml:

image:
  repository: nvidia/dcgm-exporter
  pullPolicy: IfNotPresent
  tag: 3.3.6-3.4.2-ubuntu22.04
# Image pull secrets for container images
imagePullSecrets: []

# Overrides the chart's name
nameOverride: ""

# Overrides the chart's computed fullname
fullnameOverride: ""

# Overrides the deployment namespace
namespaceOverride: ""

# Defines the runtime class that will be used by the pod
runtimeClassName: "nvidia"
# Defines serviceAccount names for components.
serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name:

rollingUpdate:
  # Specifies maximum number of DaemonSet pods that can be unavailable during the update
  maxUnavailable: 1
  # Specifies maximum number of nodes with an existing available DaemonSet pod that can have an updated DaemonSet pod during an update
  maxSurge: 0

# Labels to be added to dcgm-exporter pods
podLabels: {}

# Annotations to be added to dcgm-exporter pods
podAnnotations: #{}
# Using this annotation which is required for prometheus scraping
  prometheus.io/scrape: "true"
  prometheus.io/port: "9400"

# The SecurityContext for the dcgm-exporter pods
podSecurityContext: {}
  # fsGroup: 2000

# The SecurityContext for the dcgm-exporter containers
securityContext:
  privileged: true
  # readOnlyRootFilesystem: true

# Defines the dcgm-exporter service
service:
  # When enabled, the helm chart will create service
  enable: true
  type: ClusterIP
  port: 9400
  address: ":9400"
  # Annotations to add to the service
  annotations: #{}
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
    prometheus.io/path: "metrics"

# Allows to control pod resources
resources: {}
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

serviceMonitor:
  enabled: true
  interval: 15s
  honorLabels: true
  additionalLabels: {}

nodeSelector: #{}
  node.kubernetes.io/type: gpu

tolerations: #[]
  - effect: NoSchedule
    key: node-type
    operator: Equal
    value: gpu

affinity: {}
  #nodeAffinity:
  #  requiredDuringSchedulingIgnoredDuringExecution:
  #    nodeSelectorTerms:
  #    - matchExpressions:
  #      - key: nvidia-gpu
  #        operator: Exists

extraHostVolumes: []
#- name: host-binaries
#  hostPath: /opt/bin

extraConfigMapVolumes: []
#- name: exporter-metrics-volume
#  configMap:
#    name: exporter-metrics-config-map

extraVolumeMounts: []
#- name: host-binaries
#  mountPath: /opt/bin
#  readOnly: true

extraEnv: []
#  - name: "DCGM_EXPORTER_KUBERNETES"
#    value: "true"
#  - name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
#    value: "device-name"
#- name: EXTRA_VAR
#  value: "TheStringValue"

# Path to the kubelet socket for /pod-resources
#kubeletPath: "/var/lib/kubelet/pod-resources"
kubeletPath: "/opt/kubelet/pod-resources"

And the pod logs:

kubectl logs pod/dcgm-gpu-exporter-dcgm-exporter-n2r46 -n gpu-exporter
2024/07/23 03:37:51 maxprocs: Leaving GOMAXPROCS=12: CPU quota undefined
time="2024-07-23T03:37:51Z" level=info msg="Starting dcgm-exporter"
time="2024-07-23T03:37:51Z" level=info msg="DCGM successfully initialized!"
time="2024-07-23T03:37:51Z" level=info msg="Collecting DCP Metrics"
time="2024-07-23T03:37:51Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-07-23T03:37:51Z" level=info msg="Initializing system entities of type: GPU"
time="2024-07-23T03:37:51Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-07-23T03:37:51Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-07-23T03:37:51Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-07-23T03:37:51Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-07-23T03:37:51Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-07-23T03:37:51Z" level=info msg="Starting webserver"
time="2024-07-23T03:37:51Z" level=info msg="Pipeline starting"
time="2024-07-23T03:37:51Z" level=info msg="Listening on" address="[::]:9400"
time="2024-07-23T03:37:51Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false

nvvfedorov commented 1 month ago

@hive74, do you have the nvidia-device-plugin installed on your Kubernetes cluster? This component is the source of information about the namespaces, pods, and containers assigned to GPUs.

hive74 commented 1 month ago

@nvvfedorov yes, the nvidia-device-plugin is installed, in a different namespace:

kubectl get pods -A | grep "nvidia"
gpu-exporter         nvidia-gpu-exporter-c7kz7                     1/1     Running                  0                   10d
jupyterhub           release-name-nvidia-device-plugin-55ssh       1/1     Running                  6 (66d ago)         408d

Can I check the information about namespaces, pods, and containers via the nvidia-device-plugin, to confirm that the plugin works correctly?

hive74 commented 1 month ago

Additionally, I checked the helm chart via helm install --dry-run and found that dcgm-exporter creates a Role and RoleBinding. Is that OK? I thought it would need broader access, such as a ClusterRole and ClusterRoleBinding.

---
# Source: dcgm-exporter/templates/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dcgm-exporter-read-cm
  namespace: gpu-exporter
  labels:
    helm.sh/chart: dcgm-exporter-3.4.2
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-gpu-exporter
    app.kubernetes.io/version: "3.4.2"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: "dcgm-exporter"
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["exporter-metrics-config-map"]
  verbs: ["get"]
---
# Source: dcgm-exporter/templates/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dcgm-gpu-exporter-dcgm-exporter
  namespace: gpu-exporter
  labels:
    helm.sh/chart: dcgm-exporter-3.4.2
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-gpu-exporter
    app.kubernetes.io/version: "3.4.2"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: "dcgm-exporter"
subjects:
- kind: ServiceAccount
  name: dcgm-gpu-exporter-dcgm-exporter
  namespace: gpu-exporter
roleRef:
  kind: Role 
  name: dcgm-exporter-read-cm
  apiGroup: rbac.authorization.k8s.io

nvvfedorov commented 1 month ago

@hive74, if you have access to the K8S node where you run the workload, can you try building https://github.com/k8stopologyawareschedwg/podresourcesapi-tools/tree/main and running the client on that node? Unfortunately, kubectl doesn't provide commands for working with the "k8s.io/kubelet/pkg/apis/podresources/v1alpha1" API :(

hive74 commented 1 month ago

@nvvfedorov dcgm-exporter doesn't work via v1alpha1, is that what you meant?

Because when I try to deploy dcgm-exporter via helm, I get this error:

helm upgrade --cleanup-on-fail --install dcgm-gpu-exporter charts/dcgm-exporter-3.4.2.tgz -n gpu-exporter --version=0.1.0 --values values.yaml
Release "dcgm-gpu-exporter" does not exist. Installing it now.
Error: unable to build kubernetes objects from release manifest: resource mapping not found for name: "dcgm-gpu-exporter-dcgm-exporter" namespace: "gpu-exporter" from "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
ensure CRDs are installed first

To fix it, I installed prometheus-operator-crd, which only has v1alpha1.monitoring.coreos.com:

kubectl get apiservices
NAME                                   SERVICE   AVAILABLE   AGE
v1alpha1.monitoring.coreos.com         Local     True        12d

Is that the key reason? What can I do to avoid installing a prometheus-operator-crd that only has v1alpha1? Or is there a CRD for dcgm-exporter itself?
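
(One thing worth noting: the ServiceMonitor kind the chart renders lives in monitoring.coreos.com/v1, which ships with the full Prometheus Operator / kube-prometheus-stack CRD bundle, not with a v1alpha1-only package. If the operator CRDs are not wanted at all, a sketch of the alternative is to disable the ServiceMonitor in the chart values and rely on annotation-based scraping instead:)

serviceMonitor:
  # Skip rendering the ServiceMonitor so the monitoring.coreos.com/v1 CRD is not required
  enabled: false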

nvvfedorov commented 1 month ago

@hive74, to view the namespaces, pods, and containers that utilize a GPU, you must have the nvidia-device-plugin installed. To verify its installation, follow these steps:

  1. Check the logs of nvidia-device-plugin.
  2. Run the https://github.com/k8stopologyawareschedwg/podresourcesapi-tools/tree/main utility on your K8S node to see if the nvidia-device-plugin is working properly (see the sketch below).
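
A quick way to sanity-check this from the node itself (a sketch, assuming the default kubelet root directory; adjust the paths if your kubelet uses a non-default --root-dir, and the pod name is taken from the kubectl output above):

# On the GPU node: dcgm-exporter reads pod-to-GPU assignments from this socket,
# and device plugins register themselves under device-plugins/
ls -l /var/lib/kubelet/pod-resources/kubelet.sock
ls -l /var/lib/kubelet/device-plugins/

# The device-plugin logs should show the GPUs being advertised to the kubelet
kubectl logs -n jupyterhub release-name-nvidia-device-plugin-55ssh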

hive74 commented 1 month ago

I think I found the problem:

I tried changing the dcgm-exporter kubeletPath to "/var/lib/kubelet/pod-resources" and got:

time="2024-08-02T12:45:40Z" level=error msg="Failed to collect metrics; err: failed to transform metrics for transform 'podMapper'; err: failure connecting to '/var/lib/kubelet/pod-resources/kubelet.sock'; err: context deadline exceeded"

Can I configure dcgm-exporter for different device-plugin and kubelet folders? Is there any solution?
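
A sketch of how this could be tracked down, assuming node access (the idea: find the directory that actually contains kubelet.sock, since the chart appears to derive the hostPath mount from kubeletPath):

# On the GPU node: see whether the kubelet runs with a non-default --root-dir
ps aux | grep '[k]ubelet' | tr ' ' '\n' | grep -- --root-dir

# Check where the pod-resources socket really lives
ls -l /var/lib/kubelet/pod-resources/kubelet.sock /opt/kubelet/pod-resources/kubelet.sock

Then point kubeletPath in values.yaml at whichever directory actually contains kubelet.sock (as in the values.yaml above, where it is set to "/opt/kubelet/pod-resources").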