hive74 opened this issue 1 month ago
@hive74, try using `honorLabels: true`. The DCGM-Exporter and the Prometheus scraper assign the same labels (namespace, pod, and container) to each metric; the `honorLabels: true` setting resolves the conflict.
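For context, a minimal sketch of where this setting lands, both in the chart's ServiceMonitor values and in a raw Prometheus scrape config (the job name is illustrative; the target address is the endpoint seen later in this thread):

```yaml
# dcgm-exporter chart values (rendered into a ServiceMonitor):
serviceMonitor:
  enabled: true
  honorLabels: true   # keep the exporter's namespace/pod/container labels on conflict

# equivalent knob in a plain Prometheus scrape config:
scrape_configs:
  - job_name: gpu-metrics          # illustrative
    honor_labels: true             # without this, conflicting labels are renamed exported_namespace, exported_pod, ...
    static_configs:
      - targets: ["172.16.140.23:9400"]
```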
@nvvfedorov thanks for the reply.
I've tried it, but there is no effect:

- `prometheus-server` with annotations: nothing has changed (Prometheus shows only the dcgm-exporter's own pod/service)
- `kube-prometheus-stack` without annotations: nothing has changed (Prometheus shows only the dcgm-exporter's own pod/service)

These are the only series I get (both are the exporter itself):
```
DCGM_FI_DEV_FB_USED{DCGM_FI_DRIVER_VERSION="550.54.15", Hostname="k8s-gpu1", app_kubernetes_io_component="dcgm-exporter", app_kubernetes_io_instance="dcgm-gpu-exporter", app_kubernetes_io_managed_by="Helm", app_kubernetes_io_name="dcgm-exporter", app_kubernetes_io_version="3.4.2", device="nvidia0", gpu="0", helm_sh_chart="dcgm-exporter-3.4.2", instance="172.16.140.23:9400", job="kubernetes-service-endpoints", modelName="NVIDIA RTX A5000", namespace="gpu-exporter", node="k8s-gpu1", service="dcgm-gpu-exporter-dcgm-exporter"}
DCGM_FI_DEV_FB_USED{DCGM_FI_DRIVER_VERSION="550.54.15", Hostname="k8s-gpu1", app_kubernetes_io_component="dcgm-exporter", app_kubernetes_io_instance="dcgm-gpu-exporter", app_kubernetes_io_name="dcgm-exporter", controller_revision_hash="ddc84c56b", device="nvidia0", gpu="0", instance="172.16.140.23:9400", job="kubernetes-pods", modelName="NVIDIA RTX A5000", namespace="gpu-exporter", node="k8s-gpu1", pod="dcgm-gpu-exporter-dcgm-exporter-n2r46", pod_template_generation="17"}
```
I also tried some `kube-prometheus-stack` configs:

```yaml
additionalScrapeConfigs: #[]
  - job_name: gpu-metrics
    scrape_interval: 1s
    metrics_path: /metrics
    scheme: http
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - gpu-operator
    relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: drop
        regex: .*-node-feature-discovery-master
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: kubernetes_node
```
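For comparison, a hedged variant of the same job: the namespace name below matches where the exporter actually runs in this thread (`gpu-exporter`, not `gpu-operator`), and `honor_labels` preserves the labels the exporter attaches itself:

```yaml
additionalScrapeConfigs:
  - job_name: gpu-metrics
    honor_labels: true         # keep namespace/pod/container labels set by the exporter
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - gpu-exporter     # namespace of dcgm-exporter in this thread
```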
I also tried this in the dcgm-exporter values:

```yaml
extraEnv: #[]
  - name: "DCGM_EXPORTER_KUBERNETES"
    value: "true"
  - name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
    value: "device-name"
```
How does dcgm-exporter work? Is there a command via `nvidia-smi`, for example, to get GPU utilization per pod across all namespaces, or does dcgm-exporter not use `nvidia-smi` at all?

And is the trouble on the dcgm side or the Prometheus side (i.e., that dcgm-exporter shows metrics only for its own pod/service)? Can I at least get metrics manually for pods in other namespaces?
@hive74, I think the issue lies in the Prometheus configuration. To confirm whether my assumption is correct, please run the following command within the DCGM-exporter pod:

```shell
curl -v http://localhost:9400/metrics
```

If you see metrics with `namespace`, `pod`, and `container` labels other than DCGM-exporter's own, the DCGM-exporter works as expected and you need to check your Prometheus configuration.
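The same check can be run without opening a shell in the pod, a sketch assuming `curl` is available in the exporter image and using the pod name from this thread; the `grep` runs locally after the pipe:

```shell
kubectl exec -n gpu-exporter dcgm-gpu-exporter-dcgm-exporter-n2r46 -- \
  curl -s http://localhost:9400/metrics | grep -E 'namespace=|pod=|container='
```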
@nvvfedorov It's good that we can check metrics without Prometheus, but:

```shell
kubectl get endpoints -n gpu-exporter
NAME                                        ENDPOINTS            AGE
endpoints/dcgm-gpu-exporter-dcgm-exporter   172.16.140.23:9400   5d17h
endpoints/nvidia-gpu-exporter               172.16.140.36:9835   7d17h
```

trying

```shell
curl 172.16.140.23:9400/metrics
```

I get:
```
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-7ad09335-aadb-19c3-e3a8",device="nvidia0",modelName="NVIDIA RTX A5000",Hostname="k8s-gpu1",DCGM_FI_DRIVER_VERSION="550.54.15"} 0
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-7ad09335-aadb-19c3-e3a8",device="nvidia0",modelName="NVIDIA RTX A5000",Hostname="k8s-gpu1",DCGM_FI_DRIVER_VERSION="550.54.15"} 21383
```
and there are no `namespace`, `pod`, or `container` labels in any of the metrics.
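For contrast, when the pod-resources mapping works, the same metric would carry pod attribution roughly like this (a hedged illustration; the `namespace`, `pod`, and `container` values here are invented):

```
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-7ad09335-aadb-19c3-e3a8",device="nvidia0",modelName="NVIDIA RTX A5000",Hostname="k8s-gpu1",DCGM_FI_DRIVER_VERSION="550.54.15",namespace="jupyterhub",pod="some-user-notebook",container="notebook"} 21383
```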
My current dcgm-exporter values.yaml:

```yaml
image:
  repository: nvidia/dcgm-exporter
  pullPolicy: IfNotPresent
  tag: 3.3.6-3.4.2-ubuntu22.04

# Image pull secrets for container images
imagePullSecrets: []

# Overrides the chart's name
nameOverride: ""

# Overrides the chart's computed fullname
fullnameOverride: ""

# Overrides the deployment namespace
namespaceOverride: ""

# Defines the runtime class that will be used by the pod
runtimeClassName: "nvidia"

# Defines serviceAccount names for components.
serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name:

rollingUpdate:
  # Specifies maximum number of DaemonSet pods that can be unavailable during the update
  maxUnavailable: 1
  # Specifies maximum number of nodes with an existing available DaemonSet pod that can have an updated DaemonSet pod during an update
  maxSurge: 0

# Labels to be added to dcgm-exporter pods
podLabels: {}

# Annotations to be added to dcgm-exporter pods
podAnnotations: #{}
  # Using this annotation which is required for prometheus scraping
  prometheus.io/scrape: "true"
  prometheus.io/port: "9400"

# The SecurityContext for the dcgm-exporter pods
podSecurityContext: {}
  # fsGroup: 2000

# The SecurityContext for the dcgm-exporter containers
securityContext:
  privileged: true
  # readOnlyRootFilesystem: true

# Defines the dcgm-exporter service
service:
  # When enabled, the helm chart will create service
  enable: true
  type: ClusterIP
  port: 9400
  address: ":9400"
  # Annotations to add to the service
  annotations: #{}
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
    prometheus.io/path: "metrics"

# Allows to control pod resources
resources: {}
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

serviceMonitor:
  enabled: true
  interval: 15s
  honorLabels: true
  additionalLabels: {}

nodeSelector: #{}
  node.kubernetes.io/type: gpu

tolerations: #[]
  - effect: NoSchedule
    key: node-type
    operator: Equal
    value: gpu

affinity: {}
  #nodeAffinity:
  #  requiredDuringSchedulingIgnoredDuringExecution:
  #    nodeSelectorTerms:
  #      - matchExpressions:
  #          - key: nvidia-gpu
  #            operator: Exists

extraHostVolumes: []
#- name: host-binaries
#  hostPath: /opt/bin

extraConfigMapVolumes: []
#- name: exporter-metrics-volume
#  configMap:
#    name: exporter-metrics-config-map

extraVolumeMounts: []
#- name: host-binaries
#  mountPath: /opt/bin
#  readOnly: true

extraEnv: []
# - name: "DCGM_EXPORTER_KUBERNETES"
#   value: "true"
# - name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
#   value: "device-name"
#- name: EXTRA_VAR
#  value: "TheStringValue"

# Path to the kubelet socket for /pod-resources
#kubeletPath: "/var/lib/kubelet/pod-resources"
kubeletPath: "/opt/kubelet/pod-resources"
```
And the pod logs:

```shell
kubectl logs pod/dcgm-gpu-exporter-dcgm-exporter-n2r46 -n gpu-exporter
2024/07/23 03:37:51 maxprocs: Leaving GOMAXPROCS=12: CPU quota undefined
time="2024-07-23T03:37:51Z" level=info msg="Starting dcgm-exporter"
time="2024-07-23T03:37:51Z" level=info msg="DCGM successfully initialized!"
time="2024-07-23T03:37:51Z" level=info msg="Collecting DCP Metrics"
time="2024-07-23T03:37:51Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-07-23T03:37:51Z" level=info msg="Initializing system entities of type: GPU"
time="2024-07-23T03:37:51Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-07-23T03:37:51Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-07-23T03:37:51Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-07-23T03:37:51Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-07-23T03:37:51Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-07-23T03:37:51Z" level=info msg="Starting webserver"
time="2024-07-23T03:37:51Z" level=info msg="Pipeline starting"
time="2024-07-23T03:37:51Z" level=info msg="Listening on" address="[::]:9400"
time="2024-07-23T03:37:51Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
```
@hive74, do you have nvidia-device-plugin installed on your Kubernetes cluster? This component is the source of information about the namespaces, pods, and containers assigned to GPUs.
@nvvfedorov
Yes, `nvidia-device-plugin` is installed (in another namespace):

```shell
kubectl get pods -A | grep "nvidia"
gpu-exporter   nvidia-gpu-exporter-c7kz7                 1/1   Running   0             10d
jupyterhub     release-name-nvidia-device-plugin-55ssh   1/1   Running   6 (66d ago)   408d
```

Can I check the information about namespaces, pods, and containers via `nvidia-device-plugin`, to confirm that the plugin works correctly?
Additionally, I checked the helm chart via `helm install --dry-run` and found that dcgm-exporter creates a `Role` and a `RoleBinding`. Is that OK? I thought it would need broader access, like a `ClusterRole` and a `ClusterRoleBinding`:
```yaml
---
# Source: dcgm-exporter/templates/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dcgm-exporter-read-cm
  namespace: gpu-exporter
  labels:
    helm.sh/chart: dcgm-exporter-3.4.2
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-gpu-exporter
    app.kubernetes.io/version: "3.4.2"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: "dcgm-exporter"
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["exporter-metrics-config-map"]
    verbs: ["get"]
---
# Source: dcgm-exporter/templates/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dcgm-gpu-exporter-dcgm-exporter
  namespace: gpu-exporter
  labels:
    helm.sh/chart: dcgm-exporter-3.4.2
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-gpu-exporter
    app.kubernetes.io/version: "3.4.2"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: "dcgm-exporter"
subjects:
  - kind: ServiceAccount
    name: dcgm-gpu-exporter-dcgm-exporter
    namespace: gpu-exporter
roleRef:
  kind: Role
  name: dcgm-exporter-read-cm
  apiGroup: rbac.authorization.k8s.io
```
@hive74, if you have access to the K8S node where you run the workload, can you try to build https://github.com/k8stopologyawareschedwg/podresourcesapi-tools/tree/main and run the client on the K8S node? Unfortunately, kubectl doesn't provide commands to work with the `k8s.io/kubelet/pkg/apis/podresources/v1alpha1` API :(
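A rough way to do that, assuming a standard Go layout in that repo (the resulting binary names may differ, so treat this as a sketch):

```shell
git clone https://github.com/k8stopologyawareschedwg/podresourcesapi-tools
cd podresourcesapi-tools
go build ./...
# then run the resulting client on the GPU node, pointing it at the kubelet
# pod-resources socket (in this thread: /opt/kubelet/pod-resources/kubelet.sock)
```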
@nvvfedorov
dcgm-exporter doesn't work via `v1alpha1`, did you mean that? I ask because when I try to deploy dcgm-exporter via helm, I get an error:

```shell
helm upgrade --cleanup-on-fail --install dcgm-gpu-exporter charts/dcgm-exporter-3.4.2.tgz -n gpu-exporter --version=0.1.0 --values values.yaml
Release "dcgm-gpu-exporter" does not exist. Installing it now.
Error: unable to build kubernetes objects from release manifest: resource mapping not found for name: "dcgm-gpu-exporter-dcgm-exporter" namespace: "gpu-exporter" from "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
ensure CRDs are installed first
```
To fix it, I installed prometheus-operator-crd, which just has `v1alpha1.monitoring.coreos.com`:

```shell
kubectl get apiservices
NAME                             SERVICE   AVAILABLE   AGE
v1alpha1.monitoring.coreos.com   Local     True        12d
```

Is this the key reason? What can I do to avoid installing prometheus-operator-crd, which has `v1alpha1`? Or is there a CRD for dcgm-exporter itself?
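One hedged way to sidestep the CRD requirement, given that the chart's own values (shown above) expose the switch, is to disable the ServiceMonitor and rely on the scrape annotations instead:

```yaml
serviceMonitor:
  enabled: false   # the chart then skips the ServiceMonitor object, so no monitoring.coreos.com CRD is needed
```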
@hive74, to view namespaces, pods, and containers that utilize a GPU, you must have the nvidia-device-plugin installed. To verify its installation, follow these steps:
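A minimal check along these lines, assuming the plugin registers the default `nvidia.com/gpu` resource name and using the node name from this thread:

```shell
# the device plugin should make GPUs show up as allocatable on the node
kubectl get node k8s-gpu1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
# and its pod should be Running
kubectl get pods -A | grep nvidia-device-plugin
```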
I think I found the problem:

- `kubelet` is in `/opt/kubelet`
- `device-plugin` is in `/var/lib/kubelet` (including `nvidia-gpu.sock`), which is its default plugin directory
- `dcgm-exporter` is in `/opt/kubelet`

I tried to change the dcgm `kubeletPath` to `"/var/lib/kubelet/pod-resources"` and got:

```
time="2024-08-02T12:45:40Z" level=error msg="Failed to collect metrics; err: failed to transform metrics for transform 'podMapper'; err: failure connecting to '/var/lib/kubelet/pod-resources/kubelet.sock'; err: context deadline exceeded"
```

Can I configure dcgm-exporter for different plugin and kubelet folders? Is there any possible solution?
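A quick way to confirm which directory kubelet actually serves the pod-resources socket from (a sketch; run on the GPU node, as root if needed):

```shell
ls -l /opt/kubelet/pod-resources /var/lib/kubelet/pod-resources 2>/dev/null
# kubelet holds the listening unix socket, so this should show the live one:
ss -xlp 2>/dev/null | grep pod-resources
```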
Ask your question

Hello, there is `prometheus-server` in namespace `monitoring`. I've installed `dcgm-exporter 3.4.2` in namespace `gpu-exporter` on a GPU node. I needed to add the annotations `prometheus.io/scrape`, `prometheus.io/port: "9400"`, and `prometheus.io/path: "metrics"` to the dcgm pod and service. Also, `prometheus-crd` was installed in ns `stack-kube`. In the end I do get metrics in Prometheus, for example `DCGM_FI_DEV_GPU_UTIL` or `DCGM_FI_DEV_FB_USED`, but only for one pod/service (the dcgm-exporter itself). I need to get metrics for the other pods on this node (I have an nvidia-smi exporter pod/service in the same namespace and various pods in other namespaces). How can I get them?

I tried to install `kube-prometheus-stack` in namespace `stack-kube` with `serviceMonitorSelectorNilUsesHelmValues: false`; I didn't have to create annotations for the dcgm-exporter pod/service, but again I get metrics for only the dcgm-exporter pod/service.