NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

K8s Pod/namespace information in exported fields #129

Closed geoberle closed 3 years ago

geoberle commented 3 years ago

I would like to know whether dcgm-exporter can be configured in a way that the Kubernetes pod/namespace information of the pod currently using a GPU is exported along with the utilization metrics.

Let me quickly describe my motivation (maybe there is another solution to achieve my goal): I would like to track which pods in a Kubernetes cluster are using what amount of GPU resources. This could then be used for reporting/showback to different teams, capacity planning, and also optimization suggestions. In a large cluster with lots of GPUs this is challenging at times.

etherandrius commented 3 years ago

You should be using something like kube-state-metrics for that: https://github.com/kubernetes/kube-state-metrics. It exposes pod request/limit as well as node capacity/allocatable information.
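
For illustration, a minimal sketch of the kind of pod spec kube-state-metrics would report on: it surfaces the GPU request/limit (e.g. via its kube_pod_container_resource_limits metric with resource="nvidia_com_gpu"), but not how much of the GPU the pod actually used. All names and the image below are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: cuda-workload        # placeholder name
  namespace: team-a          # placeholder namespace
spec:
  containers:
  - name: trainer
    image: nvidia/cuda:11.0-base   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1          # extended resource; GPUs are requested via limits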

geoberle commented 3 years ago

I know what you mean, but as far as I can see the metrics from kube-state-metrics and the ones from the GPU monitoring are not connected to each other. So I know which pods/processes run on a node, and also which GPUs are on that node and how well or badly they were used, but I don't know which pod was using them.

Please let me know if I misunderstood you or if I'm not aware of something obvious that enables this use case for my workloads.

LRocc commented 3 years ago

> I know what you mean, but as far as I can see the metrics from kube-state-metrics and the ones from the GPU monitoring are not connected to each other. So I know which pods/processes run on a node, and also which GPUs are on that node and how well or badly they were used, but I don't know which pod was using them.
>
> Please let me know if I misunderstood you or if I'm not aware of something obvious that enables this use case for my workloads.

I have the exact same problem. So far I found that the NVIDIA GPU monitoring described in https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/ looks like it is able to pull in the metrics with GPU/pod assignment. So far I can only see it in the screenshots on this page: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#gpu-telemetry, where the first Prometheus screenshot has all the needed metrics. The question is how to implement it. From my understanding this feature is included in the latest version of gpu-monitoring-tools, so in my case it looks as simple as loading a new image onto my GPU cluster and updating Kubernetes to at least version 1.12. Hopefully someone is able to provide more details.

geoberle commented 3 years ago

It turns out dcgm-exporter can do that already; it is just disabled by default if you install it via gpu-operator. You can either install dcgm-exporter via Helm, because the chart sets everything up correctly, or modify your DaemonSet manually. Make sure you have the env variable DCGM_EXPORTER_KUBERNETES=true and a volumeMount to /var/lib/kubelet/pod-resources in your container spec:

containers:
- ...
  env:
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"
  ...
  volumeMounts:
  - mountPath: /var/lib/kubelet/pod-resources
    name: pod-gpu-resources
    readOnly: true
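
The volumeMount above needs a matching volume in the DaemonSet's pod spec. A minimal sketch, assuming the default kubelet path on the host; the volume name just has to match the one referenced in the volumeMount:

volumes:
- name: pod-gpu-resources
  hostPath:
    # kubelet pod-resources socket directory; dcgm-exporter reads it to map GPUs to pods
    path: /var/lib/kubelet/pod-resources

With this in place, the exported GPU metrics should then carry the pod/namespace information asked about in this issue.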

I'm going to close this issue and open one in the gpu-operator repo.