Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

monitoring data: node_name does not exist, GPU power usage is not correct in Grafana dashboard #498

Open jiangsanyin opened 2 months ago

jiangsanyin commented 2 months ago

Environment:

Please provide an in-depth description of the question you have: (1) I installed HAMi successfully, and it works well when running vGPU tasks. From port 31993, I can get monitoring information as follows: [screenshot]

(2) I deployed dcgm-exporter by running “kubectl -n monitoring create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml”, and changed the type of svc/dcgm-exporter from ClusterIP to NodePort: [screenshot]
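For reference, a minimal sketch of that change (only the relevant fields are shown; the Service name and namespace are assumed from the manifest above and may differ in other setups):

apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  type: NodePort   # changed from ClusterIP so an out-of-cluster Prometheus can reach it via <node-ip>:<nodePort>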

(3) I deployed Prometheus 2.36.1 in binary mode and made the following configuration: [screenshot] The Targets page in Prometheus confirms that monitoring data is already being collected: [screenshot]
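A minimal sketch of what the scrape configuration in prometheus.yml might look like for this setup; the job names, node IP, and the dcgm-exporter NodePort are placeholders (31993 is the HAMi monitoring NodePort mentioned in step (1)):

scrape_configs:
  - job_name: hami
    static_configs:
      - targets: ["<node-ip>:31993"]            # HAMi monitoring endpoint from step (1)
  - job_name: dcgm-exporter
    static_configs:
      - targets: ["<node-ip>:<dcgm-nodeport>"]  # NodePort assigned to svc/dcgm-exporter in step (2)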

(4) I deployed Grafana v8.5.5 in a k8s-1.23.10 cluster and created a data source named ALL in Grafana: [screenshot]

(5) I imported a dashboard (https://github.com/Project-HAMi/HAMi/blob/master/docs/gpu-dashboard.json), but some of the data presented was inaccurate or missing. For example, "nodename" in the upper left corner has no data, and the value of "GPU power usage" is not accurate (my GPU is an NVIDIA A10, whose power usage is 150 W). [screenshot]

What do you think about this question?: (1) A friend named "凤" (Feng) on the Internet shared this dashboard with me: "https://grafana.com/grafana/dashboards/21833-hami-vgpu-dashboard/", but the problems mentioned above still exist. (2) I found that ${node_name} is used in the two dashboards mentioned above, but ${node_name} is NULL. I don't know what's going wrong; please help.

Nimbus318 commented 2 months ago

After attempting to deploy this dashboard myself, I encountered similar issues. By comparing the original metrics, I noticed the following:

  1. Missing node_name Label: The issue with the missing node_name label appears to be related to the following configuration:

    - source_labels: [__meta_kubernetes_pod_node_name]
      regex: (.*)
      target_label: node_name
      replacement: ${1}
      action: replace

    I didn't include this part of the configuration because I used ServiceMonitor directly, which means the metrics don't have the node_name label in my setup. In my environment, the default label is Hostname. To resolve this, you can follow the Add Prometheus Custom Metric Configuration section in the dashboard documentation. Alternatively, after importing the dashboard, you can modify the Variables in the Settings to match your environment.

  2. Panel Options Type Configuration: The second issue is related to the panel's Options. The query Type is set to Range, but it should be Instant. Setting it to Range aggregates all the data within the selected time frame, resulting in abnormally large values. Changing the Type to Instant should correct this and display the data as expected. [screenshot]

Hope this helps resolve the issues you're experiencing!

Nimbus318 commented 2 months ago

@jiangsanyin Regarding the Add Prometheus Custom Metric Configuration section in the dashboard, I think there may be some problems with the configuration provided.

I am using ServiceMonitor directly, so there's no need to adjust the native Prometheus scrape configuration. If your Prometheus is created via the Operator, I highly recommend using ServiceMonitor as well. Additionally, I suggest installing dcgm-exporter using the Helm Chart. You can easily configure the node_name relabeling in the Helm values file at this location: values.yaml#L109.
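As an illustration, such a relabeling could look roughly like this in the dcgm-exporter Helm values (field names follow the Prometheus Operator RelabelConfig; the exact structure around values.yaml#L109 may differ between chart versions):

serviceMonitor:
  enabled: true
  relabelings:
    - sourceLabels: [__meta_kubernetes_pod_node_name]   # node the exporter pod runs on
      targetLabel: node_name                            # label expected by the dashboard
      action: replace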

Finally, if you're unable to use ServiceMonitor, you might consider asking GPT for guidance on how to configure the native Prometheus scrape configuration to scrape metrics from a specified IP:Port and apply relabeling to obtain the node_name label : )
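For completeness, a minimal sketch of the static variant (not taken from the HAMi docs): with static_configs there is no __meta_kubernetes_pod_node_name to relabel from, so the node_name label would have to be attached to the target directly:

scrape_configs:
  - job_name: hami-metrics
    static_configs:
      - targets: ["<node-ip>:<port>"]   # placeholder for the exporter's IP:Port
        labels:
          node_name: <your-node-name>   # placeholder; set to the Kubernetes node name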

jiangsanyin commented 2 months ago

Thank you for your prompt reply! The problems in this issue came up in the course of my work; I'll read your suggestions carefully next Monday. ^_^

fangfenghuang commented 2 months ago

The HAMi metrics on ports 31992 and 31993 have no label related to node_name, so I added a node_name label for selecting GPU node metrics. The node_name label is populated from __meta_kubernetes_pod_node_name. As shown in the example, Prometheus is configured using the kubernetes_sd_configs discovery mechanism (with the Kubernetes "endpoints" role).
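A sketch of that kind of scrape job (the job name and the keep filter on the Service name are placeholders; the node_name relabeling is the same as the snippet quoted earlier in this thread):

scrape_configs:
  - job_name: hami
    kubernetes_sd_configs:
      - role: endpoints                     # discover targets from Kubernetes Endpoints objects
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: <hami-monitor-service>       # placeholder: keep only the HAMi monitoring Service endpoints
        action: keep
      - source_labels: [__meta_kubernetes_pod_node_name]
        regex: (.*)
        target_label: node_name             # expose the node name as a node_name label
        replacement: ${1}
        action: replace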

jiangsanyin commented 2 months ago

@Nimbus318 "I suggest installing dcgm-exporter using the Helm Chart. You can easily configure the node_name relabeling in the Helm values file at this location: values.yaml#L109."

This part helped me. Now ${Hostname} gives me the node name of the k8s cluster. I imported the dashboard from "https://grafana.com/grafana/dashboards/21833-hami-vgpu-dashboard/" and changed the nodename variable to Hostname; now most of the charts have data. [screenshot]

However, the following four charts still have problems: no entries related to "Device_memory_desc_of_container" have been collected into my Prometheus, and the other three charts have no data for the same reason. [screenshot]

Nimbus318 commented 2 months ago

@jiangsanyin The reason you don't see Device_memory_desc_of_container in your Prometheus metrics is that this metric is exposed by the hami-device-plugin. However, Prometheus does not have a scrape rule configured to collect these metrics.

Based on your previous response, it looks like you can use the ServiceMonitor. You can try applying the following YAML configuration:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
      - "kube-system"
  endpoints:
  - path: /metrics
    port: monitorport
    interval: "15s"
    honorLabels: false

Based on this requirement, I think we can add a configuration similar to the following in the Hami chart:

devicePlugin:
  serviceMonitor:
    enabled: true
    interval: 15s
    honorLabels: false
    additionalLabels:
    relabelings: []

This configuration will allow users to decide whether to enable the ServiceMonitor for the devicePlugin. I might discuss with the community whether this configuration is necessary.

jiangsanyin commented 2 months ago

Thanks, your reply works for me. Awesome!