jiangsanyin opened 2 months ago
After attempting to deploy this dashboard myself, I encountered similar issues. By comparing the original metrics, I noticed the following:
Missing node_name Label: The issue with the missing node_name label appears to be related to the following configuration:
- source_labels: [__meta_kubernetes_pod_node_name]
  regex: (.*)
  target_label: node_name
  replacement: ${1}
  action: replace
I didn't include this part of the configuration because I used ServiceMonitor directly, which means the metrics don't have the node_name label in my setup. In my environment, the default label is Hostname. To resolve this, you can follow the Add Prometheus Custom Metric Configuration section in the dashboard documentation. Alternatively, after importing the dashboard, you can modify the Variables in the Settings to match your environment.
Panel Options Type Configuration: The second issue is related to the panel's Options. The Type is set to Range, but it should be Instant. Setting it to Range aggregates all the data within the selected time frame, resulting in abnormally large values. Changing the Type to Instant should correct this and display the data as expected.
Hope this helps resolve the issues you're experiencing!
@jiangsanyin Regarding the Add Prometheus Custom Metric Configuration section in the dashboard, I think there may be some problems with the configuration provided.
I am using ServiceMonitor directly, so there's no need to adjust the native Prometheus scrape configuration. If your Prometheus is created via the Operator, I highly recommend using ServiceMonitor as well. Additionally, I suggest installing dcgm-exporter using the Helm Chart. You can easily configure the node_name relabeling in the Helm values file at this location: values.yaml#L109.
Finally, if you're unable to use ServiceMonitor, you might consider asking GPT for guidance on how to configure the native Prometheus scrape configuration to scrape metrics from a specified IP:Port and apply relabeling to obtain the node_name label : )
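For reference, here is a minimal sketch of what that values override could look like. It assumes the upstream dcgm-exporter chart passes serviceMonitor.relabelings straight into the generated ServiceMonitor; the field names follow the Prometheus Operator RelabelConfig spec:

serviceMonitor:
  enabled: true
  interval: 15s
  honorLabels: false
  relabelings:
    # Copy the node name of the pod behind each scraped endpoint
    # into a node_name label on every series.
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      regex: (.*)
      targetLabel: node_name
      replacement: $1
      action: replace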
Thank you for your prompt reply! The problems in this issue came up during my work; I'll read your suggestions carefully next Monday. ^_^
The HAMi metrics on ports 31992 and 31993 have no label related to node_name, so I added a node_name label for selecting GPU node metrics. The node_name label is derived from __meta_kubernetes_pod_node_name via relabeling. As shown in the example, Prometheus is configured using the kubernetes_sd_configs discovery mechanism (with the Kubernetes "endpoints" role).
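As a rough illustration of that mechanism, a minimal sketch of such a scrape job is below; the job name and the keep filter are chosen for the example and are not taken from the original configuration:

scrape_configs:
  - job_name: hami-endpoints            # illustrative job name
    kubernetes_sd_configs:
      - role: endpoints                 # discover Service endpoints in the cluster
    relabel_configs:
      # Illustrative filter: keep only endpoints in kube-system; adjust to your setup.
      - source_labels: [__meta_kubernetes_namespace]
        regex: kube-system
        action: keep
      # Attach the backing pod's node name as a node_name label.
      - source_labels: [__meta_kubernetes_pod_node_name]
        regex: (.*)
        target_label: node_name
        replacement: ${1}
        action: replace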
@Nimbus318 "I suggest installing dcgm-exporter using the Helm Chart. You can easily configure the node_name relabeling in the Helm values file at this location: values.yaml#L109."
This part helped me. Now ${Hostname} gives me the node names of the k8s cluster. I imported the dashboard from "https://grafana.com/grafana/dashboards/21833-hami-vgpu-dashboard/" and changed the nodename variable to Hostname; after that, most of the charts have data.
However, there are still problems in the following four charts: no series related to "Device_memory_desc_of_container" has been collected into my Prometheus, and the other three charts have no data for the same reason.
@jiangsanyin The reason you don't see Device_memory_desc_of_container in your Prometheus metrics is that this metric is exposed by the hami-device-plugin. However, Prometheus does not have a scrape rule configured to collect these metrics.
Based on your previous response, it looks like you can use the ServiceMonitor. You can try applying the following YAML configuration:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
      - "kube-system"
  endpoints:
    - path: /metrics
      port: monitorport
      interval: "15s"
      honorLabels: false
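Note that this ServiceMonitor only selects something if a Service exists with the app.kubernetes.io/component: hami-device-plugin label and a metrics port named monitorport. A rough sketch of such a Service is shown below; the name, selector, and port number are illustrative rather than copied from the actual HAMi manifests:

apiVersion: v1
kind: Service
metadata:
  name: hami-device-plugin-monitor                      # illustrative name
  namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-device-plugin     # matched by spec.selector above
spec:
  selector:
    app.kubernetes.io/component: hami-device-plugin     # illustrative pod selector
  ports:
    - name: monitorport                                 # matched by the endpoint port name above
      port: 31992                                       # illustrative; use the device plugin's metrics port
      targetPort: 31992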
Based on this requirement, I think we can add a configuration similar to the following in the Hami chart:
devicePlugin:
  serviceMonitor:
    enabled: true
    interval: 15s
    honorLabels: false
    additionalLabels: {}
    relabelings: []
This configuration will allow users to decide whether to enable the ServiceMonitor for the devicePlugin. I might discuss with the community whether this configuration is necessary.
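If that lands, the chart would presumably gate a ServiceMonitor template on those values. Purely to illustrate the idea (this is not the actual HAMi chart template; the structure is a guess):

{{- if .Values.devicePlugin.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-svc-monitor
  namespace: {{ .Release.Namespace }}
  {{- with .Values.devicePlugin.serviceMonitor.additionalLabels }}
  labels:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
      - {{ .Release.Namespace }}
  endpoints:
    - path: /metrics
      port: monitorport
      interval: {{ .Values.devicePlugin.serviceMonitor.interval }}
      honorLabels: {{ .Values.devicePlugin.serviceMonitor.honorLabels }}
      {{- with .Values.devicePlugin.serviceMonitor.relabelings }}
      relabelings:
        {{- toYaml . | nindent 8 }}
      {{- end }}
{{- end }}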
Thanks, your reply works for me. Awesome!
Environment:
Please provide an in-depth description of the question you have: (1) I installed HAMi successfully, and it works well when running vGPU tasks. From port 31993, I can get monitoring information as follows:
(2) I deployed dcgm-exporter by running "kubectl -n monitoring create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml", and changed the type of svc/dcgm-exporter from ClusterIP to NodePort:
(3) I deployed Prometheus 2.36.1 in binary mode and made the following configuration. The Targets page in Prometheus shows that Prometheus has already collected the monitoring data:
(4) I deployed Grafana v8.5.5 in a k8s 1.23.10 cluster and created a data source named ALL in Grafana:
(5) I imported a dashboard (https://github.com/Project-HAMi/HAMi/blob/master/docs/gpu-dashboard.json), but some of the data presented was inaccurate or missing. For example, "nodename" in the upper left corner has no data, and the value of "GPU power usage" is inaccurate (my GPU is an NVIDIA A10, whose maximum power usage is 150 W).
What do you think about this question?: (1) A friend named "凤" on the Internet shared this dashboard with me: "https://grafana.com/grafana/dashboards/21833-hami-vgpu-dashboard/", but the problems mentioned above still exist. (2) I found that ${node_name} is used in the two dashboards mentioned above, but ${node_name} is null. I don't know what's going wrong; please help.
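For the binary-mode Prometheus setup described in (3), here is a minimal sketch of a scrape job that scrapes the HAMi/dcgm-exporter NodePorts directly and attaches a node_name label. The job name, addresses, and label values below are placeholders, not the actual configuration from this environment:

scrape_configs:
  # Placeholder job: scrape a fixed list of node IP:NodePort targets and
  # attach node_name statically, since no Kubernetes service discovery
  # metadata is available to relabel from in this mode.
  - job_name: hami-nodeport            # placeholder name
    static_configs:
      - targets: ["192.168.0.10:31992", "192.168.0.10:31993"]   # placeholder node IP and HAMi NodePorts
        labels:
          node_name: gpu-node-1        # placeholder; set to the node's hostname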