I'm facing an issue while configuring dcgm-exporter from gpu-operator. I have two Kubernetes clusters: one runs GPU jobs, and the other is used to manage the first. Prometheus is not installed on the GPU cluster, to keep CPU and memory usage there as low as possible, so I'm trying to collect metrics with Prometheus running on the other cluster.
I'd like to enable hostNetwork for dcgm-exporter so that metrics can be scraped from each node, but I can't find where to set it in the gpu-operator Helm chart. (As I remember, this is useful when Prometheus is deployed outside the Kubernetes cluster.)
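For context, here is a rough sketch of what the scrape configuration on the management cluster's Prometheus might look like once the exporter is reachable on each node's IP. The job name, node IPs, and cluster label are placeholders, and port 9400 is assumed to be the exporter's listen port (its usual default); adjust to your setup.

```yaml
# prometheus.yml fragment on the management cluster (sketch only).
scrape_configs:
  - job_name: dcgm-exporter          # hypothetical job name
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets:
          - 10.0.0.11:9400           # placeholder GPU node IP, assumed exporter port
          - 10.0.0.12:9400           # placeholder GPU node IP, assumed exporter port
        labels:
          cluster: gpu-cluster       # hypothetical label to tell clusters apart
```

Static targets are used here only because the scraping Prometheus lives outside the GPU cluster; any other discovery mechanism that yields node IPs would work the same way.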
I found that hostNetwork is configurable in dcgm-exporter, for example:
spec:
  {{- if .Values.runtimeClassName }}
  runtimeClassName: {{ .Values.runtimeClassName }}
  {{- end }}
  priorityClassName: {{ .Values.priorityClassName | default "system-node-critical" }}
  {{- if .Values.hostNetwork }}
  hostNetwork: {{ .Values.hostNetwork }}
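As a minimal sketch, assuming the standalone dcgm-exporter Helm chart reads the hostNetwork key exactly as the template excerpt above does, the override could be a one-line values file. The file name is hypothetical.

```yaml
# values-hostnetwork.yaml (hypothetical file name) for the standalone dcgm-exporter chart.
# The key corresponds to .Values.hostNetwork in the template excerpt above.
hostNetwork: true
```

With this value set, the template renders hostNetwork: true into the pod spec, so the exporter binds to the node's own network namespace and is reachable on the node IP.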
Instead of enabling hostNetwork, would making the dcgm-exporter service a NodePort unblock you? If so, we can look into making the dcgm service configurable.
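A rough sketch of the NodePort alternative, written as a separate Service manifest. Everything specific here is an assumption to verify against your cluster: the Service name, the gpu-operator namespace, the app: nvidia-dcgm-exporter pod label, port 9400, and the nodePort value.

```yaml
# Sketch of a NodePort Service in front of the dcgm-exporter pods.
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter-nodeport       # hypothetical name
  namespace: gpu-operator            # assumed install namespace
spec:
  type: NodePort
  # Keep traffic on the node that received it, so scraping node A
  # returns node A's exporter rather than a randomly chosen pod.
  externalTrafficPolicy: Local
  selector:
    app: nvidia-dcgm-exporter        # assumed pod label; check with kubectl get pods --show-labels
  ports:
    - name: metrics
      port: 9400                     # assumed exporter port
      targetPort: 9400
      nodePort: 30940                # hypothetical; must fall in the NodePort range (default 30000-32767)
```

Without externalTrafficPolicy: Local, a NodePort request can be forwarded to an exporter pod on a different node, which would make per-node metrics unreliable for an external scraper.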
Hello, NVIDIA Team.
The dcgm-exporter template quoted above is from: https://github.com/NVIDIA/dcgm-exporter/blob/4cc1d199cd3b13b6edee96af5339708f9747f499/deployment/templates/daemonset.yaml#L53
But in gpu-operator, only the values linked below are configurable, and the Service can't be modified there: https://github.com/NVIDIA/gpu-operator/blob/752e8aed73c8c6141b545f56a0ed23e2a2b637a7/deployments/gpu-operator/values.yaml#L309C1-L328C20
Besides, there is no configurable section in the DaemonSet: https://github.com/NVIDIA/gpu-operator/blob/752e8aed73c8c6141b545f56a0ed23e2a2b637a7/assets/state-dcgm-exporter/0900_daemonset.yaml
So in this case, could you please add a hostNetwork option to the dcgmExporter section?
Thanks.