NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0
1.87k stars 303 forks source link

[Feature Request] Add hostNetwork mode for dcgmExporter #1086

Open jslouisyou opened 3 weeks ago

jslouisyou commented 3 weeks ago

Hello, NVIDIA Team.

I'm facing an issue while configurating dcgm-exporter from gpu-operator. I have 2 Kubernetes clusters - one is a cluster where GPU jobs run, and the other is used for managing the first cluster. In this case, Prometheus is not installed on the cluster where GPU jobs run, due to reduce CPU and memory resources as much can, and I'm trying to collect metrics using Prometheus from the other cluster.

I hope to set hostNetwork service for dcgm-exporter in order to get metrics from each nodes, but I can't find where it should be placed in gpu-operator helm chart (As I remembered this is useful when Prometheus is deployed outside of the Kubernetes cluster).

I found that hostNetwork can be configurable in dcgm-exporter, for example:

    spec:
      {{- if .Values.runtimeClassName }}
      runtimeClassName: {{ .Values.runtimeClassName }}
      {{- end }}
      priorityClassName: {{ .Values.priorityClassName | default "system-node-critical" }}
      {{- if .Values.hostNetwork }}
      hostNetwork: {{ .Values.hostNetwork }}

https://github.com/NVIDIA/dcgm-exporter/blob/4cc1d199cd3b13b6edee96af5339708f9747f499/deployment/templates/daemonset.yaml#L53

But in gpu-operator, only below values can be configurable and can't modify Service in here:

dcgmExporter:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 3.3.8-3.6.0-ubuntu22.04
  imagePullPolicy: IfNotPresent
  env:
    - name: DCGM_EXPORTER_LISTEN
      value: ":9400"
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_COLLECTORS
      value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
  resources: {}
  serviceMonitor:
    enabled: false
    interval: 15s
    honorLabels: false
    additionalLabels: {}
    relabelings: []

https://github.com/NVIDIA/gpu-operator/blob/752e8aed73c8c6141b545f56a0ed23e2a2b637a7/deployments/gpu-operator/values.yaml#L309C1-L328C20

Besides, there isn't configurable section in DaemonSet: https://github.com/NVIDIA/gpu-operator/blob/752e8aed73c8c6141b545f56a0ed23e2a2b637a7/assets/state-dcgm-exporter/0900_daemonset.yaml

So in this case, Could you please add hostNetwork option in dcgmExporter section?

Thanks.

tariq1890 commented 5 days ago

Instead of enabling hostNetwork, would making the dcgm-exporter service a NodePort unblock you? If so, we can look into making the dcgm service configurable.