NVIDIA / cloud-native-docs

Documentation repository for NVIDIA Cloud Native Technologies
https://docs.nvidia.com/datacenter/cloud-native/
Apache License 2.0

Mistake in the documentation of the Prometheus setup #1

Closed dmrub closed 6 months ago

dmrub commented 1 year ago

IMHO there is a bug in the documentation for setting up Prometheus: kubernetes/kube-prometheus.rst.

The following rule discovers all Kubernetes endpoints in the gpu-operator namespace and uses them to scrape metrics:

additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node

However, not all endpoints in the gpu-operator namespace provide Prometheus metrics. In particular, node-feature-discovery-master exposes only a gRPC endpoint on port 8080, which Prometheus cannot scrape. I changed the rule as follows so that this endpoint is dropped before scraping:

additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator
  relabel_configs:
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: drop
    regex: .*-node-feature-discovery-master
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
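
For anyone checking which endpoints the `drop` rule above would remove, here is a minimal Python sketch. It relies on the fact that Prometheus anchors relabel regexes at both ends, which `re.fullmatch` approximates for a simple pattern like this one. The endpoint names in `sample` and the helper `kept_endpoints` are hypothetical and only for illustration; the real names in a cluster depend on how the GPU Operator was installed.

```python
import re

# Same pattern as the `regex` field in the drop rule.
# Prometheus anchors relabel regexes on both ends; re.fullmatch mimics that.
DROP_REGEX = re.compile(r".*-node-feature-discovery-master")


def kept_endpoints(names):
    """Return the endpoint names that would survive the drop rule."""
    return [name for name in names if DROP_REGEX.fullmatch(name) is None]


# Hypothetical endpoint names, not taken from a real cluster.
sample = [
    "gpu-operator-node-feature-discovery-master",
    "nvidia-dcgm-exporter",
    "node-feature-discovery-master",
]

print(kept_endpoints(sample))
# → ['nvidia-dcgm-exporter', 'node-feature-discovery-master']
```

Note that the pattern requires a hyphen before `node-feature-discovery-master`, so an endpoint named exactly `node-feature-discovery-master` (no release prefix) would not be dropped. If that naming can occur in your cluster, the regex would need adjusting.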

elezar commented 1 year ago

Hi @dmrub. Feel free to propose a PR against this repo. ~or a merge request against https://gitlab.com/nvidia/cloud-native/cnt-docs (preferred).~

Update: We have migrated our documentation repo to GitHub and merge requests are no longer needed.

mikemckiernan commented 6 months ago

@dmrub, so sorry for the long delay in addressing this issue. Now it's addressed.

https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/kube-prometheus.html#about-setting-up-prometheus

Thanks very much for taking the time to open the issue and for using NVIDIA software!