Closed dmrub closed 6 months ago
Hi @dmrub. Feel free to propose a PR against this repo. ~or a merge request against https://gitlab.com/nvidia/cloud-native/cnt-docs (preferred).~
Update: We have migrated our documentation repo to GitHub and merge requests are no longer needed.
@dmrub, so sorry for the long delay in addressing this issue. Now it's addressed.
Thanks very much for taking the time to open the issue and for using NVIDIA software!
IMHO there is a bug in the documentation for setting up Prometheus: kubernetes/kube-prometheus.rst .
The following rule should collect all Kubernetes endpoints and use them to scrape metrics:
However, not all endpoints in the gpu-operator namespace provide Prometheus metrics. In particular, node-feature-discovery-master has only one gRPC endpoint on port 8080, which cannot be scraped. I have changed this rule as follows to fix the problem: