NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

Install broken on AKS #167

Open RaananHadar opened 3 years ago

RaananHadar commented 3 years ago

I am getting the following error when trying to install the helm chart on a standard AKS cluster:

Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
helm.go:81: [debug] unable to recognize "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"

dualvtable commented 3 years ago

Hi @RaananHadar - have you installed the Prometheus (kube-prometheus) operator? The Prometheus operator provides the ServiceMonitor CRD, so without it, installing the dcgm-exporter chart fails with the above error.

I will update our documentation to make this explicit.
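
For anyone hitting the same error, here is a minimal Python sketch of a pre-install check. It pulls the missing kind out of the helm message and looks for the matching CRD in the text produced by `kubectl get crd -o name`. The function names are hypothetical, and the pluralization rule (lowercase kind plus "s") is an assumption that happens to hold for ServiceMonitor but not for every kind.

```python
import re
from typing import Optional, Tuple

# Matches the 'no matches for kind "X" in version "Y"' part of the helm error.
MISSING_KIND_RE = re.compile(r'no matches for kind "([^"]+)" in version "([^"]+)"')


def parse_missing_kind(helm_error: str) -> Optional[Tuple[str, str]]:
    """Return (kind, api_version) from a helm error message, or None."""
    match = MISSING_KIND_RE.search(helm_error)
    return (match.group(1), match.group(2)) if match else None


def expected_crd_name(kind: str, api_version: str) -> str:
    """Guess the CRD name as '<lowercase kind>s.<api group>'.

    Assumption: naive pluralization, fine for ServiceMonitor only as a sketch.
    """
    group = api_version.split("/", 1)[0]
    return f"{kind.lower()}s.{group}"


def crd_installed(crd_name: str, kubectl_output: str) -> bool:
    """Check for crd_name in the text output of `kubectl get crd -o name`."""
    names = {
        line.strip().rsplit("/", 1)[-1]
        for line in kubectl_output.splitlines()
        if line.strip()
    }
    return crd_name in names
```

With the error from this issue, `parse_missing_kind` gives `("ServiceMonitor", "monitoring.coreos.com/v1")`, and `expected_crd_name` turns that into `servicemonitors.monitoring.coreos.com`. If `crd_installed` returns False for that name, the Prometheus operator CRDs are not on the cluster yet.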

RaananHadar commented 3 years ago

Thanks for the answer @dualvtable.

I've installed Prometheus using a very popular Helm chart from Grafana called loki-stack, which installs Grafana, Prometheus and Loki. It seems to depend on the community Prometheus Helm chart found here. This is the documented way to install Prometheus that Grafana recommends, so I expect a lot of users to follow this route as well.

I would appreciate it if you could suggest a workaround for people who install Prometheus this way. Many thanks.

sturfee-petrl commented 3 years ago

@RaananHadar Maybe this should be a separate issue, "Loki support". I also installed loki-stack and faced the same problem.

Or rename this one, please.

RaananHadar commented 3 years ago

@sturfee-petrl, as noted by @dualvtable, the issue is that installation is only possible with a CRD from 'kube-prometheus'. I don't think renaming the issue to 'loki support' would be accurate. Maybe 'community prometheus' is a better fit.