carlosedp / cluster-monitoring

Cluster monitoring stack for clusters based on Prometheus Operator
MIT License

[arm-exporter] No data in Prometheus & Grafana #92

Closed Razerban closed 3 years ago

Razerban commented 3 years ago

Describe the bug
Cannot get my nodes' metrics using arm-exporter.

Troubleshooting

  1. Which kind of Kubernetes cluster are you using? Single-node bare-metal Kubernetes (v1.19.2) cluster on a Raspberry Pi 4 8 GB (not using k3s, minikube, or microk8s).
  2. Are all pods in "Running" state? Yes.
  3. Your cluster already works with other applications that have HTTP/HTTPS? Yes.
  4. If you enabled persistence, does your cluster already provide persistent storage (PVs) to other applications? I didn't enable persistence. I use hostPath for the volumes.
  5. Does it provide dynamic storage through a StorageClass? No.

Customizations

  1. Did you customize vars.jsonnet? Put the contents below (a sketch of where this entry sits in the file follows this list):
{
  name: 'armExporter',
  enabled: true,
  file: import 'modules/arm_exporter.jsonnet',
}
  2. Did you change any other file? No.
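For orientation, the snippet above is one entry in the modules list of vars.jsonnet. Below is a minimal sketch of that list, assuming the stock layout of this repo; the comments and the placeholder for the other entries are illustrative, not copied from the file:

{
  // vars.jsonnet (sketch): only the modules list is shown here; the real file
  // carries additional top-level settings.
  modules: [
    {
      name: 'armExporter',
      enabled: true,                                // enables the arm-exporter DaemonSet
      file: import 'modules/arm_exporter.jsonnet',  // module definition shipped with the repo
    },
    // ...the other optional modules follow, each with the same three fields
  ],
}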

What did you see when trying to access Grafana and Prometheus web GUI
No data in Prometheus and Grafana.

Additional context
Logs extracted from the arm-exporter container running in the pod:

time="2020-09-22T15:58:19Z" level=info msg="Starting rpi_exporter(version=, branch=, revision=)" source="rpi_exporter.go:82"
time="2020-09-22T15:58:19Z" level=info msg="Build context(go=go1.14.8, user=, date=)" source="rpi_exporter.go:83"
time="2020-09-22T15:58:19Z" level=info msg="Listening on127.0.0.1:9243" source="rpi_exporter.go:115"
time="2020-09-22T15:58:48Z" level=error msg="gpu collector failed after 0.018082s: exit status 255" source="collector.go:142"
time="2020-09-22T15:59:18Z" level=error msg="gpu collector failed after 0.002620s: exit status 255" source="collector.go:142"
time="2020-09-22T15:59:48Z" level=error msg="gpu collector failed after 0.002244s: exit status 255" source="collector.go:142"
time="2020-09-22T16:00:18Z" level=error msg="gpu collector failed after 0.001709s: exit status 255" source="collector.go:142"
time="2020-09-22T16:00:48Z" level=error msg="gpu collector failed after 0.003235s: exit status 255" source="collector.go:142"
time="2020-09-22T16:01:18Z" level=error msg="gpu collector failed after 0.001515s: exit status 255" source="collector.go:142"
time="2020-09-22T16:01:48Z" level=error msg="gpu collector failed after 0.002127s: exit status 255" source="collector.go:142"
Razerban commented 3 years ago

In addition to the details already mentioned, when I connect to the containers inside the pod and try to run /opt/vc/bin/vcgencmd measure_temp, I get the following error:

/ # /opt/vc/bin/vcgencmd measure_temp
VCHI initialization failed

User details:

/ # whoami
root
/ # groups
root bin daemon sys adm disk wheel floppy dialout tape video
carlosedp commented 3 years ago

There were some discussions in the original code for the exporter; check https://github.com/lukasmalkmus/rpi_exporter. Maybe it's some incompatibility with the RPi 4. I don't have one to test.

fonsecas72 commented 3 years ago

Just trying to help. I don't think it's an incompatibility with the RPi 4, as I've been using it (4 RPis with 4 GB each, k3s 1.18, Ubuntu 18.04 and 20.04).

carlosedp commented 3 years ago

Just to be sure @fonsecas72, the dashboard reports the temperature correctly?

fonsecas72 commented 3 years ago

Yup, it seems so:

[screenshot: dashboard temperature panel]

cat /sys/class/thermal/thermal_zone0/temp
55017
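(For reference, /sys/class/thermal/thermal_zone0/temp reports millidegrees Celsius, so 55017 is about 55.0 °C.)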
carlosedp commented 3 years ago

Thanks! @Razerban it might be something blocking Prometheus from collecting the metrics on your Pis, or something in their Linux install that does not expose the metrics, like missing kernel modules.

pascal71 commented 3 years ago

/ # /opt/vc/bin/vcgencmd measure_temp
VCHI initialization failed

vcgencmd needs access to /dev in the container. This is forbidden by default. You will need to elevate the privileges the container is running with inside your pod.
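To make that concrete, here is a rough sketch, in plain Kubernetes-style jsonnet, of the kind of patch the arm-exporter DaemonSet needs. The field grouping, the hidden-field names containerPatch/podSpecPatch, and the volume name dev-vchiq are illustrative assumptions, not taken from the actual fix:

{
  // Container side: run privileged (or at least expose the VideoCore device)
  // so /opt/vc/bin/vcgencmd can open /dev/vchiq.
  containerPatch:: {
    securityContext: { privileged: true },
    volumeMounts+: [
      { name: 'dev-vchiq', mountPath: '/dev/vchiq' },  // volume name is made up for this sketch
    ],
  },
  // Pod side: the matching hostPath volume for the device node.
  podSpecPatch:: {
    volumes+: [
      { name: 'dev-vchiq', hostPath: { path: '/dev/vchiq' } },
    ],
  },
}

Whether privileged: true is strictly required, or a device mount alone is enough, depends on how vcgencmd talks to the VideoCore on your OS image; the change that actually landed is in the PR referenced below.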

pascal71 commented 3 years ago

I have submitted a PR for it

carlosedp commented 3 years ago

Fixed by #97

Razerban commented 3 years ago

I can confirm that the issue is no longer present. Thank you for your help!