carlosedp / cluster-monitoring

Cluster monitoring stack for clusters based on Prometheus Operator
MIT License

CPU Temperature monitor giving Pod IPs instead of node IPs, so DNS names don't display #43

Closed · geerlingguy closed this issue 4 years ago

geerlingguy commented 4 years ago

I noticed the node-exporter reports node IPs rather than Kubernetes Pod IPs, so when the dashboard is displayed in Grafana, I see my node DNS names, like worker-01, worker-02, etc.

[Screenshot: "Kubernetes cluster monitoring (via Prometheus)" Grafana dashboard showing node DNS names]

The CPU temperature monitor data, though, uses Pod IPs instead of node IPs, so the data is harder to discern: I have to manually map each Pod IP to the node that Pod is running on:

[Screenshot: CPU temperature panel labeled by Pod IPs]

geerlingguy commented 4 years ago

Should this issue be opened against the arm_exporter build repo? https://github.com/carlosedp/docker-arm_exporter

geerlingguy commented 4 years ago

Or... against the upstream rpi_exporter repo? https://github.com/lukasmalkmus/rpi_exporter

It seems like it would be arm_exporter's responsibility, since that part is K8s-specific, though we might have to tweak rpi_exporter to be able to report the node IP instead of the pod IP.

geerlingguy commented 4 years ago

It looks like there's a related issue here: https://github.com/lukasmalkmus/rpi_exporter/issues/2

carlosedp commented 4 years ago

Looking through this.

carlosedp commented 4 years ago

Can you check if https://github.com/carlosedp/cluster-monitoring/pull/45 fixes this?

geerlingguy commented 4 years ago

@carlosedp - Testing now. Thanks!

geerlingguy commented 4 years ago

Things seemed to be good, but after a few minutes I got a CrashLoopBackOff. I've been doing a ton of messing around, though, so it could be something on my end. I'm going to test your change for #42 to make it work better out of the box, then rebuild the cluster and test this again.

carlosedp commented 4 years ago

Could you check the logs for the error that caused the CrashLoopBackOff? It might not be related to this... but who knows :)

geerlingguy commented 4 years ago

The logs made it look like the pods started up fine; that's why I think it might be some other misconfiguration I had. I'm rebuilding the configs now with this PR applied, and I'll see if it works this time around.

geerlingguy commented 4 years ago

@carlosedp - Strange, it's still hitting the CrashLoopBackOff with this change, even after a fresh rebuild of the cluster. Here are the logs from one of the crashing pods:

# kubectl logs -n monitoring arm-exporter-4ccdw -c arm-exporter
time="2020-05-25T18:15:50Z" level=info msg="Starting rpi_exporter(version=, branch=, revision=)" source="rpi_exporter.go:82"
time="2020-05-25T18:15:50Z" level=info msg="Build context(go=go1.14.2, user=, date=)" source="rpi_exporter.go:83"
time="2020-05-25T18:15:50Z" level=info msg="Listening on127.0.0.1:9243" source="rpi_exporter.go:115"
geerlingguy commented 4 years ago

Ah... it looks like that container is fine; it's the kube-rbac-proxy container in the same pod that's failing:

# kubectl logs -n monitoring arm-exporter-4ccdw -c kube-rbac-proxy
I0525 18:20:07.060798       1 main.go:186] Valid token audiences: 
I0525 18:20:07.061127       1 main.go:232] Generating self signed cert as no cert is provided
I0525 18:20:21.490236       1 main.go:281] Starting TCP socket on 10.0.100.163:9243
F0525 18:20:21.490836       1 main.go:284] failed to listen on secure address: listen tcp 10.0.100.163:9243: bind: cannot assign requested address

That container can't bind --secure-listen-address to the host IP, only to the pod IP.
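
One way around that is to hand the proxy whatever IP the pod actually gets, using the downward API. A minimal sketch of the idea, not the exact manifest from PR #45 (the image tag is assumed for illustration):

containers:
- name: kube-rbac-proxy
  image: carlosedp/kube-rbac-proxy:v0.5.0    # image/tag assumed for illustration
  env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP              # downward API injects the pod's IP
  args:
  - --secure-listen-address=$(POD_IP):9243   # bind to the pod IP, not the host IP
  - --upstream=http://127.0.0.1:9243/        # rpi_exporter listens on loopback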

carlosedp commented 4 years ago

Ah yes, let me try another thing...

carlosedp commented 4 years ago

The latest push to PR #45 might do it.

geerlingguy commented 4 years ago

Testing.

geerlingguy commented 4 years ago

The rbac proxy seems to be working now; here are its logs:

# kubectl logs -n monitoring arm-exporter-x65cm -c kube-rbac-proxy
I0525 19:18:51.194470       1 main.go:186] Valid token audiences: 
I0525 19:18:51.196550       1 main.go:232] Generating self signed cert as no cert is provided
I0525 19:19:06.708033       1 main.go:281] Starting TCP socket on 10.42.0.25:9243
I0525 19:19:06.709232       1 main.go:288] Listening securely on 10.42.0.25:9243
2020/05/25 19:19:11 http: proxy error: dial tcp 127.0.0.1:9243: connect: connection refused

But the other container is now failing. If I do a kubectl describe pod on it, I get:

# kubectl describe pod arm-exporter-x65cm -n monitoring
...
  Normal   Created    23s (x4 over 72s)  kubelet, turing-master  Created container arm-exporter
  Warning  Failed     22s (x4 over 71s)  kubelet, turing-master  Error: failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"/bin/sh -c\": stat /bin/sh -c: no such file or directory": unknown

It looks like that container image doesn't have sh in it.
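
Worth noting: that exact error can also appear when the image does have a shell but the manifest writes the command as a single string, so the runtime looks for a binary literally named "/bin/sh -c". A hypothetical fragment showing the difference (the exporter path is made up for illustration):

command: ["/bin/sh -c", "/usr/local/bin/rpi_exporter"]      # broken: treated as one executable path
command: ["/bin/sh", "-c", "/usr/local/bin/rpi_exporter"]   # working: each word is its own element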

carlosedp commented 4 years ago

Tried another strategy... just pushed. Sorry about this trial and error... it might take some time until I can get an ARM board to test this myself :)

geerlingguy commented 4 years ago

Haha, no problem. I only work on the ARM-based stuff in bursts, and right now I have two clusters running for some tests, plus a couple of other Pis, so I'm happy to test! Working on testing this config now.

geerlingguy commented 4 years ago

They all start now, but it's still reporting the pod IPs:

[Screenshot: CPU temperature panel still showing Pod IPs]

I confirmed that the deployed manifest was correct:

        volumeMounts:                                                  
        - mountPath: /etc/nodename                                     
          name: hostname                                               
          readOnly: true 
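
For reference, a sketch of the volumes stanza that would back that mount, assuming the node's /etc/hostname is the source (the exact manifest is in PR #45):

        volumes:
        - name: hostname
          hostPath:
            path: /etc/hostname    # node's hostname file, surfaced in the pod as /etc/nodename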

And I even confirmed the mount is reporting the correct hostname:

# kubectl exec -n monitoring arm-exporter-4wstw -c arm-exporter cat /etc/nodename
worker-04

Could there be a setting in the Prometheus config that's using the wrong value? Nope, see the next comment. It seems to be something with rpi_exporter not picking up /etc/nodename?

geerlingguy commented 4 years ago

Grafana's getting the following data:

__name__:"rpi_cpu_temperature_celsius"
endpoint:"https"
instance:"10.42.0.26:9243"
job:"arm-exporter"
namespace:"monitoring"
pod:"arm-exporter-rzg8k"
service:"arm-exporter"

And the legend label is {{instance}}, so it's not something that can be tweaked on Grafana's side.

And in Prometheus, that's all it's getting too:

rpi_cpu_temperature_celsius{endpoint="https",instance="10.42.0.26:9243",job="arm-exporter",namespace="monitoring",pod="arm-exporter-rzg8k",service="arm-exporter"}

Exporter build info:

rpi_exporter_build_info{endpoint="https",goversion="go1.14.2",instance="10.42.0.26:9243",job="arm-exporter",namespace="monitoring",pod="arm-exporter-rzg8k",service="arm-exporter"}

carlosedp commented 4 years ago

The exporter should be adding a label with the node hostname... to match the other collectors. Give me some time to check on this.

carlosedp commented 4 years ago

Got it now. I added a relabeling rule so the node name replaces the IP. The solution was easier than I thought :)
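
For anyone following along, the idea is a relabeling on the arm-exporter ServiceMonitor that copies the pod's node name over the instance label. A minimal sketch of that rule (the selector label is assumed for illustration; the exact change is in PR #45):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arm-exporter
  namespace: monitoring
spec:
  endpoints:
  - port: https
    relabelings:
    - action: replace
      sourceLabels: [__meta_kubernetes_pod_node_name]   # node the pod runs on
      targetLabel: instance                             # overwrites the IP:port value
  selector:
    matchLabels:
      k8s-app: arm-exporter    # selector label assumed for illustration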

geerlingguy commented 4 years ago

Testing it out!

geerlingguy commented 4 years ago

It works!!! Thanks!

[Screenshot: CPU temperature panel now showing node names]

carlosedp commented 4 years ago

Awesome! I've set up a K3s server on an ARM board I had lying around here to avoid bothering you... heheh. I had a spare SD card with a ready OS.

geerlingguy commented 4 years ago

Nice! Well, at this point all the minor nitpicks are sorted and everything is working perfectly. I'm planning on featuring this as part of the next Turing Pi cluster video on my YouTube channel, so thanks for making it a perfect demo instead of a 'mostly-perfect' demo :)