carlosedp / cluster-monitoring

Cluster monitoring stack for clusters based on Prometheus Operator
MIT License

'arm-exporter' isn't running on master node, only workers #40

Closed: geerlingguy closed this 4 years ago

geerlingguy commented 4 years ago

I noticed while debugging #39 that the arm-exporter DaemonSet was only running on 6 out of 7 Pi nodes. It was not running on the master node.

The master has the following taint:

Taints:             k3s-controlplane=true:NoExecute

But that taint doesn't seem to prevent the node-exporter DaemonSet from deploying a Pod there:

# kubectl get ds -n monitoring
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
node-exporter   7         7         7       7            7           kubernetes.io/os=linux        27m
arm-exporter    6         6         6       6            6           beta.kubernetes.io/arch=arm   37m

The arch is arm on all 7 Pis, so I don't see how the node selector could be what's keeping the DS off the master.

carlosedp commented 4 years ago

I think this is because the master has a taint that prevents it from executing workloads. I'll look into how to work around this for this pod.

geerlingguy commented 4 years ago

Yeah; I'm just wondering how node-exporter overcomes that and ensures it's also running on the master, even though it has the same taint.

geerlingguy commented 4 years ago

Supposedly this is the original fix for node-exporter-daemonset: https://github.com/coreos/prometheus-operator/pull/610/files

However, looking at the current version that's deployed I see:

      tolerations:
      - operator: Exists

geerlingguy commented 4 years ago

It looks like that change is from this commit: https://github.com/coreos/kube-prometheus/commit/e4ff0f874638c1ec27f4f7b48b88b526045ebdf1#diff-22d9db260fb0516976f57a293971723d, which was generated from https://github.com/coreos/kube-prometheus/blob/5b9341cad63a30f8d1d1e008eccdc93f371caab3/jsonnet/kube-prometheus/node-exporter/node-exporter.libsonnet#L80

That, in turn, looks like it came from: https://github.com/coreos/kube-prometheus/commit/272ff23cb68f9fb06411cc316ca37fd8176ccff5

So maybe that's all that's required?

geerlingguy commented 4 years ago

I'm currently doing a cluster rebuild, and I'll test this toleration once it's back up and running. In Pi time, though, that'll be an hour or so :D

geerlingguy commented 4 years ago

Adding this to the arm-exporter DaemonSet:

    spec:
      tolerations:
      - operator: Exists

Worked! The DS is on all 7 nodes now, just like the node-exporter. I'll work on a PR momentarily.
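
For anyone landing here later, this is roughly where that toleration sits in a DaemonSet manifest. It's a fragment, not the project's actual arm-exporter manifest: the `name` and `namespace` come from the `kubectl get ds` output above, and everything else (selector, labels, containers) is omitted. A bare `operator: Exists` with no `key` tolerates every taint. The commented-out entry is an assumed narrower alternative that would match only the `k3s-controlplane=true:NoExecute` taint shown earlier.

```yaml
# Fragment only: selector, labels, and containers are omitted.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: arm-exporter
  namespace: monitoring
spec:
  template:
    spec:                      # Pod spec; tolerations live here
      tolerations:
      # No key + operator Exists: tolerate every taint on any node.
      - operator: Exists
      # Assumed narrower alternative (not tested in this thread):
      # tolerate only the k3s control-plane taint.
      # - key: k3s-controlplane
      #   operator: Equal
      #   value: "true"
      #   effect: NoExecute
```

Once this is applied, `kubectl get ds -n monitoring` should report the same DESIRED count for arm-exporter as for node-exporter.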