carlosedp / cluster-monitoring

Cluster monitoring stack for clusters based on Prometheus Operator
MIT License

CPU Temperature alert giving 'No data' on Raspberry Pi CM3+ #39

Closed · geerlingguy closed this issue 4 years ago

geerlingguy commented 4 years ago

I'm testing this out on a Turing Pi cluster, with 7 Pi Compute Module 3+ boards.

On my Grafana dashboard, I'm seeing no data for CPU temperature:

[Screenshot: Grafana dashboard panel showing "No data" for CPU temperature]

I'm going to dig in and see where the monitor is supposed to be running. I modified the vars.jsonnet file like so, for my cluster:

{
  _config+:: {
    namespace: 'monitoring',
  },
  // Enable or disable additional modules
  modules: [
    {
      // After deployment, run the create_gmail_auth.sh script from scripts dir.
      name: 'smtpRelay',
      enabled: false,
      file: import 'smtp_relay.jsonnet',
    },
    {
      name: 'armExporter',
      enabled: true,
      file: import 'arm_exporter.jsonnet',
    },
    {
      name: 'upsExporter',
      enabled: false,
      file: import 'ups_exporter.jsonnet',
    },
    {
      name: 'metallbExporter',
      enabled: false,
      file: import 'metallb.jsonnet',
    },
    {
      name: 'traefikExporter',
      enabled: false,
      file: import 'traefik.jsonnet',
    },
    {
      name: 'elasticExporter',
      enabled: false,
      file: import 'elasticsearch_exporter.jsonnet',
    },
  ],

  k3s: {
    enabled: true,
    master_ip: ['10.0.100.163'],
  },

  // Domain suffix for the ingresses
  suffixDomain: '10.0.100.74.nip.io',
  // If TLSingress is true, a self-signed HTTPS ingress with redirect will be created
  TLSingress: true,
  // If UseProvidedCerts is true, provided files will be used on created HTTPS ingresses.
  // Use a wildcard certificate for the domain like ex. "*.192.168.99.100.nip.io"
  UseProvidedCerts: false,
  TLSCertificate: importstr 'server.crt',
  TLSKey: importstr 'server.key',

  // Setting these to false, defaults to emptyDirs
  enablePersistence: {
    prometheus: false,
    grafana: false,
  },

  // Grafana "from" email
  grafana: {
    from_address: 'myemail@example.com',
  },
}
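For reference, here's roughly how I'm regenerating and applying the manifests after editing vars.jsonnet. This is a minimal sketch, assuming the repo's Makefile rebuilds manifests/ from the jsonnet sources (run from the repo root):

# Fetch jsonnet dependencies (first run only), rebuild, then apply:
make vendor
make                          # regenerates manifests/ from vars.jsonnet
kubectl apply -f manifests/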
geerlingguy commented 4 years ago

Checking on the DaemonSet:

# kubectl get ds -n monitoring
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
node-exporter   7         7         7       7            7           kubernetes.io/os=linux          140m
arm-exporter    0         0         0       0            0           beta.kubernetes.io/arch=arm64   11m
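When a DaemonSet shows 0 desired pods, comparing its nodeSelector against the labels the nodes actually carry usually explains it; a sketch using the names above:

# Compare the DaemonSet's selector with the nodes' actual arch labels:
kubectl get ds arm-exporter -n monitoring \
  -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
kubectl get nodes -L kubernetes.io/arch,beta.kubernetes.io/arch
# 0 desired pods means no node matched the selector.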
geerlingguy commented 4 years ago

Also, after editing the vars.jsonnet file and re-running make on everything, I'm getting an unrelated error:

failed: [10.0.100.163] (item=/home/pirate/cluster-monitoring/manifests/node-exporter-daemonset.yaml) => {"ansible_loop_var": "item", "changed": false, "error": 422, "item": "/home/pirate/cluster-monitoring/manifests/node-exporter-daemonset.yaml", "msg": "Failed to patch object: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"DaemonSet.apps \\\\\"node-exporter\\\\\" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{\\\\\"app\\\\\":\\\\\"node-exporter\\\\\", \\\\\"app.kubernetes.io/name\\\\\":\\\\\"node-exporter\\\\\"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable\",\"reason\":\"Invalid\",\"details\":{\"name\":\"node-exporter\",\"group\":\"apps\",\"kind\":\"DaemonSet\",\"causes\":[{\"reason\":\"FieldValueInvalid\",\"message\":\"Invalid value: v1.LabelSelector{MatchLabels:map[string]string{\\\\\"app\\\\\":\\\\\"node-exporter\\\\\", \\\\\"app.kubernetes.io/name\\\\\":\\\\\"node-exporter\\\\\"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable\",\"field\":\"spec.selector\"}]},\"code\":422}\\n'", "reason": "Unprocessable Entity", "status": 422}

(That's from Ansible; basically it's trying to change the selector, but that field is immutable. Seems unrelated to the arm-exporter though... I'm not sure why the DaemonSet has 0 desired instances.)
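Since spec.selector on a DaemonSet is immutable once created, the usual workaround is to delete and re-create the object rather than patch it in place; a hedged sketch:

# spec.selector is immutable, so replace the DaemonSet instead of patching:
kubectl delete ds node-exporter -n monitoring
kubectl apply -f manifests/node-exporter-daemonset.yaml
# (Or in one step: kubectl replace --force -f manifests/node-exporter-daemonset.yaml)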

geerlingguy commented 4 years ago

I wonder if this issue might be related to something like https://github.com/kubernetes/kubernetes/issues/51785. I'm going to manually delete the daemonset and re-create it and see if that works.

geerlingguy commented 4 years ago

Ah... I think this is the issue:

      nodeSelector:
        beta.kubernetes.io/arch: arm64

It seems that label is deprecated (https://kubernetes.io/docs/reference/kubernetes-api/labels-annotations-taints/#betakubernetesioarch-deprecated).

But in my case, I switched it to arm and the pods were deployed. (I'm running HypriotOS, where the nodes report kubernetes.io/arch=arm rather than arm64; Raspbian would be the same.)
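For anyone hitting the same thing, a quick way to test the fix without regenerating manifests is to patch the selector on the live DaemonSet. A sketch (~1 is the JSON Pointer escape for the / in the label key):

# Swap the arch value in the pod template's nodeSelector:
kubectl -n monitoring patch ds arm-exporter --type=json -p \
  '[{"op":"replace","path":"/spec/template/spec/nodeSelector/beta.kubernetes.io~1arch","value":"arm"}]'

Unlike spec.selector, the pod template is mutable, so this just rolls the pods.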

geerlingguy commented 4 years ago

And I'm getting CPU temperature values, yay!

[Screenshot: Grafana dashboard now showing CPU temperature values]

Is there a way to set it to match either arm64 or arm?

geerlingguy commented 4 years ago

Although now I'm not getting the other values (CPU, memory, etc.), only the temperature data and up/down status :/ Possibly something silly I did, though. I might just wipe the cluster and reinstall, now that there are a number of small tweaks.

[Edit: I think that's because after everything was deployed, I deleted the node-exporter DaemonSet and re-created it; Prometheus was showing 0/0 exporters discovered, so I'm just re-deploying everything now.]
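For anyone debugging that 0/0 state: checking that the exporter Service still has endpoints and that a ServiceMonitor exists to select it is a reasonable first step. A sketch (the label comes from the error output above, so treat the names as assumptions for other setups):

# Prometheus needs the Service to have endpoints and a ServiceMonitor to select it:
kubectl get endpoints node-exporter -n monitoring
kubectl get servicemonitors -n monitoring
kubectl get pods -n monitoring -l app=node-exporter -o wide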

So back to the original question above—can we make it a list of arm64 or arm (instead of one or the other)? Everything seems to work fine on regular arm.

geerlingguy commented 4 years ago

I replaced the nodeSelector with affinity, and it's now scheduling things correctly:

    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - arm
                - arm64
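requiredDuringSchedulingIgnoredDuringExecution keeps the hard constraint of a nodeSelector while allowing a list of values, which is exactly what a nodeSelector's single key=value can't express. A quick way to confirm the change, using the DaemonSet name from earlier in the thread:

# Desired/ready counts should now match the number of arm/arm64 nodes:
kubectl get ds arm-exporter -n monitoring
kubectl get pods -n monitoring -o wide | grep arm-exporter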

I may submit a PR in a few minutes; I'm also testing the fix for #40 first.

geerlingguy commented 4 years ago

PR #42 filed for this. It's in draft, though, because jsonnet fails with a missing field error.

bee-san commented 1 year ago

@geerlingguy funny story, I am watching your Pi cluster video and I have the same problem -- funny to see you here 😂 Great video, thanks <3