carlosedp / cluster-monitoring

Cluster monitoring stack for clusters based on Prometheus Operator
MIT License
740 stars 200 forks

For questions, doubts, or guidance, please use Discussions. Don't open a new Issue. #91

Closed carlosedp closed 2 years ago

carlosedp commented 4 years ago

Since I don't have the resources or time to address every question about the deployments, the Issues section is meant for reporting problems or improvements to the stack.

This issue is the place to post questions, which I or any community member will answer on a best-effort basis.

If you deployed the monitoring stack and some targets are not available or show no metrics in Grafana, make sure you didn't have iptables rules or a firewall configured on your nodes before deploying Kubernetes.

If you don't want to receive further notifications, click "Unsubscribe" in the right bar, right above the participants list.

carlosedp commented 3 years ago

@carlosedp I've run into an issue where the ingresses often change location, and I can no longer load the Grafana page when that happens:

[2020-10-31 09:17:31+2 ✘][~]
[james@tpin1][$ sudo kubectl get ingress -o wide -n monitoring
NAME                CLASS    HOSTS                             ADDRESS       PORTS     AGE
grafana             <none>   grafana.10.10.50.24.nip.io        10.10.50.23   80, 443   3d14h
alertmanager-main   <none>   alertmanager.10.10.50.24.nip.io   10.10.50.23   80, 443   3d14h
prometheus-k8s      <none>   prometheus.10.10.50.24.nip.io     10.10.50.23   80, 443   3d14h

I tried to run the make target to update the ingress suffix, but wasn't quite sure if that was the right command to fix the problem?

You can edit the manifests themselves, or change vars.jsonnet and regenerate the manifests, then apply them to the cluster again.

StianHaug commented 3 years ago

I have this up and running on my Pi 4 cluster and it works very well (thank you very much, it's awesome!). I have a microcontroller running in some custom hardware; it has a webserver that exposes metrics at `192.168.xx.xx/metrics` and I am trying to add it as a target for Prometheus to pull the data from.

Normally I think it would be set up in the prometheus.yml file something like this:

scrape_configs:
 - job_name: "somemetrics"
   scrape_interval: 5s
   scrape_timeout: 5s
   static_configs:
    - targets: ['192.168.xx.xx']

I tried to add the same configuration to the vars.jsonnet file like this:

  prometheus: {
    retention: '15d',
    scrapeInterval: '30s',
    scrapeTimeout: '30s',
    scrape_configs: [
      {
        job_name: 'somemetrics',
        scrape_interval: '5s',
        scrape_timeout: '5s',
        static_configs: [
          {
            targets: [
              '192.168.xx.xx',
            ],
          },
        ],
      },
    ],
  },

However, this does not seem to work and it does not appear as a target under the /targets page in the Prometheus web interface. I might be missing something obvious here, so please point me in the right direction if I am. My question is essentially: how do I set up my targets as mentioned above? Is it supposed to be done via the vars.jsonnet file, or am I going about this wrong?
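For context: this Prometheus is managed by the Prometheus Operator, which generates the configuration itself instead of reading a hand-written prometheus.yml, so an extra scrape_configs key in vars.jsonnet is simply ignored unless the jsonnet templates use it. The operator-native way to add a static job like this is an additionalScrapeConfigs Secret referenced from the Prometheus custom resource; a minimal sketch, where the Secret name, key and the :80 port are assumptions:

apiVersion: v1
kind: Secret
metadata:
  name: additional-scrape-configs
  namespace: monitoring
stringData:
  additional-scrape-configs.yaml: |
    - job_name: somemetrics
      scrape_interval: 5s
      scrape_timeout: 5s
      static_configs:
        - targets: ['192.168.xx.xx:80']

and then, in the Prometheus custom resource that this stack creates (the one behind the prometheus-k8s-0 pod), a spec fragment pointing at that Secret:

spec:
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: additional-scrape-configs.yaml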

wargfn commented 3 years ago

Looking at this, your targets also need to be on the same container network that your Prometheus stack is on, OR you need to accept incoming container port forwarding. If you followed the steps in the Make Deploy, it created a monitoring network; you need to enable ingress into the Prometheus controller.

StianHaug commented 3 years ago

Thanks, I'll look more into the ingress part of this tomorrow. However, I was under the impression that ingress was used to manage external access IN to the services in the containers, like a client outside the cluster getting access to the web interfaces for Prometheus and Grafana. In my case, however, 192.168.xx.xx is a webserver outside the cluster on the local network. The /metrics page on that webserver would be accessed by the Prometheus scraper acting as a client. So would I still need an ingress for the Prometheus scraper to access a webserver outside the cluster?

robmit68 commented 3 years ago

Carlos and team, I have added a Prometheus snmp-exporter to this awesome deployment, in the same monitoring cluster, in order to scrape SNMP-managed Cisco devices. I deployed the snmp-exporter as you suggested, via Helm charts, in the same monitoring namespace. What I need assistance with in my particular situation is: how can I update the prometheus.yml file to start scraping these SNMP-managed devices, and how can I add the snmp.yml into the prometheus-k8s-0 container? I appreciate the support.

exArax commented 3 years ago

Hello,

How can we add a new dashboard with some panels to grafana-dashboardDefinitions? Could this be done by adding an extra JSON document to grafana-dashboardDefinitions.yaml, or do I have to do something more? What I was thinking of is to deploy the stack, add a new dashboard with the web GUI, and then copy the JSON it produces into grafana-dashboardDefinitions. Is any other approach available? When I follow this approach, kubectl says that the appropriate ConfigMap is created, but none of my custom dashboards appear in the dashboard list.

EDIT: The solution was to add some extra lines to the Grafana deployment file in the volume mounts section and to declare the ConfigMap below, in the volumes section.
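For anyone finding this later, a rough sketch of that approach, assuming the kube-prometheus-style layout this stack generates (the ConfigMap name, file name and mount path are assumptions; match them to the entries already present in your generated grafana-deployment.yaml):

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-my-custom
  namespace: monitoring
data:
  my-custom-dashboard.json: |
    { "title": "My custom dashboard", "schemaVersion": 16, "panels": [] }

and, in the Grafana Deployment, alongside the existing dashboard mounts:

        volumeMounts:
          - name: grafana-dashboard-my-custom
            mountPath: /grafana-dashboard-definitions/0/my-custom
            readOnly: false
      volumes:
        - name: grafana-dashboard-my-custom
          configMap:
            name: grafana-dashboard-my-custom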

tylerlittlefield commented 3 years ago

Update on the stuff below

I've switched from Wi-Fi to Ethernet on the Raspberry Pi cluster and everything just works. So for anyone out there experiencing an issue like the one I mention below, you might try Ethernet instead.


@rur0 Did you ever resolve the issue? I am getting the same thing. On a fresh Raspberry Pi cluster running k3s on Ubuntu Server 20.10, I see the following:

root@main:~/cluster-monitoring# kubectl get pods -n monitoring
NAME                                  READY   STATUS                 RESTARTS   AGE
node-exporter-p8qk8                   2/2     Running                0          24s
prometheus-adapter-9c79c98f7-wfh2n    1/1     Running                0          22s
arm-exporter-h22xf                    2/2     Running                0          35s
arm-exporter-rpbqb                    0/2     CreateContainerError   0          35s
kube-state-metrics-857f95d994-crhd4   0/3     CrashLoopBackOff       3          25s
prometheus-operator-67586fc88-dlfbt   0/2     RunContainerError      6          53s
grafana-655d666bbb-lt7xv              0/1     RunContainerError      2          26s
node-exporter-7fg5z                   0/2     CrashLoopBackOff       4          24s
root@main:~/cluster-monitoring# kubectl describe pod -n monitoring arm-exporter-rpbqb
Events:
  Type     Reason          Age                    From                Message
  ----     ------          ----                   ----                -------
  Normal   Scheduled       <unknown>              default-scheduler   Successfully assigned monitoring/arm-exporter-rpbqb to worker-01
  Warning  Failed          2m22s                  kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task 939b8c9fb00b22dcfd7f10ef4815f330f3b56204b40a640d12127f0c7733ce2a not found: not found
  Warning  Failed          2m22s                  kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task 939b8c9fb00b22dcfd7f10ef4815f330f3b56204b40a640d12127f0c7733ce2a not found: not found
  Warning  Failed          2m18s                  kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task 197d303f31f3e16d5ca686ad3d783292d6d8e9607bc452bec3010d09319d8c7c not found: not found
  Warning  Failed          2m18s                  kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task 197d303f31f3e16d5ca686ad3d783292d6d8e9607bc452bec3010d09319d8c7c not found: not found
  Warning  Failed          2m15s                  kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task ddfbdfe9a1372779e41ab400bcd0cf47722f722fb48060dc4d0e62609d87669e not found: not found
  Warning  Failed          2m15s                  kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task ddfbdfe9a1372779e41ab400bcd0cf47722f722fb48060dc4d0e62609d87669e not found: not found
  Normal   Pulled          2m11s (x4 over 2m22s)  kubelet, worker-01  Successfully pulled image "carlosedp/arm_exporter:latest"
  Warning  Failed          2m11s                  kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task b87a5e0a9eae8daf52cabe9cf7c838f57cd2f738e47bfc81e88b99d789d0b698 not found: not found
  Normal   Pulled          2m11s (x4 over 2m22s)  kubelet, worker-01  Container image "carlosedp/kube-rbac-proxy:v0.5.0" already present on machine
  Warning  Failed          2m11s                  kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task b87a5e0a9eae8daf52cabe9cf7c838f57cd2f738e47bfc81e88b99d789d0b698 not found: not found
  Normal   SandboxChanged  2m11s (x4 over 2m21s)  kubelet, worker-01  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling         2m6s (x5 over 2m23s)   kubelet, worker-01  Pulling image "carlosedp/arm_exporter:latest"
root@main:~/cluster-monitoring# kubectl describe pod -n monitoring kube-state-metrics-857f95d994-crhd4
Events:
  Type     Reason          Age                   From                Message
  ----     ------          ----                  ----                -------
  Normal   Scheduled       <unknown>             default-scheduler   Successfully assigned monitoring/kube-state-metrics-857f95d994-crhd4 to worker-01
  Warning  FailedMount     4m4s (x2 over 4m6s)   kubelet, worker-01  MountVolume.SetUp failed for volume "kube-state-metrics-token-t5fx4" : failed to sync secret cache: timed out waiting for the condition
  Warning  Failed          4m                    kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task a164f027602367fe02063e752a2f57874cf803c21033ff40b0163b00d326e41f not found: not found
  Warning  Failed          4m                    kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task a164f027602367fe02063e752a2f57874cf803c21033ff40b0163b00d326e41f not found: not found
  Warning  Failed          4m                    kubelet, worker-01  Error: failed to create containerd task: OCI runtime create failed: container_linux.go:341: creating new parent process caused "container_linux.go:1923: running lstat on namespace path \"/proc/1072790/ns/ipc\" caused \"lstat /proc/1072790/ns/ipc: no such file or directory\"": unknown
  Warning  Failed          3m58s                 kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task 4c9449c09877664bcebb21b0be7f7d1074b2069071c88296ad7dd193cf986f18 not found: not found
  Normal   Pulled          3m58s (x2 over 4m1s)  kubelet, worker-01  Container image "carlosedp/kube-state-metrics:v1.9.6" already present on machine
  Normal   Created         3m58s (x2 over 4m1s)  kubelet, worker-01  Created container kube-state-metrics
  Warning  Failed          3m58s                 kubelet, worker-01  Error: sandbox container "4c9449c09877664bcebb21b0be7f7d1074b2069071c88296ad7dd193cf986f18" is not running
  Warning  Failed          3m58s                 kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task 4c9449c09877664bcebb21b0be7f7d1074b2069071c88296ad7dd193cf986f18 not found: not found
  Warning  BackOff         3m56s                 kubelet, worker-01  Back-off restarting failed container
  Normal   Pulled          3m56s (x3 over 4m)    kubelet, worker-01  Container image "carlosedp/kube-rbac-proxy:v0.5.0" already present on machine
  Normal   Created         3m56s                 kubelet, worker-01  Created container kube-rbac-proxy-main
  Warning  Failed          3m56s                 kubelet, worker-01  Error: sandbox container "73acb31399df06c63f892a58a23ccc68ff9392f55184c28bdf6b03925161adbd" is not running
  Normal   Pulled          3m56s (x3 over 4m)    kubelet, worker-01  Container image "carlosedp/kube-rbac-proxy:v0.5.0" already present on machine
  Warning  Failed          3m56s                 kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task 73acb31399df06c63f892a58a23ccc68ff9392f55184c28bdf6b03925161adbd not found: not found
  Normal   SandboxChanged  3m55s (x3 over 4m)    kubelet, worker-01  Pod sandbox changed, it will be killed and re-created.
root@main:~/cluster-monitoring# kubectl describe pod -n monitoring prometheus-operator-67586fc88-dlfbt
Events:
  Type     Reason          Age                    From                Message
  ----     ------          ----                   ----                -------
  Normal   Scheduled       <unknown>              default-scheduler   Successfully assigned monitoring/prometheus-operator-67586fc88-dlfbt to worker-01
  Warning  Failed          5m25s                  kubelet, worker-01  Error: sandbox container "9e6b794cfb528efdd28ca2c43de3db000ca3666891358bed726d4ddab3551662" is not running
  Warning  Failed          5m25s                  kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task 9e6b794cfb528efdd28ca2c43de3db000ca3666891358bed726d4ddab3551662 not found: not found
  Normal   Pulled          5m24s (x2 over 5m25s)  kubelet, worker-01  Container image "carlosedp/prometheus-operator:v0.40.0" already present on machine
  Normal   Created         5m24s (x2 over 5m25s)  kubelet, worker-01  Created container prometheus-operator
  Warning  Failed          5m24s                  kubelet, worker-01  Error: sandbox container "74fda8c4bf09286f857d695335e6818ef8694453ccec55743f3d124071ef74d0" is not running
  Warning  Failed          5m24s                  kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task 74fda8c4bf09286f857d695335e6818ef8694453ccec55743f3d124071ef74d0 not found: not found
  Warning  Failed          5m20s                  kubelet, worker-01  Error: failed to create containerd task: OCI runtime create failed: container_linux.go:341: creating new parent process caused "container_linux.go:1923: running lstat on namespace path \"/proc/1062821/ns/ipc\" caused \"lstat /proc/1062821/ns/ipc: no such file or directory\"": unknown
  Normal   Pulled          5m18s (x4 over 5m25s)  kubelet, worker-01  Container image "carlosedp/kube-rbac-proxy:v0.5.0" already present on machine
  Normal   Created         5m18s (x2 over 5m20s)  kubelet, worker-01  Created container kube-rbac-proxy
  Warning  Failed          5m18s                  kubelet, worker-01  Error: sandbox container "55e4c32d4561226673a78af5ea4023544bada8e491924545ffa87ff5d9ef0799" is not running
  Warning  BackOff         5m16s                  kubelet, worker-01  Back-off restarting failed container
  Warning  BackOff         5m16s (x3 over 5m20s)  kubelet, worker-01  Back-off restarting failed container
  Normal   SandboxChanged  22s (x61 over 5m25s)   kubelet, worker-01  Pod sandbox changed, it will be killed and re-created.
root@main:~/cluster-monitoring# kubectl describe pod -n monitoring grafana-655d666bbb-lt7xv
Events:
  Type     Reason       Age                     From                Message
  ----     ------       ----                    ----                -------
  Normal   Scheduled    <unknown>               default-scheduler   Successfully assigned monitoring/grafana-655d666bbb-lt7xv to worker-01
  Warning  FailedMount  5m45s                   kubelet, worker-01  MountVolume.SetUp failed for volume "grafana-dashboard-controller-manager" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  5m45s                   kubelet, worker-01  MountVolume.SetUp failed for volume "grafana-dashboard-traefik-dashboard" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  5m45s                   kubelet, worker-01  MountVolume.SetUp failed for volume "grafana-dashboard-prometheus-dashboard" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  5m45s                   kubelet, worker-01  MountVolume.SetUp failed for volume "grafana-dashboard-statefulset" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  5m45s                   kubelet, worker-01  MountVolume.SetUp failed for volume "grafana-dashboard-persistentvolumesusage" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  5m45s                   kubelet, worker-01  MountVolume.SetUp failed for volume "grafana-dashboard-prometheus" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  5m45s                   kubelet, worker-01  MountVolume.SetUp failed for volume "grafana-dashboard-k8s-resources-cluster" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  5m45s                   kubelet, worker-01  MountVolume.SetUp failed for volume "grafana-dashboard-k8s-resources-workload" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  5m44s (x2 over 5m45s)   kubelet, worker-01  MountVolume.SetUp failed for volume "grafana-dashboard-apiserver" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  5m44s                   kubelet, worker-01  MountVolume.SetUp failed for volume "grafana-dashboard-nodes" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  5m44s                   kubelet, worker-01  MountVolume.SetUp failed for volume "grafana-dashboard-k8s-resources-workloads-namespace" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  5m42s (x12 over 5m44s)  kubelet, worker-01  (combined from similar events): MountVolume.SetUp failed for volume "grafana-dashboard-k8s-resources-pod" : failed to sync configmap cache: timed out waiting for the condition
  Normal   Pulled       5m41s                   kubelet, worker-01  Container image "grafana/grafana:7.0.3" already present on machine
  Warning  BackOff      44s (x61 over 5m35s)    kubelet, worker-01  Back-off restarting failed container
root@main:~/cluster-monitoring# kubectl describe pod -n monitoring node-exporter-7fg5z
Events:
  Type     Reason          Age                    From                Message
  ----     ------          ----                   ----                -------
  Normal   Scheduled       <unknown>              default-scheduler   Successfully assigned monitoring/node-exporter-7fg5z to worker-01
  Warning  FailedMount     6m15s (x2 over 6m16s)  kubelet, worker-01  MountVolume.SetUp failed for volume "node-exporter-token-ntcrf" : failed to sync secret cache: timed out waiting for the condition
  Warning  Failed          6m12s                  kubelet, worker-01  Error: sandbox container "5a48a1bacbd4974d3470302bce815a842e08dc9906589a8fcc2b2e025766891c" is not running
  Warning  Failed          6m12s                  kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task 5a48a1bacbd4974d3470302bce815a842e08dc9906589a8fcc2b2e025766891c not found: not found
  Normal   Pulled          6m10s (x2 over 6m13s)  kubelet, worker-01  Container image "prom/node-exporter:v0.18.1" already present on machine
  Normal   Created         6m10s (x2 over 6m12s)  kubelet, worker-01  Created container node-exporter
  Warning  Failed          6m10s                  kubelet, worker-01  Error: sandbox container "6bc941208b1c99a891c941ec7bb21bb5b3c3419758eba92b0d7e3d33e2d6536a" is not running
  Warning  Failed          6m10s                  kubelet, worker-01  Error: failed to get sandbox container task: no running task found: task 6bc941208b1c99a891c941ec7bb21bb5b3c3419758eba92b0d7e3d33e2d6536a not found: not found
  Warning  Failed          6m8s                   kubelet, worker-01  Error: failed to create containerd task: OCI runtime create failed: container_linux.go:341: creating new parent process caused "container_linux.go:1923: running lstat on namespace path \"/proc/1074283/ns/ipc\" caused \"lstat /proc/1074283/ns/ipc: no such file or directory\"": unknown
  Normal   Pulled          6m7s (x4 over 6m12s)   kubelet, worker-01  Container image "carlosedp/kube-rbac-proxy:v0.5.0" already present on machine
  Normal   Created         6m7s (x2 over 6m9s)    kubelet, worker-01  Created container kube-rbac-proxy
  Warning  Failed          6m6s                   kubelet, worker-01  Error: failed to create containerd task: OCI runtime create failed: container_linux.go:341: creating new parent process caused "container_linux.go:1923: running lstat on namespace path \"/proc/1075142/ns/ipc\" caused \"lstat /proc/1075142/ns/ipc: no such file or directory\"": unknown
  Normal   SandboxChanged  6m6s (x4 over 6m12s)   kubelet, worker-01  Pod sandbox changed, it will be killed and re-created.
  Warning  BackOff         6m5s (x3 over 6m9s)    kubelet, worker-01  Back-off restarting failed container
  Warning  BackOff         74s (x211 over 6m5s)   kubelet, worker-01  Back-off restarting failed container
simonspg commented 3 years ago

I just went through a power outage and would like to enable the APCUPSd exporter. However, I don't see a deployment. Is this designed to work with a SmartUPS and you are querying the UPS directly? If you are using apcupsd exporter, where does it run?

MovieMaker93 commented 3 years ago

Hi, I have some problems with a cluster of Raspberry Pis running k3s: I am not able to reach the nodes from outside. I used the built-in k3s Traefik and CoreDNS, so why am I not able to reach, for example, a simple hello-world pod from outside? Any suggestions? I even tried to install this monitoring stack, but the address shown in the ingress didn't work. This is my k3s cluster configuration (screenshots attached); is something wrong?

ToMe25 commented 3 years ago

I have recently deployed this to a k3s 1.20.0+k3s2 cluster with 3 Raspberry Pis. The only two enabled exporters are the armExporter and the traefikExporter.

The most important metrics all show up (Cluster Memory Usage, Cluster CPU Usage, Memory Usage, CPU Usage, CPU Temperature and a few more), however some others do not get any values. The ones not working are Pod CPU Usage, Pod Memory Usage, Sent Network Traffic per Container, Received Network Traffic per Container, and Pod Network I/O. They all show "No data".

I have looked at all the logs, and the only one that seemed like it might say anything other than "everything's OK" was that of one of the node-exporter containers, but even that one looks pretty normal. Here it is:

kubectl logs -f -n monitoring node-exporter-qkxzr -c node-exporter
time="2020-12-27T11:46:51Z" level=info msg="Starting node_exporter (version=0.18.1, branch=HEAD, revision=3db77732e925c08f675d7404a8c46466b2ece83e)" source="node_exporter.go:156"
time="2020-12-27T11:46:51Z" level=info msg="Build context (go=go1.12.5, user=root@b50852a1acba, date=20190604-16:43:22)" source="node_exporter.go:157"
time="2020-12-27T11:46:51Z" level=info msg="Enabled collectors:" source="node_exporter.go:97"
time="2020-12-27T11:46:51Z" level=info msg=" - arp" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - bcache" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - bonding" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - conntrack" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - cpu" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - cpufreq" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - diskstats" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - edac" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - entropy" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - filefd" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - filesystem" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - infiniband" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - ipvs" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - loadavg" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - mdadm" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - meminfo" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - netclass" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - netdev" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - netstat" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - nfs" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - nfsd" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - pressure" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - sockstat" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - stat" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - textfile" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - time" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - timex" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - uname" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - vmstat" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - xfs" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg=" - zfs" source="node_exporter.go:104"
time="2020-12-27T11:46:51Z" level=info msg="Listening on 127.0.0.1:9100" source="node_exporter.go:170"

I have tried running make deploy again, but that didn't change anything. All the pods are shown as RUNNING.

These two targets are shown as DOWN, however k3s doesn't show any pods with names like that (see attached screenshot). I also noticed some time later that the title is monitoring/kube-*, while the labels say these pods should be in the kube-system namespace.

Edit: I just tried running make vendor, make, and make deploy again, but this also didn't change anything. These and the ingresses seem to be the only things that showed "configured" rather than "unchanged":

customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com configured
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com configured
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com configured
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com configured
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com configured
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com configured

Also, the monitoring namespace has a pod named prometheus-k8s-0 even though it's k3s, but I am not sure whether that's an issue.

skhraashid commented 3 years ago

I solved this issue. Thanks for the advice; in the end I just installed nginx and configured it, and after that I was able to access Prometheus and Grafana. Thanks a lot!

Hi, can you please provide the detailed steps to resolve this same issue? Thanks.

ToMe25 commented 3 years ago

Is my issue maybe related to this?

Kube-apiserver: the componentstatus API is deprecated. This API provided status of etcd, kube-scheduler, and kube-controller-manager components, but only worked when those components were local to the API server, and when kube-scheduler and kube-controller-manager exposed unsecured health endpoints. Instead of this API, etcd health is included in the kube-apiserver health check and kube-scheduler/kube-controller-manager health checks can be made directly against those components' health endpoints. (#93570, @liggitt) [SIG API Machinery, Apps and Cluster Lifecycle] - https://kubernetes.io/docs/setup/release/notes/#deprecation-5

ToMe25 commented 3 years ago

The node exporters are the only pods that run on the local IP of the device they run on (192.168.X.X) instead of a cluster-internal IP (10.42.X.X).

Edit: my ingress problem solved itself after another restart. Also, here is the result of kubectl get pods --all-namespaces -o wide:

NAMESPACE     NAME                                      READY   STATUS      RESTARTS   AGE     IP              NODE                  NOMINATED NODE   READINESS GATES
kube-system   helm-install-traefik-ld2nr                0/1     Completed   0          14h     10.42.0.2       raspberrypi-master    <none>           <none>
kube-system   svclb-traefik-8pr5w                       2/2     Running     4          14h     10.42.2.6       raspberrypi-worker1   <none>           <none>
monitoring    node-exporter-cnzrd                       2/2     Running     4          14h     192.168.2.104   raspberrypi-worker1   <none>           <none>
monitoring    arm-exporter-hlgpg                        2/2     Running     4          14h     10.42.2.7       raspberrypi-worker1   <none>           <none>
kube-system   metrics-server-86cbb8457f-hvckl           1/1     Running     2          14h     10.42.0.22      raspberrypi-master    <none>           <none>
kube-system   local-path-provisioner-7c458769fb-f7fn6   1/1     Running     2          14h     10.42.0.21      raspberrypi-master    <none>           <none>
monitoring    node-exporter-b24st                       2/2     Running     4          14h     192.168.2.106   raspberrypi-master    <none>           <none>
kube-system   svclb-traefik-7754c                       2/2     Running     4          14h     10.42.0.23      raspberrypi-master    <none>           <none>
kube-system   coredns-854c77959c-v67jt                  1/1     Running     2          14h     10.42.0.19      raspberrypi-master    <none>           <none>
monitoring    arm-exporter-c28hq                        2/2     Running     4          14h     10.42.0.20      raspberrypi-master    <none>           <none>
kube-system   svclb-traefik-lf4ck                       2/2     Running     4          14h     10.42.1.21      raspberrypi-worker2   <none>           <none>
monitoring    node-exporter-w56w2                       2/2     Running     4          14h     192.168.2.121   raspberrypi-worker2   <none>           <none>
monitoring    arm-exporter-q7wx2                        2/2     Running     4          14h     10.42.1.23      raspberrypi-worker2   <none>           <none>
monitoring    prometheus-operator-67755f959-zghjt       2/2     Running     4          14h     10.42.1.22      raspberrypi-worker2   <none>           <none>
monitoring    kube-state-metrics-6cb6df5d4-dvw4k        3/3     Running     6          14h     10.42.1.26      raspberrypi-worker2   <none>           <none>
kube-system   traefik-6f9cbd9bd4-ldv68                  1/1     Running     2          14h     10.42.1.20      raspberrypi-worker2   <none>           <none>
monitoring    alertmanager-main-0                       2/2     Running     4          14h     10.42.1.24      raspberrypi-worker2   <none>           <none>
monitoring    prometheus-adapter-585b57857b-lp6vl       1/1     Running     2          14h     10.42.1.27      raspberrypi-worker2   <none>           <none>
monitoring    grafana-7cccfc9b5f-bd4sh                  1/1     Running     2          14h     10.42.1.25      raspberrypi-worker2   <none>           <none>
monitoring    prometheus-k8s-0                          3/3     Running     0          2m40s   10.42.1.28      raspberrypi-worker2   <none>           <none>
carlosedp commented 3 years ago

@carlosedp - Thanks. So people run their K8s clusters without any firewalls on the servers? That's an interesting paradigm shift, indeed! 60% of the time, it works every time.

On 28 Oct 2020, at 19:27, Carlos Eduardo wrote: @jjo93sa usually on Kubernetes clusters, we don't set IPTables rules so they don't mess with Kubernetes rules and block required ports.

Between the nodes, yes. Protect the cluster itself from the outside, not the traffic between cluster nodes.

carlosedp commented 3 years ago

@carlosedp I've run into an issue where the ingresses often change location, and I can no longer load the Grafana page when that happens:

[2020-10-31 09:17:31+2 ✘][~]
[james@tpin1][$ sudo kubectl get ingress -o wide -n monitoring
NAME                CLASS    HOSTS                             ADDRESS       PORTS     AGE
grafana             <none>   grafana.10.10.50.24.nip.io        10.10.50.23   80, 443   3d14h
alertmanager-main   <none>   alertmanager.10.10.50.24.nip.io   10.10.50.23   80, 443   3d14h
prometheus-k8s      <none>   prometheus.10.10.50.24.nip.io     10.10.50.23   80, 443   3d14h

I tried to run the make target to update the ingress suffix, but wasn't quite sure if that was the right command to fix the problem?

@jjo93sa you need to update the config and re-apply the ingress manifests.

carlosedp commented 3 years ago

Carlos and team, I have added a Prometheus snmp-exporter to this awesome deployment, in the same monitoring cluster, in order to scrape SNMP-managed Cisco devices. I deployed the snmp-exporter as you suggested, via Helm charts, in the same monitoring namespace. What I need assistance with in my particular situation is: how can I update the prometheus.yml file to start scraping these SNMP-managed devices, and how can I add the snmp.yml into the prometheus-k8s-0 container? I appreciate the support.

@robmit68 you need to create a custom ServiceMonitor with the prometheus config pointing to the SNMP exporter and targeting your device. It's something like https://github.com/carlosedp/ddwrt-monitoring/blob/02f3013c3acb80ec048c77469f64f76c5e2406e3/prometheus/prometheus.yml#L37
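A minimal ServiceMonitor sketch for that setup (the labels, port name, SNMP module and target IP below are assumptions and must match the Service created by the snmp-exporter Helm chart):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: snmp-exporter
  namespace: monitoring
  labels:
    k8s-app: snmp-exporter
spec:
  selector:
    matchLabels:
      app: prometheus-snmp-exporter
  endpoints:
    - port: http
      path: /snmp
      interval: 60s
      params:
        module: [if_mib]
        target: ['192.168.1.1']

With the target fixed in params, each device needs its own endpoint entry (or relabeling). The snmp.yml module definitions live with the exporter itself (supplied through the Helm chart's values), so nothing needs to be added inside the prometheus-k8s-0 container.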

carlosedp commented 3 years ago

I just went through a power outage and would like to enable the APCUPSd exporter. However, I don't see a deployment. Is this designed to work with a SmartUPS and you are querying the UPS directly? If you are using apcupsd exporter, where does it run?

@simonspg In this deployment, apcupsd and the exporter run on an external host and Prometheus just queries it.

You could also run the ups-exporter in the cluster, querying the apcupsd host (which also has the UPS attached), but that would require adding the deployment in modules/ups_exporter.jsonnet.
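If you want the in-cluster Prometheus to scrape an external apcupsd exporter directly, a plain scrape job added through the additionalScrapeConfigs mechanism sketched earlier in this thread is enough; the host and port below are assumptions (the common apcupsd exporter listens on :9162 by default):

- job_name: apcupsd
  static_configs:
    - targets: ['192.168.1.10:9162']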

carlosedp commented 3 years ago

Hi, I have some problems with a cluster of Raspberry Pis running k3s: I am not able to reach the nodes from outside. I used the built-in k3s Traefik and CoreDNS, so why am I not able to reach, for example, a simple hello-world pod from outside? Any suggestions? I even tried to install this monitoring stack, but the address shown in the ingress didn't work. This is my k3s cluster configuration (screenshots attached); is something wrong?

@MovieMaker93 this is a question about K3s, not the monitoring stack, sorry. Check the installation steps and your hosts, start small, and then deploy the monitoring.

carlosedp commented 3 years ago

To all: in a couple of weeks I'll check this stack on fresh Kubernetes and K3s clusters to see if any updates are required. I will also update libraries, dependencies and dashboards.

simonspg commented 3 years ago

Outstanding! Thank you for clarifying where APCUPSd is running; it all makes sense now. Another question: I would like to add to and modify the supplied Grafana dashboards, but it seems that is not possible due to the way they are stored. How hard would it be to move the dashboard storage to a PVC? And one more question: I see timeouts on database access. I have MariaDB in a container for another project. How hard would it be to move to a MariaDB backend? THANK YOU for this fabulous monitoring solution!

Gooseman42 commented 3 years ago

Hi Carlos, thanks for a great piece of software! I found it via Jeff Geerling's blog and installed it on my TuringPi (K3s) with 7 CM3s. That said, with only your package installed I have rather high memory utilization (cluster memory usage 41%) on two nodes (700-800 MB), with prometheus-k8s-0 eating close to 500 MB. Is this normal, or did I miss a setting somewhere? I also constantly get CPUThrottlingHigh alerts.

thorsten-l commented 3 years ago

Hi Carlos, thanks for a great piece of software! I found it via Jeff Geerling's blog and installed it on my TuringPi (K3s) with 7 CM3s. That said, with only your package installed I have rather high memory utilization (cluster memory usage 41%) on two nodes (700-800 MB), with prometheus-k8s-0 eating close to 500 MB. Is this normal, or did I miss a setting somewhere? I also constantly get CPUThrottlingHigh alerts.

I have exactly the same configuration, with the same issues.

carlosedp commented 3 years ago

@Gooseman42 and @thorsten-l, monitoring infrastructure is not a "cheap" job; it requires processing and memory resources proportional to the number of nodes, events and the multitude of parameters collected from the cluster.

Although the stack works fine on Raspberry Pi nodes, it's meant to monitor anything from single-node to multi-node production clusters. It's not optimized (and was never meant) to run on small board computers.

Gooseman42 commented 3 years ago

All good, thanks for the clarification. I just wanted to make sure it wasn't due to some configuration issue.

Gory19 commented 3 years ago

(quoting ToMe25's comment above about the k3s 1.20.0+k3s2 cluster where the Pod CPU Usage, Pod Memory Usage and per-container network panels show "No data")

I have the same problem. How did you manage to solve it?

ToMe25 commented 3 years ago

I did not manage to solve it yet. I hope that the dependency update will fix this, whenever that happens.

exArax commented 3 years ago

Hello,

Is there a way to declare the admin password and some groups of users with their roles/permissions in the Grafana deployment file?
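For reference, Grafana itself reads the admin credentials from environment variables, so one option might be adding something like this to the grafana container in the Deployment (the Secret name here is hypothetical). Organization users and their roles are not configurable through env vars and usually need Grafana provisioning or the HTTP API instead:

env:
  - name: GF_SECURITY_ADMIN_USER
    value: admin
  - name: GF_SECURITY_ADMIN_PASSWORD
    valueFrom:
      secretKeyRef:
        name: grafana-admin-credentials
        key: password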

Northwood128 commented 3 years ago

(quoting the same ToMe25 comment, plus Gory19's reply "I have the same problem. How did you manage to solve it?")

+1

exArax commented 3 years ago

When I run kubectl apply on manifests and manifests/setup, I get this error for kube-state-metrics and the prometheus-operator:

kubectl logs -n monitoring prometheus-operator-67755f959-rctgl kube-rbac-proxy
I0318 14:07:13.720068 1 main.go:186] Valid token audiences:
I0318 14:07:13.720148 1 main.go:232] Generating self signed cert as no cert is provided
runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
fatal error: mlock failed

ToMe25 commented 3 years ago

To all: in a couple of weeks I'll check this stack on fresh Kubernetes and K3s clusters to see if any updates are required. I will also update libraries, dependencies and dashboards.

This was now 2⅔ months ago; is there a rough estimate as to when this update will take place?

ToMe25 commented 3 years ago

Updating k3s from 1.20.0+k3s2 to 1.20.4+k3s1 solved the graphs showing "No Data" for me. The two targets I mentioned are still shown as DOWN, however this doesn't seem to affect much. Also, for some reason Prometheus is now using 2 GB of RAM, which is an issue on a Raspberry Pi, but I hope it only does so while cleaning up and stops later. @Gory19 @Northwood128 @alexwilliams0712

radicalgeek commented 3 years ago

Firstly, excellent project. Very well done. I don't have a real issue, but I am seeking some guidance. I'm pretty new to all this: I know Docker well, but this is my first stab at Kubernetes and I only put my Pi cluster together a couple of weeks ago. I have everything up and running, have added additional exporters on my router for UPS, network and NAS metrics, and have created Grafana dashboards to view them. I also have additional data sources to monitor my websites. Finally, I have also added the ELK stack with Filebeat and Metricbeat in the monitoring namespace and pointed a few things at each other.

However, I did not set up Grafana or Prometheus with persistent storage, as I did not have any storage set up. I have now purchased an additional Pi, SSDs and networking equipment to set up a SAN as an iSCSI target. I would now like to back up my Grafana data sources and dashboards and redeploy the stack with persistent storage.

Backing up the JSON for the dashboards is easy; data sources less so. What I would like to do, of course, is add the dashboards and data sources to the deployment. It looks like I just need to drop the dashboard JSON I export into grafana-dashboardDefinitions.yaml? Is there a way data sources are injected too? According to https://grafana.com/docs/grafana/latest/administration/provisioning/#datasources I should be able to define them somewhere, but grafana-dashboardDatasources.yaml is a secret, which doesn't seem to be deployed to my cluster (because I don't have persistent storage?). How can I go about capturing my data sources so I don't need to re-create them?

Just seeking a little guidance. And thanks again for the awesome work.

carlosedp commented 3 years ago

Hey @radicalgeek awesome progress on your learning experience.

The first thing, as you mentioned, is to export everything you created (dashboards, datasources, etc.) to avoid losing it.

To have the dashboards deployed by the stack, I'd recommend creating something like the jsonnet modules I defined in the modules dir and adding it to vars.jsonnet. This way, the deployment would add it to your stack.

The datasources need a bit of investigation, but I believe you might need to create another manifest like grafana-dashboardDatasources.yaml. Its content (the datasources.yaml line) is a base64-encoded YAML file; you can see it by piping its contents through base64 -d.
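Decoded, that provisioning file generally looks something like this (the URL and names are illustrative):

apiVersion: 1
datasources:
  - name: prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-k8s.monitoring.svc:9090
    isDefault: true
    editable: false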

Hope it helped.

carlosedp commented 3 years ago

To all: in a couple of weeks I'll check this stack on fresh Kubernetes and K3s clusters to see if any updates are required. I will also update libraries, dependencies and dashboards.

This was now 2⅔ months ago; is there a rough estimate as to when this update will take place?

Sorry to all but I've been pretty busy with other projects and didn't have time (or focus) to update this.

As an open source project, I'd review and take PRs if someone is willing to do it.

exArax commented 3 years ago

I want to make Alertmanager send an alert to a Python webhook. Could you please tell me which YAML I have to edit to add new rules to Alertmanager? Where should I declare the receiver?

carlosedp commented 3 years ago

I want to make Alertmanager send an alert to a Python webhook. Could you please tell me which YAML I have to edit to add new rules to Alertmanager? Where should I declare the receiver?

You have to override the Alertmanager secret that contains the configuration. The webhook config documentation is at https://prometheus.io/docs/alerting/latest/configuration/#webhook_config.

To see how to override, look at the jsonnet sources from the libraries used here in the project.
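For reference, the decoded Alertmanager configuration with a webhook receiver is roughly shaped like this (the webhook URL is hypothetical):

global:
  resolve_timeout: 5m
route:
  receiver: python-webhook
  group_by: ['alertname']
receivers:
  - name: python-webhook
    webhook_configs:
      - url: http://my-python-webhook.monitoring.svc:5000/alerts
        send_resolved: true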

sruckh commented 3 years ago

How do you update the web certificates without tearing everything down and re-installing? I have created a SAN certificate for Grafana, Prometheus, and Alertmanager, and put the server.crt and server.key files in the root base directory with the rest of the files. If I follow the installation instructions for my platform (K3s), everything works as expected.

When I renew my cert, what is the correct method for installing the new cert without having to tear down and re-install everything?
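For context, the ingress TLS material ends up in the cluster as a standard kubernetes.io/tls Secret, so my assumption is that re-applying that Secret with the renewed certificate (under whatever name the generated ingress manifests reference) is enough, without tearing anything down. A rough sketch with placeholder contents:

apiVersion: v1
kind: Secret
metadata:
  name: <name referenced by the generated ingress manifests>
  namespace: monitoring
type: kubernetes.io/tls
data:
  tls.crt: <base64 of the renewed server.crt>
  tls.key: <base64 of the renewed server.key>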

sruckh commented 3 years ago

To all: in a couple of weeks I'll check this stack on fresh Kubernetes and K3s clusters to see if any updates are required. I will also update libraries, dependencies and dashboards.

This was now 2⅔ months ago; is there a rough estimate as to when this update will take place?

I have deployed cluster-monitoring on a raspberry pi 4 cluster running v1.20.7+k3s1 and things appear to be working.

armourshield commented 3 years ago

v1.20.7+k3s1

I have a new setup; it was done following these points:

  1. OS

    Linux k3master1 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  2. Installation of K3s using k3sup

      --k3s-channel stable \
      --ip <ip> \
      --user root \
      --k3s-extra-args '--cluster-init --kube-apiserver-arg enable-aggregator-routing=true --disable traefik --no-deploy traefik --disable metrics-server --no-deploy metrics-server' \
      --datastore postgres://dbconnection

This installs the stable release -- K3s v1.20.7+k3s1.

  3. These are the options being used for K3s:
--cluster-init --kube-apiserver-arg enable-aggregator-routing=true --disable traefik --no-deploy traefik --disable metrics-server --no-deploy metrics-server

Followed the instructions for the manual deploy of the monitoring stack.

Not getting data for many metrics. There was no error during deployment.

CPU usage, memory usage, and other usage-related metrics are not being shown.


EDIT: To make things easier to debug: when I checked the events I found issues with deployments not being able to reach the ServiceAccounts, so I redeployed/restarted the deployments. The pods have started and the logs suggest they are working, but there is still no data for usage.
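
A few generic checks I am using to narrow this down (the namespace and service names are the defaults this stack deploys):

# list the ServiceMonitors Prometheus is expected to scrape
kubectl -n monitoring get servicemonitors
# look for recent failures
kubectl -n monitoring get events --sort-by=.lastTimestamp
# port-forward Prometheus and inspect http://localhost:9090/targets for DOWN entries
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090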

lfduarte2020 commented 3 years ago

Hi,

First of all, thank you for this great project. I've successfully deployed the monitoring stack and now I would like to add another RPi to Prometheus scraping. This RPi is outside the k8s cluster and already has node_exporter installed. How do I add this node to Prometheus?

lfduarte2020 commented 3 years ago

Firstly, thanks for all the work you put into this @carlosedp 👏🏻. Prometheus seems to be running into an error panic: mmap: cannot allocate memory, have you run into this before? Deleting the pod fixes the issue, and I do have memory available. Also - what is the best way to add additional targets? Thanks again

root@pi-master:/home/pi# kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.5+k3s1", GitCommit:"58ebdb2a2ec5318ca40649eb7bd31679cb679f71", GitTreeState:"clean", BuildDate:"2020-05-06T23:42:31Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/arm"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.5+k3s1", GitCommit:"58ebdb2a2ec5318ca40649eb7bd31679cb679f71", GitTreeState:"clean", BuildDate:"2020-05-06T23:42:31Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/arm"}
root@pi-master:/home/pi#
root@pi-master:/home/pi# cat /etc/os-release
PRETTY_NAME="Raspbian GNU/Linux 10 (buster)"
NAME="Raspbian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=raspbian
ID_LIKE=debian
HOME_URL="http://www.raspbian.org/"
SUPPORT_URL="http://www.raspbian.org/RaspbianForums"
BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"
root@pi-master:/home/pi#

Had the same issue. To solve it I moved the OS to a 64-bit version; in my case I went to Ubuntu.

ToMe25 commented 3 years ago

There is also a 64-bit prerelease version of Raspbian if you prefer, but you are probably not going to get around reinstalling if this is actually the issue.

gvonbergen commented 3 years ago

Hi, everything works fine. Thanks a lot for this cool repo!

One question. Where can I add additionalScrapeConfigs?
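
I assume it is something like the upstream Prometheus Operator additionalScrapeConfigs mechanism, sketched below, but I am not sure whether vars.jsonnet exposes it directly (the Secret, file name, and target address are only illustrative):

# extra-scrape-configs.yaml -- plain Prometheus scrape_config entries
- job_name: external-node
  static_configs:
  - targets: ['192.168.1.50:9100']   # placeholder node_exporter address

kubectl -n monitoring create secret generic extra-scrape-configs \
  --from-file=extra-scrape-configs.yaml

The Secret is then referenced from the Prometheus custom resource (prometheus-prometheus.yaml) under spec:

  additionalScrapeConfigs:
    name: extra-scrape-configs
    key: extra-scrape-configs.yaml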

Best, Gregor

GeiserX commented 3 years ago

Thank you for this amazing project.

I've followed the tutorial here: https://kauri.io/#deploy-prometheus-and-grafana-to-monitor-a-kube/186a71b189864b9ebc4ef7c8a9f0a6b5/a

But I've found a fatal error while running make deploy. I disabled the ingress in vars.jsonnet but I still get the same error:

error validating "manifests/ingress-alertmanager.yaml": error validating data: [ValidationError(Ingress.spec.rules[0].http.paths[0].backend): unknown field "serviceName" in io.k8s.api.networking.v1.IngressBackend, ValidationError(Ingress.spec.rules[0].http.paths[0].backend): unknown field "servicePort" in io.k8s.api.networking.v1.IngressBackend]; if you choose to ignore these errors, turn validation off with --validate=false
error validating "manifests/ingress-grafana.yaml": error validating data: [ValidationError(Ingress.spec.rules[0].http.paths[0].backend): unknown field "serviceName" in io.k8s.api.networking.v1.IngressBackend, ValidationError(Ingress.spec.rules[0].http.paths[0].backend): unknown field "servicePort" in io.k8s.api.networking.v1.IngressBackend]; if you choose to ignore these errors, turn validation off with --validate=false
error validating "manifests/ingress-prometheus.yaml": error validating data: [ValidationError(Ingress.spec.rules[0].http.paths[0].backend): unknown field "serviceName" in io.k8s.api.networking.v1.IngressBackend, ValidationError(Ingress.spec.rules[0].http.paths[0].backend): unknown field "servicePort" in io.k8s.api.networking.v1.IngressBackend]; if you choose to ignore these errors, turn validation off with --validate=false

I have K3s version v1.20.7+k3s1. Thanks

lauchokyip commented 3 years ago

@carlosedp Kubernetes maintainers changed Ingress from extensions/v1beta1 to networking.k8s.io/v1. A quick and dirty way is to open the ingress-*.yaml files and change networking.k8s.io/v1 to extensions/v1beta1.

However, after K8s 1.22 is released, this method will fail.

For a long-term fix:

ingress-alertmanager.yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: alertmanager-main
  namespace: monitoring
spec:
  tls:
  - hosts:
    -  alertmanager.192.168.1.15.nip.io
  rules:
  - host:  alertmanager.192.168.1.15.nip.io
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: alertmanager-main
            port:
              name: web

ingress-grafana.yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  tls:
  - hosts:
    - grafana.192.168.1.15.nip.io
  rules:
  - host: grafana.192.168.1.15.nip.io
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana
            port:
              name: http

ingress-prometheus.yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-k8s
  namespace: monitoring
spec:
  tls:
  - hosts:
    - prometheus.192.168.1.15.nip.io
  rules:
  - host: prometheus.192.168.1.15.nip.io
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-k8s
            port: 
              name: web

Make sure you change the host entries to match your own IP or domain.

hugobloem commented 3 years ago

Hi there,

I am learning Kubernetes, so I deployed a small cluster on some Raspberry Pis. However, I cannot reach my Grafana instance (grafana.192.168.1.100.nip.io). I updated the ingress files following the post above, but to no avail.

Does anyone have any suggestions on what to do?

Cheers!

exArax commented 3 years ago

Hi,

I configured K3s with MetalLB and for some reason the ingress now doesn't work. Is there a way to make prometheus-k8s-0 use the hostNetwork: true option? I have added it to the spec section of prometheus-prometheus.yaml but it doesn't seem to work.

Fred0211 commented 3 years ago

Hello all, thank you for making this project, especially for ARM users! I'm learning/running MicroK8s and have managed to get all nodes deployed and running. microk8s.kubectl get ingress --all-namespaces shows that the hosts should be up and running.


However, I'm not able to connect in a browser. I'm aware MicroK8s isn't officially supported, so I'm unsure if it is an issue with this version of Kubernetes. This has happened with and without applying the fixes for the ingress-*.yaml files.

Thank you!

pomcho555 commented 3 years ago

Thank you for this amazing project.

I've followed the tutorial here: https://kauri.io/#deploy-prometheus-and-grafana-to-monitor-a-kube/186a71b189864b9ebc4ef7c8a9f0a6b5/a

But I've found a fatal error while make deploy. I disabled the ingress in the vars.jsonnet but I still get the same error:

error validating "manifests/ingress-alertmanager.yaml": error validating data: [ValidationError(Ingress.spec.rules[0].http.paths[0].backend): unknown field "serviceName" in io.k8s.api.networking.v1.IngressBackend, ValidationError(Ingress.spec.rules[0].http.paths[0].backend): unknown field "servicePort" in io.k8s.api.networking.v1.IngressBackend]; if you choose to ignore these errors, turn validation off with --validate=false
error validating "manifests/ingress-grafana.yaml": error validating data: [ValidationError(Ingress.spec.rules[0].http.paths[0].backend): unknown field "serviceName" in io.k8s.api.networking.v1.IngressBackend, ValidationError(Ingress.spec.rules[0].http.paths[0].backend): unknown field "servicePort" in io.k8s.api.networking.v1.IngressBackend]; if you choose to ignore these errors, turn validation off with --validate=false
error validating "manifests/ingress-prometheus.yaml": error validating data: [ValidationError(Ingress.spec.rules[0].http.paths[0].backend): unknown field "serviceName" in io.k8s.api.networking.v1.IngressBackend, ValidationError(Ingress.spec.rules[0].http.paths[0].backend): unknown field "servicePort" in io.k8s.api.networking.v1.IngressBackend]; if you choose to ignore these errors, turn validation off with --validate=false

I have K3s version v1.20.7+k3s1 Thanks

I had the same error on k3s version v1.21.2+k3s1 (5a67e8dc), go version go1.16.4.

There are 3 master nodes (on EC2 plus a Jetson Nano, arm64) connected over a VPN network, and the rest are Raspberry Pi arm64 nodes.

$sudo kubectl get node
NAME               STATUS     ROLES                  AGE   VERSION
ip-xxx-xxx-xxx-xxx   Ready      control-plane,master   14d   v1.21.1+k3s1
ip-yyy-yyy-yyy-yyy    Ready      control-plane,master   14d   v1.21.1+k3s1
pi4-node2          Ready      <none>                 33m   v1.21.2+k3s1
jetson-master      Ready      control-plane,master   14d   v1.21.2+k3s1
pi4-node1          Ready      <none>                 37m   v1.21.2+k3s1

Thanks

onedr0p commented 3 years ago

I don't see how this can work with later versions of k3s, since they disabled metrics listening on any interface other than 127.0.0.1.

https://github.com/k3s-io/k3s/issues/425 https://github.com/k3s-io/k3s/commit/4808c4e7d53db310fb324b2157386e50ebef5167
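
The workaround I have seen (flag availability depends on the K3s version, and --etcd-expose-metrics only matters with embedded etcd) is to pass the bind-address arguments back through K3s so the control-plane components expose metrics again:

k3s server \
  --kube-controller-manager-arg bind-address=0.0.0.0 \
  --kube-scheduler-arg bind-address=0.0.0.0 \
  --kube-proxy-arg metrics-bind-address=0.0.0.0 \
  --etcd-expose-metrics=true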