carlosedp / cluster-monitoring

Cluster monitoring stack for clusters based on Prometheus Operator
MIT License
740 stars 201 forks

For questions, doubts, or guidance, please use Discussions. Don't open a new Issue. #91

Closed carlosedp closed 2 years ago

carlosedp commented 3 years ago

Since I don't have the time or resources to address every question about the deployments, the Issues section is reserved for reporting problems or improvements to the stack.

This issue is a place where you can add a comment with your question, and I or any community member can answer it on a best-effort basis.

If you deployed the monitoring stack and some targets are not available or show no metrics in Grafana, make sure you didn't have IPTables rules or a firewall on your nodes before deploying Kubernetes.

If you don't want to receive further notifications, click "Unsubscribe" in the right sidebar, just above the participants list.

YushchenkoAndrew commented 3 years ago

I ran into an issue where I couldn't open the Grafana and Prometheus applications (e.g. https://grafana.192.168.0.106.nip.io):

 $ curl http://prometheus.192.168.0.106.nip.io
 curl: (7) Failed to connect to prometheus.192.168.0.106.nip.io port 80: Connection refused
 $ curl https://prometheus.192.168.0.106.nip.io
 curl: (7) Failed to connect to prometheus.192.168.0.106.nip.io port 443: Connection refused

In the browser I got the same error: "Unable to connect".

I'm using k3s, and I configured my master IP address as 192.168.0.106; it's the local IP address of one of my worker nodes.

I managed to deploy all pods successfully, but I don't know how to connect to the applications:

 $ kubectl get ingress -n monitoring
 NAME                CLASS    HOSTS                               ADDRESS   PORTS     AGE
 alertmanager-main   <none>   alertmanager.192.168.0.106.nip.io             80, 443   54s
 grafana             <none>   grafana.192.168.0.106.nip.io                  80, 443   54s
 prometheus-k8s      <none>   prometheus.192.168.0.106.nip.io               80, 443   53s

 $ kubectl get pods -n monitoring
 NAME                                   READY   STATUS    RESTARTS   AGE
 prometheus-operator-6b8868d698-6xlvg   2/2     Running   0          14m
 arm-exporter-wmm6r                     2/2     Running   0          14m
 arm-exporter-67jpd                     2/2     Running   0          14m
 node-exporter-fbltt                    2/2     Running   0          14m
 alertmanager-main-0                    2/2     Running   0          14m
 arm-exporter-zhd5m                     2/2     Running   0          14m
 node-exporter-pzz6z                    2/2     Running   0          14m
 node-exporter-74fwt                    2/2     Running   0          14m
 grafana-7466bcc7c5-4hvpj               1/1     Running   0          14m
 kube-state-metrics-96bf99844-g9ssn     3/3     Running   0          14m
 prometheus-adapter-f78c4f4ff-kccbq     1/1     Running   0          14m
 prometheus-k8s-0                       3/3     Running   0          14m

Do you have any suggestions?

carlosedp commented 3 years ago

You need to troubleshoot access to your k3s cluster's ingress, which bridges outside HTTP/HTTPS traffic to the pods.

Here is a reference: https://rancher.com/docs/k3s/latest/en/networking/

Have you deployed any application that uses HTTP (like NGINX or Apache) and been able to access it from your computer? Accessing Prometheus, Grafana, and Alertmanager works the same way.
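
For reference, a minimal sketch of such a test: a single-replica NGINX Deployment, Service, and Ingress. The names and the nip.io host are hypothetical; point the host at one of your node IPs. If this page loads, the ingress controller works and the monitoring ingresses should behave the same way.

 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: hello-nginx        # hypothetical name
   namespace: default
 spec:
   replicas: 1
   selector:
     matchLabels:
       app: hello-nginx
   template:
     metadata:
       labels:
         app: hello-nginx
     spec:
       containers:
         - name: nginx
           image: nginx:stable
           ports:
             - containerPort: 80
 ---
 apiVersion: v1
 kind: Service
 metadata:
   name: hello-nginx
   namespace: default
 spec:
   selector:
     app: hello-nginx
   ports:
     - port: 80
 ---
 apiVersion: networking.k8s.io/v1beta1   # the Ingress API current for the k3s versions in this thread
 kind: Ingress
 metadata:
   name: hello-nginx
   namespace: default
 spec:
   rules:
     - host: hello.192.168.0.106.nip.io   # hypothetical host; use one of your node IPs
       http:
         paths:
           - backend:
               serviceName: hello-nginx
               servicePort: 80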

YushchenkoAndrew commented 3 years ago

Yes, I created my own blog site in JS, but I didn't use an Ingress; I configured an externalIP on the Service. So... I will try to troubleshoot this issue. Thanks for the reply!

YushchenkoAndrew commented 3 years ago

I solved this issue. Thanks for the advice. In the end I just installed NGINX and configured it, and after that I was able to access Prometheus and Grafana. Thanks a lot!

johnfried commented 3 years ago

Love this project! I am unable to access prometheus.*.nip.io, though I can access both Grafana and Alertmanager. My ingress shows Prometheus and is set up correctly. The one odd thing is that when I look at all my pods in the monitoring namespace, I do not have prometheus-k8s (or something along those lines that I have seen in videos). The pods I have are the Prometheus adapter and operator. I re-ran make vendor and deployed again: same thing, and no errors anywhere. Also, prometheus-k8s does have a service, as I just checked. Does this make any sense? TIA

exArax commented 3 years ago

Is there a way to deploy the Grafana and Prometheus pods to the master node only? Sometimes they are deployed to workers.

carlosedp commented 3 years ago

@exArax You need to set your master nodes as schedulable. Even then, Kubernetes may still place the pods on other nodes; if you need to pin them to a specific set of nodes, you need node affinity.

carlosedp commented 3 years ago

Love this project! I am unable to access prometheus.*.nip.io, though I can access both Grafana and Alertmanager. My ingress shows Prometheus and is set up correctly. The one odd thing is that when I look at all my pods in the monitoring namespace, I do not have prometheus-k8s (or something along those lines that I have seen in videos). The pods I have are the Prometheus adapter and operator. I re-ran make vendor and deployed again: same thing, and no errors anywhere. Also, prometheus-k8s does have a service, as I just checked. Does this make any sense? TIA

That doesn't make much sense, since the pods are created by the operator. Re-check your cluster and re-deploy the stack.

johnfried commented 3 years ago

I redeployed and all is well, thank you

exArax commented 3 years ago

@exArax You need to set your master nodes as schedulable. Even then, Kubernetes may still place the pods on other nodes; if you need to pin them to a specific set of nodes, you need node affinity.

In the case of Grafana, I have to add the node affinity to the grafana-deployment.yaml inside the manifests folder, right?

ClauNav commented 3 years ago

Hello Carlos, I have the same issue as YushchenkoAndrew. I'm a noob at Kubernetes (I built this cluster to learn about it). Screenshot from 2020-09-23 00-13-04

The same issue on Alertmanager/Prometheus.

Could you please help me?

Thanks.

carlosedp commented 3 years ago

@exArax You need to set your master nodes as schedulable. Even then, Kubernetes may still place the pods on other nodes; if you need to pin them to a specific set of nodes, you need node affinity.

In the case of Grafana, I have to add the node affinity to the grafana-deployment.yaml inside the manifests folder, right?

Yes, since the jsonnet code doesn't generate the node affinity for this.
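
For illustration, a minimal sketch of such an affinity block, added under spec.template.spec in manifests/grafana-deployment.yaml; it assumes the master carries the node-role.kubernetes.io/master label (which k3s sets on server nodes):

 affinity:
   nodeAffinity:
     requiredDuringSchedulingIgnoredDuringExecution:
       nodeSelectorTerms:
         - matchExpressions:
             - key: node-role.kubernetes.io/master   # assumed label; check with kubectl get nodes --show-labels
               operator: Exists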

carlosedp commented 3 years ago

Hello Carlos, I have the same issue as YushchenkoAndrew. I'm a noob at Kubernetes (I built this cluster to learn about it). Screenshot from 2020-09-23 00-13-04

The same issue on Alertmanager/Prometheus.

Could you please help me?

Thanks.

You need to make sure your Kubernetes cluster has an Ingress controller and can expose applications. Check this first with something like an NGINX pod serving a simple Hello World web page.

Nenad13 commented 3 years ago

Hi Carlos, very cool project indeed. I am running Kubernetes on Ubuntu 20.04.1 (master) and a few Raspberry Pi 4s (nodes) with Raspbian on them. I installed Kubernetes with an Ansible playbook and it works fine. I made all the changes in vars.jsonnet as you suggested. The problem is that after make deploy I am getting this error:

 root@asus:~/cluster-monitoring# make deploy
 echo "Deploying stack setup manifests..."
 Deploying stack setup manifests...
 kubectl apply -f ./manifests/setup/
 The connection to the server localhost:8080 was refused - did you specify the right host or port?
 make: *** [Makefile:37: deploy] Error 1

Do you have any suggestions?

This is the configuration:

 $ kubectl config view
 apiVersion: v1
 clusters:

Thank you in advance!

ClauNav commented 3 years ago

Hello Carlos, I have the same issue as YushchenkoAndrew. I'm a noob at Kubernetes (I built this cluster to learn about it). Screenshot from 2020-09-23 00-13-04 The same issue on Alertmanager/Prometheus. Could you please help me? Thanks.

You need to make sure your Kubernetes cluster has an Ingress controller and can expose applications. Check this first with something like an NGINX pod serving a simple Hello World web page.

Hello Carlos, you're right! Thanks for taking the time to reply to our newbie questions.

riolaf05 commented 3 years ago

Hello, I have some problems with the installation on k3s.

After the deploy operation, not all the services are installed:

image

Also, I am getting this error from the prometheus-adapter container:

image

Do you have any idea what I can do? Thank you.

carlosedp commented 3 years ago

Hello again,

I want to add some authentication and authorization on prometheus.192.168.1.x.nip.io. Is there a way to do something like prometheus.io/docs/guides/tls-encryption on it?

You need an ingress controller that supports authentication. Look at https://github.com/carlosedp/cluster-monitoring/blob/5ead7542d166a0f9b14ca911884a458b69c31951/base_operator_stack.jsonnet#L168. It works with Traefik but might need a couple of changes.

carlosedp commented 3 years ago

Hello, I have some problems with the installation on k3s.

After the deploy operation, not all the services are installed:

image

Also, I am getting this error from the prometheus-adapter container:

image

Do you have any idea what I can do? Thank you.

Sorry, there are so many variables that it's hard to know. Start by deploying a test application, check your node IPs, and so on.

robmit68 commented 3 years ago

Hi Carlos, I have followed the Cluster Monitoring deployment step by step and it is running successfully. I am trying to use the Prometheus generator within the node prometheus.192.168.XXX.XXX.nip.io to generate a Cisco SNMP scrape config, and I am not able to access the node via SSH. How can I access the node to add scrapes/targets to the Prometheus k3s node? I am a newbie in k3s and look forward to your response. Regards

Robe

carlosedp commented 3 years ago

Hi Carlos, I have followed the Cluster Monitoring deployment step by step and it is running successfully. I am trying to use the Prometheus generator within the node prometheus.192.168.XXX.XXX.nip.io to generate a Cisco SNMP scrape config, and I am not able to access the node via SSH. How can I access the node to add scrapes/targets to the Prometheus k3s node? I am a newbie in k3s and look forward to your response. Regards

Robe

To collect metrics from SNMP you need the snmp_exporter. It's out of the scope of this stack, but take a look at another project I have here: https://github.com/carlosedp/ddwrt-monitoring. It's not on Kubernetes, but I use it for SNMP.
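
For context, a plain (non-operator) Prometheus scrape job for snmp_exporter follows the pattern from the exporter's documentation, roughly as sketched below; the device IP, module, and exporter address are placeholders:

 # Sketch of a classic Prometheus scrape job for snmp_exporter;
 # target device IP, module, and exporter address are placeholders.
 - job_name: snmp
   static_configs:
     - targets:
         - 192.168.1.2          # the SNMP device to probe
   metrics_path: /snmp
   params:
     module: [if_mib]
   relabel_configs:
     - source_labels: [__address__]
       target_label: __param_target
     - source_labels: [__param_target]
       target_label: instance
     - target_label: __address__
       replacement: 127.0.0.1:9116   # where snmp_exporter runs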

robmit68 commented 3 years ago

Thank you Carlos

exArax commented 3 years ago

Hello again,

I want to add some authentication on prometheus.192.168.1.x.nip.io. Is there a way to do something like https://prometheus.io/docs/guides/basic-auth/ or https://www.openshift.com/blog/adding-authentication-to-your-kubernetes-web-applications-with-keycloak on prometheus.192.168.1.x.nip.io? I do not know which file I have to edit to add authentication to Prometheus.

carlosedp commented 3 years ago

As I mentioned before, the stack doesn't have anything built in to provide authentication, but you could change the ingresses so that your ingress controller (Traefik, HAProxy, etc.) adds a layer of authentication.

Another option is similar to the post you linked to, but that would require adding the Keycloak sidecar to every pod.
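
As a sketch of the first option, assuming Traefik 1.x (the default in k3s at the time) and a hypothetical secret named prometheus-basic-auth in the monitoring namespace holding htpasswd-formatted user entries, the Prometheus ingress would gain annotations roughly like:

 apiVersion: networking.k8s.io/v1beta1
 kind: Ingress
 metadata:
   name: prometheus-k8s
   namespace: monitoring
   annotations:
     ingress.kubernetes.io/auth-type: "basic"
     ingress.kubernetes.io/auth-secret: "prometheus-basic-auth"   # hypothetical secret name
 spec:
   rules:
     - host: prometheus.192.168.1.x.nip.io
       http:
         paths:
           - backend:
               serviceName: prometheus-k8s
               servicePort: web

Other ingress controllers use their own annotation names, so adapt accordingly.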

justinwagg commented 3 years ago

Firstly, thanks for all the work you put into this @carlosedp 👏🏻. Prometheus seems to be running into an error, panic: mmap: cannot allocate memory; have you run into this before? Deleting the pod fixes the issue, and I do have memory available. Also, what is the best way to add additional targets? Thanks again

root@pi-master:/home/pi# kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.5+k3s1", GitCommit:"58ebdb2a2ec5318ca40649eb7bd31679cb679f71", GitTreeState:"clean", BuildDate:"2020-05-06T23:42:31Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/arm"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.5+k3s1", GitCommit:"58ebdb2a2ec5318ca40649eb7bd31679cb679f71", GitTreeState:"clean", BuildDate:"2020-05-06T23:42:31Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/arm"}
root@pi-master:/home/pi#
root@pi-master:/home/pi# cat /etc/os-release
PRETTY_NAME="Raspbian GNU/Linux 10 (buster)"
NAME="Raspbian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=raspbian
ID_LIKE=debian
HOME_URL="http://www.raspbian.org/"
SUPPORT_URL="http://www.raspbian.org/RaspbianForums"
BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"
root@pi-master:/home/pi#

exArax commented 3 years ago

@carlosedp To change the ingresses, do I have to edit only the ingress-XXXX.yaml files, or are there more files I have to edit?

jontg commented 3 years ago

Hey @carlosedp, I was wondering if you have any interest in seeing Loki ("Prometheus, but for logs") added to this tech stack? I was thinking of taking a stab at it this coming Monday.

thomazBDRI commented 3 years ago

Hey @carlosedp, really, thanks for this stack; I am using it in a few clusters that I have! One question, though: how do I add a new job to Prometheus? I didn't find anything describing the jobs!

urbaned121 commented 3 years ago

Hey @carlosedp, really, thanks for this stack; I am using it in a few clusters that I have! One question, though: how do I add a new job to Prometheus? I didn't find anything describing the jobs!

I came here with the same question... The prometheus-config-reloader pod has a directory /etc/prometheus/config containing a prometheus.yaml.gz file, but I have no idea how to update it to add a new job. I cannot find a ConfigMap related to that file. @carlosedp, any advice? :) Thanks!

urbaned121 commented 3 years ago

Hey @carlosedp, I was wondering if you have any interest in seeing Loki ("Prometheus, but for logs") added to this tech stack? I was thinking of taking a stab at it this coming Monday.

I have already installed Loki via Helm on my k3s cluster and it seems to work. But if you need to modify values separately for Loki and Promtail, I suggest installing them separately as well:

 helm repo add loki https://grafana.github.io/loki/charts
 helm repo update
 helm install -f loki-values.yaml -n monitoring loki loki/loki
 helm install -f promtail-values.yaml -n monitoring promtail loki/promtail

Surprisingly, Loki works well on arm64 🙃

thomazBDRI commented 3 years ago

Hello @urbaned121, I'm really new to Kubernetes and monitoring :smile:, so I kept digging here and found out that there are two ways of describing a Prometheus scrape: one is to define jobs and have Prometheus collect the metrics with the settings we give it, and the other, using the prometheus-operator (which this repo uses), is to define a ServiceMonitor that Prometheus watches and collects metrics from!

Here is the full documentation on how to do this: https://github.com/prometheus-operator/prometheus-operator. In my case, I was using the bitnami/mongodb Helm chart, and its metrics configuration already has a field to enable a ServiceMonitor; now it is working perfectly!

Sorry @carlosedp for such noob questions, but we will keep learning! Maybe add this info to the README? I don't know if it was too obvious. Thanks again for the stack!

polds commented 3 years ago

How do you update everything once it is running? I tried adding a new module, and if I do a make deploy again I get an error that the PVC is immutable and cannot be changed. I don't wish to tear everything down just to add a new module.

exArax commented 3 years ago

Hello,

I have set up basic auth like you suggested, using Traefik. Now I want to add an IP whitelist to ingress-prometheus.yaml so that only the master node of the k3s cluster has access to Prometheus. I found that the way to do this is to add the annotation traefik.ingress.kubernetes.io/whitelist-source-range: "192.168.1.2", which is the IP of my k3s master node, but I get Forbidden when I try to access Prometheus from the master node. Do you have any idea what I am doing wrong?

carlosedp commented 3 years ago

@exArax Yes, only the ingress-* files need changing.

carlosedp commented 3 years ago

Hey @carlosedp, I was wondering if you have any interest in seeing Loki ("Prometheus, but for logs") added to this tech stack? I was thinking of taking a stab at it this coming Monday.

@jontg I'm looking into it, and also into Grafana Tempo, but I don't know if it's related to the "monitoring" stack.

carlosedp commented 3 years ago

@thomazBDRI @urbaned121 To add new jobs, you define a ServiceMonitor pointing to your service. Look into the modules dir, where I have definitions for different collectors, even external ones like the one for a UPS.
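
A minimal sketch of such a ServiceMonitor, for a hypothetical Service labeled app: myapp in the default namespace that exposes a port named metrics:

 apiVersion: monitoring.coreos.com/v1
 kind: ServiceMonitor
 metadata:
   name: myapp              # hypothetical name
   namespace: monitoring
 spec:
   selector:
     matchLabels:
       app: myapp           # must match the target Service's labels
   namespaceSelector:
     matchNames:
       - default
   endpoints:
     - port: metrics        # the Service port *name*, not the number
       interval: 30s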

carlosedp commented 3 years ago

@polds If you only enabled a module, running make and make deploy should work, since the other manifests didn't change.

carlosedp commented 3 years ago

@exArax Why would you want a cluster node to be the only one with access to Prometheus?

exArax commented 3 years ago

@carlosedp I have developed a REST API that performs queries against Prometheus, and I want only this API to have access to the Prometheus endpoints.

carlosedp commented 3 years ago

If the application doing the queries is internal to the cluster, you don't need the ingress exposing Prometheus outside the cluster. Have your application call the Prometheus service directly from inside the cluster, e.g. prometheus-k8s.monitoring.svc.cluster.local.
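
For illustration, an in-cluster client would point at the Service DNS name rather than the nip.io ingress; 9090 is the Prometheus web port in this stack. A sketch of a container spec fragment, with a hypothetical variable name:

 # Fragment of a container spec for a hypothetical in-cluster API client
 env:
   - name: PROMETHEUS_URL   # hypothetical variable the API reads
     value: "http://prometheus-k8s.monitoring.svc.cluster.local:9090"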

jjo93sa commented 3 years ago

@carlosedp Hi, firstly thanks for this great repo. I came across it through @geerlingguy's tutorial on monitoring the Turing Pi cluster. He seemingly had no problems setting it up. I've got everything deployed OK, but apart from temperature, there were no stats in the Grafana dashboard. I checked Prometheus and saw state "DOWN" on all the nodes in my cluster:

Screenshot 2020-10-28 at 13 48 34

I manually opened up TCP/9100 in the IPTables on one node, and the data started flowing. All well and good, but I was surprised by the need to manually open the ports and wanted to check whether I'd missed something, especially given that I had no need to open ports for the ingress access to Grafana, Prometheus, etc. (@geerlingguy didn't report doing the same thing, but he was using HypriotOS, whereas I'm using Ubuntu 20.04.1 along with a standard set of IPTables rules rolled out with Ansible.) I'm new to K8s, but I guess I anticipated the firewall rules being controlled by the cluster?

dicastro commented 3 years ago

@carlosedp First of all, I'd like to thank you for this great repo. I've managed to get everything working on my k3s cluster without any major issue. I got here thanks to @geerlingguy.

Now that I have it working, I am analyzing the code in depth to try to understand everything it does. Diving into the code, I've seen that ksonnet is used widely and that it has been discontinued. Did you realise that? Are you planning to replace that library? Do you think it is worth replacing? Do you know any alternative?

carlosedp commented 3 years ago

@jjo93sa Usually on Kubernetes clusters we don't set IPTables rules, so they don't mess with Kubernetes' own rules or block required ports.

carlosedp commented 3 years ago

@dicastro Yes, ksonnet has been discontinued for a while, but the libraries have been maintained by the prometheus-operator team, and there are talks of migrating them out of the original project.

There is also Tanka from Grafana, which also uses jsonnet as its language but doesn't have all the libraries ksonnet has.

Many teams and projects are using it currently, so it's still not a problem.

jjo93sa commented 3 years ago

@carlosedp - Thanks. So people run their K8s clusters without any firewalls on the servers? That’s an interesting paradigm shift, indeed!

60 % of the time, it works every time


wargfn commented 3 years ago

Has anyone gotten the CPU temperature panel working with the Raspberry Pis? I feel like I am missing a piece of the data puzzle to get this working, like a shell script that needs to be logging to the syslog.

jjo93sa commented 3 years ago

@wargfn I had no problems with the CPU temperature panel on my RPi cluster. In fact, it was the only thing working for a while. Did you modify the vars.jsonnet file to enable the arm_exporter?

    {
      name: 'armExporter',
      enabled: true,
      file: import 'modules/arm_exporter.jsonnet',
    },

jjo93sa commented 3 years ago

@carlosedp I've run into an issue where the ingresses often change address, and I can no longer load the Grafana page when that happens:

[2020-10-31 09:17:31+2 ✘][~]
[james@tpin1]$ sudo kubectl get ingress -o wide -n monitoring
NAME                CLASS    HOSTS                             ADDRESS       PORTS     AGE
grafana             <none>   grafana.10.10.50.24.nip.io        10.10.50.23   80, 443   3d14h
alertmanager-main   <none>   alertmanager.10.10.50.24.nip.io   10.10.50.23   80, 443   3d14h
prometheus-k8s      <none>   prometheus.10.10.50.24.nip.io     10.10.50.23   80, 443   3d14h

I tried running the make target to update the ingress suffix, but I wasn't quite sure whether that was the right command to fix the problem?

rur0 commented 3 years ago

I've run into a problem after a fresh install of cluster-monitoring: some pods do not come online.

$ kubectl get pods -n monitoring
NAME                                  READY   STATUS                 RESTARTS   AGE
alertmanager-main-0                   2/2     Running                0          17m
arm-exporter-2thrp                    2/2     Running                0          18m
arm-exporter-5mwtb                    2/2     Running                0          18m
arm-exporter-87lqv                    2/2     Running                0          18m
arm-exporter-bkfhp                    2/2     Running                0          18m
arm-exporter-g4lx7                    2/2     Running                0          18m
arm-exporter-l8cqn                    0/2     ContainerCreating      0          18m
arm-exporter-qrdsr                    2/2     Running                0          18m
arm-exporter-xwk8k                    2/2     Running                0          18m
grafana-784d46dcb-6bbsr               0/1     CreateContainerError   1          18m
kube-state-metrics-6cb6df5d4-whhnl    3/3     Running                0          18m
node-exporter-4m82x                   2/2     Running                0          18m
node-exporter-6gvnl                   2/2     Running                0          18m
node-exporter-g9kcg                   2/2     Running                0          18m
node-exporter-q9f4r                   2/2     Running                0          18m
node-exporter-qb4tn                   1/2     CreateContainerError   0          18m
node-exporter-r7k7m                   2/2     Running                0          18m
node-exporter-th9w7                   2/2     Running                1          18m
node-exporter-xmzxb                   2/2     Running                0          18m
prometheus-adapter-585b57857b-9mzq8   1/1     Running                0          18m
prometheus-k8s-0                      2/3     Running                1          14m
prometheus-operator-67755f959-8cm5d   2/2     Running                0          18m
$ kubectl describe pod/arm-exporter-l8cqn -n monitoring
Events:
  Type     Reason                  Age                    From               Message
  ----     ------                  ----                   ----               -------
  Normal   Scheduled               17m                    default-scheduler  Successfully assigned monitoring/arm-exporter-l8cqn to node-6
  Warning  FailedCreatePodSandBox  10m (x12 over 13m)     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "arm-exporter-l8cqn_monitoring_8fc2fd8c-8ed4-405e-a331-c39316657e7a_0": name "arm-exporter-l8cqn_monitoring_8fc2fd8c-8ed4-405e-a331-c39316657e7a_0" is reserved for "a6a25d65277ae3272f08583eb53ad261ea48197e477d7872f5b2d4d7806de78c"
  Warning  FailedCreatePodSandBox  6m17s (x2 over 13m)    kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedCreatePodSandBox  3m13s (x14 over 6m4s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "arm-exporter-l8cqn_monitoring_8fc2fd8c-8ed4-405e-a331-c39316657e7a_0": name "arm-exporter-l8cqn_monitoring_8fc2fd8c-8ed4-405e-a331-c39316657e7a_0" is reserved for "1354415904d5f6d8c4344f00bea58ff5d5b0956109954975c5f187ab463ed1db"
$ kubectl describe pod/grafana-784d46dcb-6bbsr -n monitoring
Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    19m                 default-scheduler  Successfully assigned monitoring/grafana-784d46dcb-6bbsr to node-7
  Warning  FailedMount  19m                 kubelet            MountVolume.SetUp failed for volume "grafana-dashboard-pod-total" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  19m                 kubelet            MountVolume.SetUp failed for volume "grafana-dashboard-k8s-resources-node" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  19m                 kubelet            MountVolume.SetUp failed for volume "grafana-dashboard-controller-manager" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  19m                 kubelet            MountVolume.SetUp failed for volume "grafana-dashboard-k8s-resources-workloads-namespace" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  19m                 kubelet            MountVolume.SetUp failed for volume "grafana-dashboard-node-rsrc-use" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  19m                 kubelet            MountVolume.SetUp failed for volume "grafana-dashboard-workload-total" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  19m                 kubelet            MountVolume.SetUp failed for volume "grafana-dashboard-kubernetes-cluster-dashboard" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  19m (x2 over 19m)   kubelet            MountVolume.SetUp failed for volume "grafana-dashboard-namespace-by-workload" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  19m (x2 over 19m)   kubelet            MountVolume.SetUp failed for volume "grafana-dashboard-cluster-total" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  19m                 kubelet            MountVolume.SetUp failed for volume "grafana-dashboard-prometheus-remote-write" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  19m (x10 over 19m)  kubelet            (combined from similar events): MountVolume.SetUp failed for volume "grafana-dashboard-node-cluster-rsrc-use" : failed to sync configmap cache: timed out waiting for the condition
  Normal   Pulling      19m                 kubelet            Pulling image "grafana/grafana:7.0.3"
  Normal   Pulled       6m                  kubelet            Successfully pulled image "grafana/grafana:7.0.3" in 13m3.146093344s
  Warning  Failed       4m                  kubelet            Error: context deadline exceeded
  Warning  Failed       4m                  kubelet            Error: failed to reserve container name "grafana_grafana-784d46dcb-6bbsr_monitoring_b2e1fd49-0136-45b3-a3fe-a867598e523f_0": name "grafana_grafana-784d46dcb-6bbsr_monitoring_b2e1fd49-0136-45b3-a3fe-a867598e523f_0" is reserved for "7a610571de0c37b102bff0410070a5a2df12b0f774562acbef4bc5d60d9131ff"
  Normal   Pulled       3m45s (x2 over 4m)  kubelet            Container image "grafana/grafana:7.0.3" already present on machine
$ kubectl describe pod/node-exporter-qb4tn -n monitoring
Events:
  Type     Reason                  Age                  From               Message
  ----     ------                  ----                 ----               -------
  Normal   Scheduled               19m                  default-scheduler  Successfully assigned monitoring/node-exporter-qb4tn to node-6
  Warning  FailedCreatePodSandBox  15m                  kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedCreatePodSandBox  13m (x11 over 15m)   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "node-exporter-qb4tn_monitoring_26f2b354-22bd-49e9-aea6-19d37eaa3a43_0": name "node-exporter-qb4tn_monitoring_26f2b354-22bd-49e9-aea6-19d37eaa3a43_0" is reserved for "11e23c3316e0e94939f141764a044cee45f977415b7c24ce6ea791d8b825cc0c"
  Normal   Pulled                  13m                  kubelet            Container image "prom/node-exporter:v0.18.1" already present on machine
  Normal   Created                 12m                  kubelet            Created container node-exporter
  Normal   Started                 12m                  kubelet            Started container node-exporter
  Warning  Failed                  10m                  kubelet            Error: context deadline exceeded
  Warning  Failed                  9m36s (x4 over 10m)  kubelet            Error: failed to reserve container name "kube-rbac-proxy_node-exporter-qb4tn_monitoring_26f2b354-22bd-49e9-aea6-19d37eaa3a43_0": name "kube-rbac-proxy_node-exporter-qb4tn_monitoring_26f2b354-22bd-49e9-aea6-19d37eaa3a43_0" is reserved for "72de9646f710919d1e3a5cb0f7505837f499c6b4e76092b3ce9d94be9dcaa15e"
  Normal   Pulled                  51s (x35 over 12m)   kubelet            Container image "carlosedp/kube-rbac-proxy:v0.5.0" already present on machine

I am running k3s version v1.19.3+k3s2 (f8a4547b) on an 8-node RPi4 cluster with HA (3 masters, 5 workers). Has anyone encountered this issue? Thanks.

dicastro commented 3 years ago

Previously I said that I had managed to install everything on my k3s Raspberry Pi cluster, but I've realised that is not true. I am having some issues with kube-scheduler and kube-controller-manager.

First, I've seen that the alarms for kube-scheduler and kube-controller-manager are always firing. Trying to investigate why, I've seen that the kube-scheduler and kube-controller-manager metrics are not being collected.

I've already read issues #13, #20, and #56.

At the beginning, the targets in Prometheus were empty:

prometheus_metrics_failing_01

After re-applying the manifests indicated in one of the previous issues (prometheus-kubeSchedulerPrometheusDiscoveryEndpoints.yaml and prometheus-kubeControllerManagerPrometheusDiscoveryEndpoints.yaml), the targets appeared in Prometheus, but with status DOWN and a "Connection refused" error:

prometheus_metrics_failing

(These two targets are the only ones failing; the rest are working without any issue.)

After some hours the situation reverts and the targets disappear again.

What else can I do/try/check?

polds commented 3 years ago

How can I increase the resource limits of the Grafana deployment? My Grafana containers keep getting killed for consuming too much memory, and I'd like to give them a little more headroom.

I tried to do something similar to #84 but it did nothing to the generated manifests.

This is what I tried:

    grafana+:: {
      local statefulSet = k.apps.v1.statefulSet,
      local container = statefulSet.mixin.spec.template.spec.containersType,
      local resourceRequirements = container.mixin.resourcesTypes,

      spec+:: {
        resources: resourceRequirements.New() +
          resourceRequirements.withRequests({ cpu: '200m', memory: '120Mi' },) +
          resourceRequirements.withLimits({ cpu: '500m', memory: '280Mi' },)
      },