carlosedp / cluster-monitoring

Cluster monitoring stack for clusters based on Prometheus Operator
MIT License

prometheus-k8s-0 CrashLoopBackOff after several days of running #78

Closed — McFuzz89 closed this 4 years ago

McFuzz89 commented 4 years ago

Hello!

Disclaimer: total Kubernetes noob; I followed Jeff Geerling's guide to deploying Prometheus and Grafana on an RPi4 cluster, as outlined in his blog.

When I did my original deployment, I noticed that Grafana was showing no data in the dashboard. I assumed it was my fault since I had been messing around with things, so I did a fresh deployment by deleting the entire monitoring namespace (where everything for cluster monitoring was set up) and re-deploying.

Everything worked fine until several days ago, when it started constantly crashing again in the same way. Please see the pod description below, which also seems to include the error associated with the crash (Exit Code 2). I'd appreciate assistance in resolving this; if any more data is needed, please let me know! Thanks!!

Name:         prometheus-k8s-0
Namespace:    monitoring
Priority:     0
Node:         fuzzykube-worker3/10.33.1.13
Start Time:   Fri, 10 Jul 2020 01:57:33 -0700
Labels:       app=prometheus
              controller-revision-hash=prometheus-k8s-7ffbdcdd76
              prometheus=k8s
              statefulset.kubernetes.io/pod-name=prometheus-k8s-0
Annotations:  <none>
Status:       Running
IP:           10.42.1.40
IPs:
  IP:           10.42.1.40
Controlled By:  StatefulSet/prometheus-k8s
Containers:
  prometheus:
    Container ID:  containerd://6d1e9d6da84eb6fb6fb135886ac4f48e09db5d58e33bf7017b2d8ac58c7b8599
    Image:         prom/prometheus:v2.19.1
    Image ID:      docker.io/prom/prometheus@sha256:efe62fa8804e9fd2612a945b70c630cc27e21b5fb8233ccc8be4cfbe06d26b04
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --storage.tsdb.retention.time=15d
      --web.enable-lifecycle
      --storage.tsdb.no-lockfile
      --web.external-url=http://prometheus.xxx.nip.io
      --web.route-prefix=/
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   rentHeadChunk(0xa27e840, 0x54de120)
                 /app/tsdb/head.go:1991 +0x22c
github.com/prometheus/prometheus/tsdb.(*memSeries).cutNewHeadChunk(0xa27e840, 0x6f51fa05, 0x173, 0x54de120, 0x1)
  /app/tsdb/head.go:1962 +0x24
github.com/prometheus/prometheus/tsdb.(*memSeries).append(0xa27e840, 0x6f51fa05, 0x173, 0x0, 0x41ad1ce0, 0x0, 0x0, 0x54de120, 0x1)
  /app/tsdb/head.go:2118 +0x3a4
github.com/prometheus/prometheus/tsdb.(*Head).processWALSamples(0x4ec4100, 0x6f195d00, 0x173, 0xd7664c0, 0xd766480, 0x0, 0x0)
  /app/tsdb/head.go:365 +0x284
github.com/prometheus/prometheus/tsdb.(*Head).loadWAL.func5(0x4ec4100, 0xa553770, 0xa553780, 0xd7664c0, 0xd766480)
  /app/tsdb/head.go:459 +0x3c
created by github.com/prometheus/prometheus/tsdb.(*Head).loadWAL
  /app/tsdb/head.go:458 +0x268
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xc pc=0x1532d88]

goroutine 245 [running]:
bufio.(*Writer).Available(...)
  /usr/local/go/src/bufio/bufio.go:608
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).WriteChunk(0x54de120, 0x139b9, 0x0, 0x6f1b0b85, 0x173, 0x6f510fa5, 0x173, 0x240cee0, 0xa311820, 0x0, ...)
  /app/tsdb/chunks/head_chunks.go:252 +0x500
github.com/prometheus/prometheus/tsdb.(*memSeries).mmapCurrentHeadChunk(0xa27e8f0, 0x54de120)
  /app/tsdb/head.go:1988 +0x6c
github.com/prometheus/prometheus/tsdb.(*memSeries).cutNewHeadChunk(0xa27e8f0, 0x6f51fa05, 0x173, 0x54de120, 0x1)
  /app/tsdb/head.go:1962 +0x24
github.com/prometheus/prometheus/tsdb.(*memSeries).append(0xa27e8f0, 0x6f51fa05, 0x173, 0x0, 0x41b64e90, 0x0, 0x0, 0x54de120, 0x10001)
  /app/tsdb/head.go:2118 +0x3a4
github.com/prometheus/prometheus/tsdb.(*Head).processWALSamples(0x4ec4100, 0x6f195d00, 0x173, 0xd766540, 0xd766500, 0x0, 0x0)
  /app/tsdb/head.go:365 +0x284
github.com/prometheus/prometheus/tsdb.(*Head).loadWAL.func5(0x4ec4100, 0xa553770, 0xa553780, 0xd766540, 0xd766500)
  /app/tsdb/head.go:459 +0x3c
created by github.com/prometheus/prometheus/tsdb.(*Head).loadWAL
  /app/tsdb/head.go:458 +0x268

      Exit Code:    2
      Started:      Mon, 27 Jul 2020 14:02:55 -0700
      Finished:     Mon, 27 Jul 2020 14:03:04 -0700
    Ready:          False
    Restart Count:  1918
    Requests:
      memory:     400Mi
    Liveness:     http-get http://:web/-/healthy delay=0s timeout=3s period=5s #success=1 #failure=6
    Readiness:    http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=120
    Environment:  <none>
    Mounts:
      /etc/prometheus/certs from tls-assets (ro)
      /etc/prometheus/config_out from config-out (ro)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /prometheus from prometheus-k8s-db (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-5mjmr (ro)
  prometheus-config-reloader:
    Container ID:  containerd://fb3474713bbb01391164e6f0874b3034f0672e353d8b6c054b95c99a673a52c7
    Image:         carlosedp/prometheus-config-reloader:v0.40.0
    Image ID:      docker.io/carlosedp/prometheus-config-reloader@sha256:218f9f49a51a072af66ac67696c092a4962fd5108cd5525dbbcea5c239fe3862
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/prometheus-config-reloader
    Args:
      --log-format=logfmt
      --reload-url=http://localhost:9090/-/reload
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
    State:          Running
      Started:      Fri, 10 Jul 2020 01:57:51 -0700
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:     100m
      memory:  25Mi
    Environment:
      POD_NAME:  prometheus-k8s-0 (v1:metadata.name)
    Mounts:
      /etc/prometheus/config from config (rw)
      /etc/prometheus/config_out from config-out (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-5mjmr (ro)
  rules-configmap-reloader:
    Container ID:  containerd://1294639831eb3e37a598e85f1d5d4043d9d3b441bf3acdaa380820178ece67cd
    Image:         carlosedp/configmap-reload:latest
    Image ID:      docker.io/carlosedp/configmap-reload@sha256:cd9f05743ab6024e445ea6e0da4416122eae5e1d0149dd33232be0601096c8d4
    Port:          <none>
    Host Port:     <none>
    Args:
      --webhook-url=http://localhost:9090/-/reload
      --volume-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
    State:          Running
      Started:      Fri, 10 Jul 2020 01:57:52 -0700
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:        100m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-5mjmr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s
    Optional:    false
  tls-assets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-tls-assets
    Optional:    false
  config-out:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  prometheus-k8s-rulefiles-0:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-k8s-rulefiles-0
    Optional:  false
  prometheus-k8s-db:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  prometheus-k8s-token-5mjmr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-token-5mjmr
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                        From                        Message
  ----     ------   ----                       ----                        -------
  Normal   Pulled   57m (x1908 over 17d)       kubelet, fuzzykube-worker3  Container image "prom/prometheus:v2.19.1" already present on machine
  Warning  BackOff  2m45s (x44716 over 7d23h)  kubelet, fuzzykube-worker3  Back-off restarting failed container
rossmckelvie commented 4 years ago

I read the same article and am having the same issue. I think the resources need to be adjusted: for the StatefulSet "prometheus-k8s", I'm seeing Memory Usage 180.5Mi, Requested 450Mi, and Limit 50Mi. CPU is slightly over the requested & limit of 200m, at 238.61m.

carlosedp commented 4 years ago

Exactly, this might be caused by the amount of memory Prometheus is using. If you are on a memory-limited device like an RPi with 1 or 2GB of RAM, that might be the problem. I don't think there is a way to limit its memory usage.

Closing as it's not a monitoring stack issue.

rossmckelvie commented 4 years ago

I'm using an 8GB Pi. Do you know if the script is setting the limits in the manifests? It's not a limit I set.

carlosedp commented 4 years ago

By default, Prometheus has a resource request of 400Mi set in the upstream library (github.com/coreos/kube-prometheus/jsonnet/kube-prometheus/prometheus/prometheus.libsonnet). You might try overriding it and testing, but it's out of scope here.
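
For reference, an override along these lines can be sketched in jsonnet. This is a sketch only — the field names follow the kube-prometheus mixin conventions of that era, and the exact import path and memory values should be checked against the version vendored by this repo:

```jsonnet
// Sketch: override the default Prometheus resource request/limit
// via the kube-prometheus jsonnet mixin (paths and values illustrative).
local kp = (import 'kube-prometheus/kube-prometheus.libsonnet') + {
  prometheus+:: {
    prometheus+: {
      spec+: {
        resources: {
          requests: { memory: '400Mi', cpu: '300m' },
          limits: { memory: '600Mi', cpu: '400m' },
        },
      },
    },
  },
};
```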

McFuzz89 commented 4 years ago

> read the same article, having the same issue. I think the resources need to be adjusted, for the stateful set "prometheus-k8s", I'm seeing Memory Usage 180.5Mi, Requested 450Mi, and Limit 50Mi. CPU is slightly over the requested & limit of 200m at 238.61m.

How did you check the utilization? When I try to do kubectl top pod, it fails :(

edit: I modified prometheus-prometheus.yaml to add some limits, and that seems, at least for the moment, to have done the trick, as the pod no longer crashes...

resources:
  requests:
    memory: 400Mi
    cpu: 300m
  limits:
    memory: 600Mi
    cpu: 400m
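
For context, those fields sit under spec in the Prometheus custom resource. A sketch of the surrounding manifest (the metadata values are taken from the pod description above; other spec fields are omitted):

```yaml
# prometheus-prometheus.yaml (excerpt; sketch only, other spec fields omitted)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  resources:
    requests:
      memory: 400Mi
      cpu: 300m
    limits:
      memory: 600Mi
      cpu: 400m
```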
rossmckelvie commented 4 years ago

@McFuzz89 I use https://infra.app/ for monitoring my homelab cluster

rossmckelvie commented 4 years ago

Yeah, those increased limits were something I was hoping we could fix in this project for others, but it seems like we should just make them after the initial setup is done. I also made some changes to the generated files to enable anonymous mode in Grafana for a kiosk in my office, and to move the CPU temperature towards the top so it fits. The grafana.ini file is base64-encoded in the grafana-configmap, but you can easily decode -> modify -> encode and run kubectl apply -f grafana-configmap.yaml
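
A local sketch of that decode -> modify -> encode round trip. The grafana.ini section and key below are illustrative stand-ins, not the actual contents of the configmap; substitute the real decoded file before pasting the result back into grafana-configmap.yaml:

```shell
# Stand-in for the file extracted from the configmap (illustrative content).
printf '[auth.anonymous]\nenabled = false\n' > grafana.ini
# The form stored in the ConfigMap: one base64 line.
encoded=$(base64 < grafana.ini | tr -d '\n')
# Decode, flip the setting, and write the modified file.
echo "$encoded" | base64 -d \
  | sed 's/enabled = false/enabled = true/' > grafana.new.ini
# Re-encode; this string goes back into the manifest before kubectl apply.
base64 < grafana.new.ini | tr -d '\n' > grafana.new.b64
```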

McFuzz89 commented 4 years ago

@rossmckelvie - thanks for the tip! Based on historical data, it seems Prometheus crashes after about a day of operation, so come tomorrow evening my time, I should know if the changes worked.

rossmckelvie commented 4 years ago

@McFuzz89 any luck?