carlosedp / cluster-monitoring

Cluster monitoring stack for clusters based on Prometheus Operator
MIT License

prometheus-k8s fails to start after a while 'CrashLoopBackOff' #49

Closed: buholzer closed this issue 4 years ago

buholzer commented 4 years ago

Prometheus crashes after running for some time and then fails to restart. It looks like the TSDB grows too large and Prometheus can no longer map its blocks into memory (mmap: cannot allocate memory).

% kubectl logs prometheus-k8s-0 -p prometheus

level=info ts=2020-05-31T06:15:21.309Z caller=main.go:329 msg="Starting Prometheus" version="(version=2.11.1, branch=HEAD, revision=e5b22494857deca4b806f74f6e3a6ee30c251763)"
level=info ts=2020-05-31T06:15:21.309Z caller=main.go:331 host_details="(Linux 4.19.97-v7l+ #1294 SMP Thu Jan 30 13:21:14 GMT 2020 armv7l prometheus-k8s-0 (none))"
...
level=info ts=2020-05-31T06:15:21.313Z caller=main.go:652 msg="Starting TSDB ..."
...
level=info ts=2020-05-31T06:15:21.320Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1590494400000 maxt=1590501600000 ulid=01E99KKY0EHNP5Y85BWZKD85CX
level=info ts=2020-05-31T06:15:21.320Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1590501600000 maxt=1590508800000 ulid=01E99KMVH5A14VHEX6SGYZSM5R
level=info ts=2020-05-31T06:15:21.320Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1590508800000 maxt=1590516000000 ulid=01E9AS2F3X2Z2WARE081EYBFP4
...
level=info ts=2020-05-31T06:15:21.740Z caller=main.go:521 msg="Stopping scrape discovery manager..."
level=info ts=2020-05-31T06:15:21.740Z caller=main.go:535 msg="Stopping notify discovery manager..."
level=info ts=2020-05-31T06:15:21.740Z caller=main.go:557 msg="Stopping scrape manager..."
level=info ts=2020-05-31T06:15:21.740Z caller=manager.go:776 component="rule manager" msg="Stopping rule manager..."
level=info ts=2020-05-31T06:15:21.741Z caller=manager.go:782 component="rule manager" msg="Rule manager stopped"
level=info ts=2020-05-31T06:15:21.741Z caller=notifier.go:602 component=notifier msg="Stopping notification manager..."
level=info ts=2020-05-31T06:15:21.741Z caller=main.go:531 msg="Notify discovery manager stopped"
level=info ts=2020-05-31T06:15:21.741Z caller=main.go:722 msg="Notifier manager stopped"
level=info ts=2020-05-31T06:15:21.741Z caller=main.go:517 msg="Scrape discovery manager stopped"
level=info ts=2020-05-31T06:15:21.741Z caller=main.go:551 msg="Scrape manager stopped"
level=error ts=2020-05-31T06:15:21.741Z caller=main.go:731 err="opening storage failed: unexpected corrupted block:map[01E9AS2F3X2Z2WARE081EYBFP4:mmap files: mmap: cannot allocate memory]"

I have set the persistence settings to false in vars.jsonnet.

  // Setting these to false, defaults to emptyDirs
  enablePersistence: {
    prometheus: false,
    grafana: false,
  },

Is there an easy way to configure the TSDB retention behavior? https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects

--storage.tsdb.retention.size: [EXPERIMENTAL] This determines the maximum number 
of bytes that storage blocks can use (note that this does not include the WAL size, 
which can be substantial). The oldest data will be removed first. Defaults to 0 or 
disabled. This flag is experimental and can be changed in future releases. Units 
supported: KB, MB, GB, PB. Ex: "512MB"

carlosedp commented 4 years ago

You need to edit the retention parameter in the operator definition.

Check the line:

https://github.com/carlosedp/cluster-monitoring/blob/40c9318d236bc8749fa1af27547c516dae9aad2d/base_operator_stack.jsonnet#L73
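
A minimal sketch of what such an override might look like, assuming the stack follows the usual kube-prometheus jsonnet layout (a `prometheus+::` object wrapping the Prometheus custom resource). The nesting and the values here are illustrative assumptions, not copied from the linked line. `retention` and `retentionSize` are fields of the Prometheus Operator `PrometheusSpec`; whether `retentionSize` is honored depends on the operator version deployed.

```jsonnet
// Hypothetical mixin: merge it into the stack definition, or edit the
// existing spec in base_operator_stack.jsonnet directly.
{
  prometheus+:: {
    prometheus+: {
      spec+: {
        // Time-based retention: blocks older than this are deleted.
        retention: '5d',
        // Size-based retention (assumption: supported by the operator
        // version in use). Maps to --storage.tsdb.retention.size and
        // does not count the WAL.
        retentionSize: '2GB',
      },
    },
  },
}
```

Shorter retention reduces how much block data Prometheus has to mmap at startup, but as the replies below suggest, it may not be enough on its own on a 32-bit OS.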

REBELinBLUE commented 4 years ago

I have it set to just 5d and I am still getting this error. Would switching my RPi 4 (4GB) from Raspbian to a 64-bit OS help? (I've been meaning to do this for a while anyway.)

carlosedp commented 4 years ago

Yes, use a 64-bit OS.