carlosedp / cluster-monitoring

Cluster monitoring stack for clusters based on Prometheus Operator
MIT License

How to change Prometheus' scrapeInterval? #46

Closed geerlingguy closed 4 years ago

geerlingguy commented 4 years ago

It seems the default (https://prometheus.io/docs/prometheus/latest/configuration/configuration/) is 60s/1m; though I noticed one override for 30s in base_operator_stack.jsonnet for kubeStateMetrics.

It seems the Prometheus pod is a bit overloaded when it gets deployed to some of my older Pi 3 B boards (seems to run okay on the Pi 4 with more RAM and faster CPU though)... and I'm wondering if setting the scrape interval to something a bit more lightweight like 2m or 5m might help the poor older Pis keep up.

I was about to jump over to my prometheus instance but that node just went down due to thrashing as it ran out of memory 🤪

Anyways, just a quick support question, not a big deal and I may work on getting the memory requirements a little more stringent so Prometheus only goes to one of the newer/faster nodes.

geerlingguy commented 4 years ago

The node just came back online, so I checked the loaded Prometheus config (at http://prometheus.10.0.100.74.nip.io/config), and it shows:

global:
  scrape_interval: 30s
  scrape_timeout: 10s

I'd like to try 1m or 5m and see if the node can keep up a little better.

carlosedp commented 4 years ago

This is kinda needed in some cases because on the slowest boards some targets time out. It's not trivial to do since it would require overriding all ServiceMonitors with the new values. Will think about it.

geerlingguy commented 4 years ago

It's not a major priority; for now the easiest solution is to taint slower boards in some way to make sure things like Prometheus don't get scheduled on them (or just upgrade them :D).
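As a sketch of the tainting approach mentioned above (the taint key, label name, and node names here are hypothetical, not part of this repo), you could taint the slow boards and/or pin Prometheus to faster nodes via a `nodeSelector` in the spec jsonnet:

```jsonnet
// Hypothetical sketch: keep Prometheus off slow nodes.
// First taint the slow boards, e.g.:
//   kubectl taint nodes pi3-node-1 board-speed=slow:NoSchedule
// Since the Prometheus pods carry no matching toleration, the
// scheduler will avoid those nodes. Alternatively, pin Prometheus
// to the faster boards with a nodeSelector:
prometheus+:: {
  prometheus+: {
    spec+: {
      // "board-speed: fast" is an illustrative label you would add with:
      //   kubectl label nodes pi4-node-1 board-speed=fast
      nodeSelector: { 'board-speed': 'fast' },
    },
  },
},
```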

carlosedp commented 4 years ago

Let's see the feedback in https://github.com/coreos/prometheus-operator/pull/539#issuecomment-634237988. If it is OK, I will submit a PR allowing customization of the default timeout parameter.

The scraping interval can already be customized in the base_operator_stack.jsonnet file:

prometheus+:: {
    // Add option (from vars.yaml) to enable persistence
    local pvc = k.core.v1.persistentVolumeClaim,
    prometheus+: {
      spec+: {
        replicas: $._config.prometheus.replicas,
        retention: '15d',
        // Add scrapeInterval here after "retention"
        scrapeInterval: '5m',
...
carlosedp commented 4 years ago

Opened https://github.com/coreos/prometheus-operator/pull/3250 to address this.

Once it's merged, it's a matter of adding the scrapeTimeout parameter in the prometheus spec jsonnet like above.
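Once available, the addition would presumably look something like this (a sketch only; the interval/timeout values are examples, and note that Prometheus requires the scrape timeout to be less than or equal to the scrape interval):

```jsonnet
prometheus+:: {
  prometheus+: {
    spec+: {
      retention: '15d',
      scrapeInterval: '2m',
      // Must be <= scrapeInterval; a longer timeout helps slow boards
      // whose targets currently time out at the default 10s.
      scrapeTimeout: '30s',
    },
  },
},
```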

carlosedp commented 4 years ago

The PR https://github.com/coreos/prometheus-operator/pull/3250 has been merged in the prometheus-operator project. Now it's a matter of waiting for the new release to update the libraries here.

carlosedp commented 4 years ago

Fixed in https://github.com/carlosedp/cluster-monitoring/commit/2ffe9ea31f551996aeb257968e2d4f31cee3c934.

Just check the new parameter in vars.jsonnet.
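For illustration, a vars.jsonnet override might take a shape like the following (the key names and placement here are assumptions; check vars.jsonnet in the repo for the actual parameter):

```jsonnet
{
  // Illustrative only: the real key name/location may differ.
  prometheus: {
    retention: '15d',
    scrapeInterval: '2m',  // longer interval to lighten load on slow boards
  },
}
```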