No alerts are triggered when scraping job times out

przemeklal commented 10 months ago

In a real-life deployment, scraping hardware-exporter may take more than 10s, e.g.

time curl localhost:10000/metrics
...
real    0m25.534s
user    0m0.008s
sys 0m0.000s

By default, grafana-agent's scrape_timeout is 10s. If the scrape takes more than 10s, it fails silently, there are no alerts about missing metrics and any firing alerts get resolved because there are no metrics in Prometheus since grafana-agent doesn't push them.

As a reliability engineer, I didn't know this was happening until I ran queries in Prometheus to check whether the metrics were there.

The workaround is to bump global_scrape_timeout in grafana-agent juju config to something like 60s to make sure the scrape jobs are actually executed.

jneo8 commented 10 months ago

We can try to add timeout mechanism on exporter to have this timeout metrics.

aieri commented 4 weeks ago

This may be fixed by https://github.com/canonical/grafana-agent-operator/pull/147

canonical / hardware-observer-operator

No alerts are triggered when scraping job times out #126