HostHighDiskWriteRate threshold (50M/s) is too low

przemeklal commented 10 months ago

Bug Description

This threshold seems too low, especially for NVMe drives used as bcache on busy Ceph clusters. Write rates around 50 MB/s are pretty normal, so the default threshold should be at least 100 MB/s. Also, for: 5m seems too aggressive in production; I'd suggest increasing it to at least 20m.

We see a lot of flapping and false positives, and the alert itself is not actionable.
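
For illustration, the suggested values could look like the rule below. This is only a sketch: it assumes the rule is built on node_exporter's node_disk_written_bytes_total counter and reports per-device throughput in MiB/s. The group name, rate window, severity label and summary text are placeholders, not the charm's actual rule.

```yaml
groups:
  - name: host-disk-sketch  # hypothetical group name
    rules:
      - alert: HostHighDiskWriteRate
        # Assumed expression: per-device write throughput in MiB/s.
        # 100 is the raised threshold proposed in this issue (was 50).
        expr: rate(node_disk_written_bytes_total[5m]) / 1024 / 1024 > 100
        # Require the condition to hold for 20 minutes before firing,
        # which should reduce flapping on short write bursts.
        for: 20m
        labels:
          severity: warning  # placeholder severity
        annotations:
          summary: "High disk write rate on {{ $labels.instance }} ({{ $labels.device }})"
```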

To Reproduce

Deploy grafana-agent and start writing data >50MB/s :)

Environment

grafana-agent rev 29 on focal

Relevant log output

{agent_hostname="redacted", device="nvme0c0n1", instance="redacted/24", job="redacted_grafana-agent-host_node-exporter", juju_application="redacted", juju_model="redacted", juju_model_uuid="redacted", juju_unit="redacted/24"}

62.13687337239583

Additional context

No response

lucabello commented 9 months ago

This is an alert rule that is hard to be correctly opinionated about: see https://cloud.google.com/compute/docs/disks/performance#pd-ssd.

We'll think about how to change it, or maybe remove the check entirely.

przemeklal commented 8 months ago

The same comment goes for the HostHighDiskReadRate alert rule.

@lucabello I believe you're right and it's a good idea to remove these checks since it's impossible to find a reasonable, universal threshold. These alert rules can always be added using the cos-configuration charm on clusters where this may be important.
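
As a rough sketch of what such a site-specific rules file might contain, assuming the same node_exporter counters as above: the thresholds, the NVMe-only device filter and the group name are made-up examples that each cluster would need to tune, and how the file is delivered to cos-configuration is not shown here.

```yaml
groups:
  - name: site-disk-throughput  # hypothetical group name
    rules:
      - alert: HostHighDiskWriteRate
        # Example only: watch NVMe devices, with a threshold chosen for this site.
        expr: rate(node_disk_written_bytes_total{device=~"nvme.*"}[5m]) / 1024 / 1024 > 500
        for: 20m
        labels:
          severity: warning
      - alert: HostHighDiskReadRate
        # Read-side counterpart, using node_exporter's read counter.
        expr: rate(node_disk_read_bytes_total{device=~"nvme.*"}[5m]) / 1024 / 1024 > 500
        for: 20m
        labels:
          severity: warning
```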

err404r commented 5 months ago

Also, this alert rule is generated for LXD containers, which makes zero sense, so +1 for removing the check. Or move the threshold definition to the charm configuration...

canonical / grafana-agent-operator