Modify the oomkill.rule to include a longer duration

Enhancement Proposal

Change the range vector in the oomkill.rule to use a duration longer than 1m, e.g. 1h. That is, change the rule expression from

expr: increase(node_vmstat_oom_kill[1m]) > 0

to

expr: increase(node_vmstat_oom_kill[1h]) > 0
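
For context, here is a rough sketch of how the modified expression might sit in a Prometheus rule file. Only the `expr` line comes from this proposal. The group name, alert name, severity label, and annotation text are assumptions for illustration and may not match the actual oomkill.rule.

```yaml
# Sketch only: group name, alert name, labels, and annotations are hypothetical.
groups:
  - name: oomkill                  # hypothetical group name
    rules:
      - alert: OOMKillDetected     # hypothetical alert name
        # Expression from this proposal: look back 1h instead of 1m.
        expr: increase(node_vmstat_oom_kill[1h]) > 0
        labels:
          severity: warning        # assumed severity
        annotations:
          summary: "OOM kill detected on {{ $labels.instance }}"
```

One trade-off to keep in mind: with a 1h window the expression stays above 0 for up to an hour after a kill, so the alert would also take longer to resolve.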

Reason: We had several OOM kills on some units but were not alerted. The network was not working properly after the OOM kills, so samples were not delivered reliably and the increase over a one-minute window was never greater than 0. See the image and the following log:

2024-01-31 01:34:08 
Jan 30 22:28:29 juju-53f11a-prod grafana-agent.grafana-agent[21793]: ts=2024-01-30T22:18:54.59188991Z caller=dedupe.go:112 agent=prometheus instance=949284e9638ac8d936e7e940dc38679c component=remote level=warn remote_name=949284-b1daae url=http://prometheus-0/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://prometheus/api/v1/write\": context deadline exceeded"

