Change the oomkill.rule to use a range-vector duration longer than 1m, e.g. 1h. That is, change the rule expression from
expr: increase(node_vmstat_oom_kill[1m]) > 0
to
expr: increase(node_vmstat_oom_kill[1h]) > 0
Reason: We had several OOM kills on some units and were not alerted. The network was not working properly after the OOM, so the increase over any 1m window was never greater than 0 (presumably because samples were missing around the time of the kill). See image
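The effect can be illustrated with a small simulation. This is only a sketch under assumptions: it assumes "the network was not working" means scrapes were lost for a while after the kill, and it uses a hypothetical 15 s scrape interval and a hypothetical 5 minute gap. `simple_increase` is a deliberately simplified stand-in for PromQL's `increase()` (last sample minus first sample in the window, with no extrapolation or counter-reset handling), not the real implementation.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, counter value)


def simple_increase(samples: List[Sample], now: int, window: int) -> Optional[float]:
    """Simplified increase(): last minus first sample inside the window.

    Returns None when fewer than two samples fall in the window (no result).
    Extrapolation and counter resets are ignored. `samples` must be sorted
    by timestamp.
    """
    in_window = [v for t, v in samples if now - window < t <= now]
    if len(in_window) < 2:
        return None
    return in_window[-1] - in_window[0]


def alert_fires(samples: List[Sample], window: int, eval_times: range) -> bool:
    """True if `simple_increase(...) > 0` holds at any evaluation time."""
    for now in eval_times:
        inc = simple_increase(samples, now, window)
        if inc is not None and inc > 0:
            return True
    return False


# Hypothetical timeline: 15 s scrape interval, OOM kill at t=300,
# and all scrapes lost for t in [300, 600) while the network is down.
SCRAPE_INTERVAL = 15
samples = [
    (t, 0.0 if t < 300 else 1.0)
    for t in range(0, 901, SCRAPE_INTERVAL)
    if not (300 <= t < 600)
]
eval_times = range(0, 901, SCRAPE_INTERVAL)

fires_1m = alert_fires(samples, 60, eval_times)
fires_1h = alert_fires(samples, 3600, eval_times)
print("1m window fires:", fires_1m)
print("1h window fires:", fires_1h)
```

In this model no 1m window ever contains both a sample from before the kill and one from after it, so the 1m expression never exceeds 0. The 1h window spans the gap and sees the counter step from 0 to 1.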
See also the following log: