Open benbp opened 2 years ago
Feature requests: (Alert on...)
- Ran out of memory
- Bad exit code
- Image pull back off
- Pod eviction
- If k8s scheduler re-schedules pod
- Emails can contain log and dashboard links
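As a rough illustration of how the alert conditions above could be detected, here is a minimal sketch that classifies a single pod object (the parsed JSON form of a Kubernetes Pod) into alert names. The function name `classify_pod` and the alert labels (`out_of_memory`, `bad_exit_code`, `image_pull_backoff`, `pod_evicted`) are hypothetical and not part of any existing tooling. The fields it reads (`status.reason`, `containerStatuses[].state`, `lastState`) are standard Pod status fields. Scheduler re-scheduling is not covered here, since it likely needs event data rather than pod status alone.

```python
from typing import Any


def classify_pod(pod: dict[str, Any]) -> list[str]:
    """Return hypothetical alert names that apply to one pod (parsed JSON)."""
    status = pod.get("status") or {}
    alerts: set[str] = set()

    # Evicted pods are reported with status.reason == "Evicted".
    if status.get("reason") == "Evicted":
        alerts.add("pod_evicted")

    for cs in status.get("containerStatuses") or []:
        # A container stuck pulling its image sits in the waiting state.
        waiting = (cs.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in ("ImagePullBackOff", "ErrImagePull"):
            alerts.add("image_pull_backoff")

        # A restarted container reports its previous crash under lastState,
        # so check both the current and the previous termination record.
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if not terminated:
                continue
            if terminated.get("reason") == "OOMKilled":
                alerts.add("out_of_memory")
            elif terminated.get("exitCode", 0) != 0:
                alerts.add("bad_exit_code")

    return sorted(alerts)
```

One possible input source is the `items` array from `kubectl get pods -o json`; each returned alert name could then be mapped to an email template carrying the log and dashboard links.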
Additional request:
Add links to a wiki/readme that explains how to investigate and remediate each alert condition.
We should alert/email/notify on various basic stress cluster health conditions and also have dashboards to show various events:
Alerts:
Additional events for dashboard:
Stress infra events: