Azure / azure-sdk-tools

Tools repository leveraged by the Azure SDK team.
MIT License
114 stars 177 forks source link

[stress testing] Add better monitoring for stress cluster conditions #2512

Open benbp opened 2 years ago

benbp commented 2 years ago

We should alert/email/notify on various basic stress cluster health conditions and also have dashboards to show various events:

Alerts:

  1. Nodes down
  2. Pods sitting unable to be scheduled
  3. Key services downtime (e.g. svcs in kube-system and stress-infra namespaces: stress watcher, chaos mesh, oms agent, secret store provider, storage class provider, kube apiserver).

Additional events for dashboard:

  1. Pod OOMKills over time by pod/namespace/node/etc.
  2. Pod evicted and/or insufficient resource events over time
  3. Key service restarts and resource usage

Stress infra events:

  1. Add and emit metrics from stress watcher for failures (e.g. credential failure)
ckairen commented 2 years ago

Feature requests: (Alert on...) Ran out of memory bad exit code Image pull back off pod eviction if k8s scheduler re-schedules pod emails can contain log and dashboard links

benbp commented 2 years ago

Additional request:

Adding links to a wiki/readme for various alert conditions about how to investigate/remediate.