Open Kalaivin opened 1 month ago
Coming from a System Center Operations Manager (SCOM) background, I have always found alert tuning to be a hot topic. SCOM was always delivered with a default set of alert rules and monitors enabled out of the box. After deploying SCOM, you would spend the next several months tuning alerts by running "Top 10" alert reports to see what the top talkers were. You would either fix the underlying hardware or software issue, or tune the alert to a threshold that kept it from firing so often. Other times, you would just disable the alert because it was not relevant to your environment.
Initially, I recommend enabling only Severity 0 alerts to be emailed out via the action group, then reviewing the top firing alerts in Azure Monitor weekly and addressing or remediating each one.
I like having all the alerts firing initially, because I can go into Azure Monitor and get a picture of the overall health of the environment from the alerts that are firing. As time goes on, you will adjust thresholds, windows, and alert severities to match what is supportable in your environment. There is no one-size-fits-all set of alerts or configuration...
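As a rough illustration of that weekly review, here is a minimal Python sketch of a "Top 10" style report plus a Severity 0 email filter. It assumes the fired alerts have already been exported into a list of records; the rule names, the `rule` and `severity` field names, and the helper functions are all hypothetical and are not part of any Azure API.

```python
from collections import Counter

# Hypothetical sample of fired alerts. In practice this would come from an
# export of your alert history; rule names and field names are made up.
fired_alerts = [
    {"rule": "High CPU", "severity": 2},
    {"rule": "High CPU", "severity": 2},
    {"rule": "Low Disk Space", "severity": 1},
    {"rule": "High CPU", "severity": 2},
    {"rule": "Heartbeat Missing", "severity": 0},
    {"rule": "Low Disk Space", "severity": 1},
]


def top_talkers(alerts, n=10):
    """Return the n rules that fired most often as (rule, count) pairs."""
    return Counter(alert["rule"] for alert in alerts).most_common(n)


def should_email(alert, max_severity=0):
    """Email only alerts at or below max_severity (0 is the most severe)."""
    return alert["severity"] <= max_severity


# Weekly review: list the noisiest rules first.
for rule, count in top_talkers(fired_alerts):
    print(f"{rule}: fired {count} times")

# Only Severity 0 alerts are routed to email at the start.
emailed = [a["rule"] for a in fired_alerts if should_email(a)]
print("Emailed:", emailed)
```

Raising `max_severity` later is one way to widen what gets emailed once the noisy rules have been tuned or disabled.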
Description
When monitoring IT systems, it’s a best practice to first identify the events or system properties of interest, then determine the appropriate actions to take when these events or property changes occur. Only after this should monitoring and alerts be configured to ensure these events and changes are observed. This approach reduces alert noise and clarifies the actions needed when a particular alert is triggered. Ultimately, it also allows for automation of these actions, reducing the time to resolution.
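To make the idea concrete, here is a small Python sketch of what "event first, action second, alert last" could look like as data. Everything in it is an assumption for illustration: the `AlertPlaybook` type, the alert names, and the handler functions are hypothetical placeholders, not part of any existing tool or of this repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class AlertPlaybook:
    why: str                         # why the event matters
    action: str                      # what a human should do about it
    automate: Callable[[str], str]   # optional automated version of the action


def restart_agent(resource: str) -> str:
    # Placeholder: a real handler would call your automation tooling.
    return f"restart monitoring agent on {resource}"


def open_ticket(resource: str) -> str:
    # Placeholder: a real handler would create a ticket in your ITSM system.
    return f"open ticket for {resource}"


# Hypothetical catalog, written before any alert rule is enabled.
PLAYBOOKS: Dict[str, AlertPlaybook] = {
    "Heartbeat Missing": AlertPlaybook(
        why="The machine has stopped reporting, so other alerts cannot fire.",
        action="Check that the machine is running, then restart the agent.",
        automate=restart_agent,
    ),
    "Low Disk Space": AlertPlaybook(
        why="A full disk can stop services from writing data.",
        action="Free up or extend the disk.",
        automate=open_ticket,
    ),
}


def handle(alert_name: str, resource: str) -> str:
    """Run the documented action for an alert, or flag the missing context."""
    playbook = PLAYBOOKS.get(alert_name)
    if playbook is None:
        return f"{alert_name}: no documented action, review before enabling"
    return playbook.automate(resource)


print(handle("Heartbeat Missing", "vm-01"))
print(handle("High CPU", "vm-01"))
```

An alert with no entry in the catalog is exactly the kind that tends to become noise, which is why the sketch flags it instead of silently ignoring it.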
My request is to update the documentation to provide additional context for each alert: why is it important, and what are the recommended actions when the alert is fired?
Thank you!