dotnet / arcade

Tools that provide common build infrastructure for multiple .NET Foundation projects.
MIT License

Auditing Grafana Alerts #12829

Closed AlitzelMendez closed 5 months ago

AlitzelMendez commented 1 year ago

Auditing Grafana Alerts to guarantee that


AlitzelMendez commented 1 year ago

The main focus of this work and the reason I am driving this is that we have a significant number of alerts for which we don't know what action to take. In the past, there was always someone on the team who would know what to do with a specific alert, but now we need to ensure that anyone can handle it or at least take the first steps when an alert gets triggered.

Alerts that need documentation to be updated, created, or fixed

The following alerts don't have actionable documentation, or the documentation is not clear to someone who doesn't have the full context of the service.

| Alert Id | Name | Production: Hits 2 months | Production: Hits Ever | Issue |
|---|---|---|---|---|
| 54aa0d7e647e46ff9e880bf6ae532b99 [1] | Autoscale: Minutes to scale-up from zero machine alert | 13 | 168 | #12868 |
| 44aff3c937c042caa09f821ae923c26c | Azure quota usage for west us 2 | 8 | 15 | #12868 |
| d2356d84cf3e43ea952d81de941eaa76 | On-Prem Machines Heartbeating By Queue alert | 7 | 137 | #12868 |
| 24cae10d9eca44079e7cf3d47f148497 | Helix API Average Response Time | 4 | 23 | #12868 |
| 6fe0b7b34a004f0bad0064a42f9b9135 [1] | Build Analysis: Exceptions and Errors Alert | 2 | 13 | #12868 |
| a5641c4a6d8e4e499f1710aa8386d81b [1] | Servicing Builds Running in R&D Pools Alert | 2 | 7 | |
| 6213d3c5ce9a46278343bf075798e46f [2] | Helix AutoScaler Service Stopped Running | 1 | 30 | |
| d70761f3c7e84a6380e44943a2e583e6 [1] | Apple device failure rate alert | 1 | 6 | |
| 116ed29b46934330b0aa31e843807b32 | On-Prem Machines Heartbeating alert | 1 | 5 | |
| 6179576701874a7abc440a574cf636d0 | Helix API availability | 1 | 5 | |
| 2ca5b0285c1e4179b621f916b8b5e75f | High Number of Machines With Low Disk Space in Some Queue(s) | 0 | 14 | |
| 7f6435eff89c4306ad11684c461533c5 | Data Migration: Migration Queue Depth alert | 0 | 13 | |
| f391698eca5c411aa51ad5e3ba37c72e [1] | Helix Service Fabric exception count alert | 0 | 5 | |
| e2be2ec3e22e46d28730bab54ff8fa77 | Azure quota usage for west us | 0 | 3 | |
| fd20a5c2ffca4e89940bc33e00e7aada | Maestro / BarViz alert | 0 | 3 | |
| b87805b575d141e4b2b6d258f5e3a79e | Maestro Failed Requests/Hour alert | 0 | 2 | |
| 9a64e127211c4352b38b4ce1e1ddf5ce | SQL database size too high | 0 | 1 | |
| b50b57fa7d1840438da5232711af4485 | Azure quota usage for east us | 0 | 1 | |
| 4782665d50764ecb9dbd5d929e017115 | SQL Cleaner alert | 0 | 0 | |
| fb8faaf7600740f98a1c2db076cd1712 | source.dot.net Availability | 0 | 0 | |

[1] Incomplete documentation: the documentation exists but is incomplete.
[2] The query is not fully related to the alert.

Alerts that need documentation to be migrated to our wiki

These alerts have good documentation that is clear and actionable, but it lives on the alert issue instead of our wiki. This documentation needs to be moved to our wiki.

| Alert Id | Name | Hits ever |
|---|---|---|
| d2dd705a6c724ed68fcf6955561c06dd | DotNetEng Status Failed Requests/Hour alert | 7 |
| 5aa74f27ef6445ce9d3d8d3d382e7e35 | Servicing jobs in R&D queues alert | 7 |

Alerts that "probably" are not alerting when they should

These alerts should be triggered under certain scenarios but are not; their queries need to be reviewed.

| Alert Id | Name |
|---|---|
| 763d449c7cd747a786373befe76ad19b | Queue Insights Failures alert |
| 65dba2e7b92b4c4794316a09e22f918d | Helix Service Fabric MSI Exceptions (For Alerts only) alert |

Alerts that are not alerting to GitHub

The following alerts get triggered in Grafana and their status can be seen on the dashboard, but we are not alerted in GitHub.

| Name |
|---|
| SLA - Wait Time Threshold Alert |
| Test Queues: SLA WorkItem Wait Time Alert alert |
| Work Items Waiting Time Is Too High (Build Pools) alert |
| Work Items Waiting Time Is Too High (Test Queues) |

markwilkie commented 1 year ago

I love this!

cc/ @shawnro

missymessa commented 1 year ago

Nice! Does the list also include the ones that Ilya said were disabled because they weren't useful in their current state? (I think it was the on-prem alerts)

AlitzelMendez commented 1 year ago

> Nice! Does the list also include the ones that Ilya said were disabled because they weren't useful in their current state? (I think it was the on-prem alerts)

Yes, I think all of them are under "Alerts that are not alerting to GitHub" 😄

AlitzelMendez commented 1 year ago

Thanks for taking a look at the list. I am going to start working on this!

As there are a lot of alerts, I will be working on them in separate issues, and I will update this issue with links to the new issues! :)