Closed AlitzelMendez closed 5 months ago
The main focus of this work and the reason I am driving this is that we have a significant number of alerts for which we don't know what action to take. In the past, there was always someone on the team who would know what to do with a specific alert, but now we need to ensure that anyone can handle it or at least take the first steps when an alert gets triggered.
The following alerts don't have actionable documentation, or the documentation is not clear to someone who doesn't have the full context of the service:
1. Incomplete documentation: the documentation exists but is incomplete.
2. The query is not fully related to the alert.
These alerts have good documentation that is clear and actionable, but it lives on the alert issue instead of our wiki. This documentation needs to be moved to our wiki.
Alert Id | Name | Hits ever |
---|---|---|
d2dd705a6c724ed68fcf6955561c06dd | DotNetEng Status Failed Requests/Hour alert | 7 |
5aa74f27ef6445ce9d3d8d3d382e7e35 | Servicing jobs in R&D queues alert | 7 |
Alerts that should be triggered under certain scenarios but are not firing. Their queries need to be reviewed.
Alert Id | Name |
---|---|
763d449c7cd747a786373befe76ad19b | Queue Insights Failures alert |
65dba2e7b92b4c4794316a09e22f918d | Helix Service Fabric MSI Exceptions (For Alerts only) alert |
The following alerts get triggered in Grafana and their status can be seen on the dashboard, but we are not alerted in GitHub.
Name |
---|
SLA - Wait Time Threshold Alert |
Test Queues: SLA WorkItem Wait Time Alert alert |
Work Items Waiting Time Is Too High (Build Pools) alert |
Work Items Waiting Time Is Too High (Test Queues) |
I love this!
cc/ @shawnro
Nice! Does the list also include the ones that Ilya said were disabled because they weren't useful in their current state? (I think it was the on-prem alerts)
Yes, I think all of them are under: Alerts that are not alerting to GitHub 😄
Thanks for taking a look at the list. I am going to start working on this!
As there are a lot of alerts, I will work on them in separate issues and update this issue with links to the new ones! :)
Auditing Grafana Alerts to guarantee that