dotnet / dnceng

.NET Engineering Services
MIT License

Production - [Alerting] Android emulator failure rate alert #3049

Open dotnet-eng-status[bot] opened 1 month ago

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-e38f14fe3367451d8de43da6e2453fdd

dotnet-eng-status[bot] commented 1 month ago

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 1 month ago

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 1 month ago

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 1 month ago

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 1 month ago

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

AlitzelMendez commented 1 month ago

This is only happening on a couple of machines, so it probably still falls under the "ignore" (non-catastrophic) scenario. The errors are: device not found.

But it is happening fairly consistently. Should we follow up on this more thoroughly, @premun?

dotnet-eng-status[bot] commented 1 month ago

:broken_heart: Metric state changed to alerting

premun commented 1 month ago

The only option is probably to tune the alert trigger conditions. Unfortunately, I don't know how to make this alert fire only when the number of affected machines is larger than some threshold. Maybe it is possible, though.

dotnet-eng-status[bot] commented 1 month ago

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] commented 4 weeks ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 4 weeks ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 4 weeks ago

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] commented 3 weeks ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 3 weeks ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 3 weeks ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 3 weeks ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 3 weeks ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 3 weeks ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 3 weeks ago

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] commented 2 weeks ago

:green_heart: Metric state changed to ok

AlitzelMendez commented 2 weeks ago

Hi @premun,

I tried a different approach for this alert in this pull request: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-service/pullrequest/40850

Can you take a look and let me know if this could work? The visual information is quite different. The result is as expected, but the main difference is that it only shows counts and has only one line with the failed machines.

I tweaked the threshold to 30% for this example, since all the machines are in a good state right now.

image

But only the machines with a failure rate above 80% are listed there, and the alert will trigger when the number of those machines is bigger than X.

Also, what number of machines would be appropriate?
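To make the proposed trigger condition concrete, here is a minimal Python sketch of the logic described above: list the machines whose failure rate is above 80%, and alert only when more than X of them are failing. This is not the Grafana query from the pull request, which is not shown in this thread. The `MachineStats` shape, the machine names, and the default `max_failing=2` (standing in for the undecided "X") are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineStats:
    """Work-item counts for one machine over the alert window (hypothetical shape)."""
    name: str
    failed: int
    total: int

    @property
    def failure_rate(self) -> float:
        # A machine with no runs gives no evidence of failure, so report 0.0.
        return self.failed / self.total if self.total else 0.0


def failing_machines(stats, rate_threshold=0.80):
    """Return sorted names of machines whose failure rate is strictly above the threshold."""
    return sorted(m.name for m in stats if m.failure_rate > rate_threshold)


def should_alert(stats, rate_threshold=0.80, max_failing=2):
    """Alert only when more than `max_failing` machines exceed the rate threshold.

    `max_failing` stands in for the "X" left open in the thread; 2 is a placeholder.
    """
    failing = failing_machines(stats, rate_threshold)
    return len(failing) > max_failing, failing


# Made-up sample data, not taken from the real dashboard.
fleet = [
    MachineStats("emu-01", failed=9, total=10),   # 90% failure rate
    MachineStats("emu-02", failed=17, total=20),  # 85% failure rate
    MachineStats("emu-03", failed=1, total=25),   # 4% failure rate
    MachineStats("emu-04", failed=0, total=0),    # no runs in the window
]

alert, names = should_alert(fleet)
print(alert, names)
```

With this sample data two machines are above 80%, so the alert stays quiet at `max_failing=2` and would fire at `max_failing=1`. That is the behavior wanted for the "couple of machines with device not found" case: the failing machines are still listed, but they do not page anyone until enough of them fail at once.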

dotnet-eng-status[bot] commented 3 days ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 3 days ago

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] commented 2 days ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 2 days ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 2 days ago

:green_heart: Metric state changed to ok

premun commented 2 days ago

Hey @AlitzelMendez,

I think this is a good solution to the problems of this alert. Can you show me the Grafana query if you have edited some panel in staging/prod Grafana with this already? I think it would be easier to review if I can see the changes live.

dotnet-eng-status[bot] commented 1 day ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 1 day ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 19 hours ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 16 hours ago

:green_heart: Metric state changed to ok

AlitzelMendez commented 14 hours ago

> Hey @AlitzelMendez,
>
> I think this is a good solution to the problems of this alert. Can you show me the Grafana query if you have edited some panel in staging/prod Grafana with this already? I think it would be easier to review if I can see the changes live.

@premun, I updated this panel with the changes: https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/mobileDevices/mobile-devices?tab=alert&viewPanel=19&orgId=1&editPanel=19 😄

premun commented 8 hours ago

Yeah, looks good!