Production - [Alerting] High Number of Machines With Low Disk Space in Some Queue(s)

dotnet-eng-status[bot] commented 1 year ago

:broken_heart: Metric state changed to alerting

Percentage Low Disk {Queue=ubuntu.2004.s390x.experimental.open} 0.5

Metric Graph

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-2ca5b0285c1e4179b621f916b8b5e75f

dotnet-eng-status[bot] commented 1 year ago

:green_heart: Metric state changed to ok

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 1 year ago

:broken_heart: Metric state changed to alerting

Percentage Low Disk {Queue=ubuntu.2004.s390x.experimental.open} 0.5

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 1 year ago

:green_heart: Metric state changed to ok

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 1 year ago

:broken_heart: Metric state changed to alerting

Percentage Low Disk {Queue=ubuntu.2004.s390x.experimental.open} 0.5

Metric Graph

Go to rule

riarenas commented 1 year ago

This alert keeps flip-flopping specifically for this queue. There are two on-prem machines in there. What's the procedure here? I can't find any reference to this alert in our wiki. Is this something DDFUN should handle?

garath commented 1 year ago

Hm, a percentage-based trigger on a queue that only has two machines is going to be noisy.
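
To make the noise concrete: with only two machines the metric can take just three values (0, 0.5, 1), so one machine dipping into low disk moves it straight to 0.5, the value reported for this queue. A minimal Python sketch of that arithmetic follows; the function names are hypothetical and are not part of the actual alert rule.

```python
from typing import List


def low_disk_fraction(low_disk_machines: int, total_machines: int) -> float:
    """Fraction of machines in a queue that are low on disk."""
    if total_machines <= 0:
        raise ValueError("queue must contain at least one machine")
    if not 0 <= low_disk_machines <= total_machines:
        raise ValueError("low_disk_machines must be between 0 and total_machines")
    return low_disk_machines / total_machines


def reachable_values(total_machines: int) -> List[float]:
    """Every value the metric can take for a queue of the given size."""
    return [low_disk_fraction(n, total_machines) for n in range(total_machines + 1)]


def step_per_machine(total_machines: int) -> float:
    """How far the metric moves when a single machine changes state."""
    return low_disk_fraction(1, total_machines)
```

For a two-machine queue, `reachable_values(2)` contains only 0, 0.5 and 1, so any threshold below 0.5 fires as soon as a single machine is low on disk. Larger queues move in much smaller steps per machine.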

I think DDFUN could handle it but what would we ask them to do?

The value of this alert is tricky to me. It is a very coarse measurement (but the best we could do in the moment).

I wonder, is there a job running the entire duration of the alert? Or does the "low disk" condition span multiple jobs? If the former, which is what I expect, then we should try to make the alert smarter. If the latter, then... I'll need to think some more.

premun commented 1 year ago

Maybe it should be evaluating it for longer than 5m? I.e. a job can clean up and it would resolve before alerting?

garath commented 1 year ago

That seems like a good, easy improvement. Evaluate every five minutes for... 2 hours? Something like that.
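
A rough sketch of the "evaluate every five minutes for 2 hours" idea, in Python: the alert only fires once the metric has stayed above the threshold continuously for the whole hold period, so a job that cleans up after itself resets the clock. This is an illustration of the idea only, not how the Grafana rule is actually configured; the function name, the state names and the threshold used in the usage note are assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def sustained_alert_states(
    samples: Iterable[Tuple[datetime, float]],
    threshold: float,
    hold_for: timedelta,
) -> List[Tuple[datetime, str]]:
    """Classify each sample as 'ok', 'pending' or 'alerting'.

    `samples` must be sorted by timestamp. A breach becomes 'alerting'
    only after it has lasted at least `hold_for` without interruption.
    """
    breach_start = None  # timestamp of the first sample in the current breach
    states = []
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= hold_for:
                state = "alerting"
            else:
                state = "pending"
        else:
            # Any sample at or below the threshold resets the breach.
            breach_start = None
            state = "ok"
        states.append((timestamp, state))
    return states
```

With samples taken every five minutes, a hypothetical threshold of 0.2 and `hold_for=timedelta(hours=2)`, a 30-minute spike to 0.5 never leaves 'pending', while a breach that persists moves to 'alerting' at the two-hour mark.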

dotnet-eng-status[bot] commented 11 months ago

:broken_heart: Metric state changed to alerting

Percentage Low Disk {Queue=windows.10.amd64.open.rt} 0.23076923076923078

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:green_heart: Metric state changed to ok

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:broken_heart: Metric state changed to alerting

Percentage Low Disk {Queue=windows.10.amd64.open.rt} 0.24

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:green_heart: Metric state changed to ok

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:green_heart: Metric state changed to ok

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:broken_heart: Metric state changed to alerting

Percentage Low Disk {Queue=windows.10.amd64.open.rt} 0.42857142857142855

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:green_heart: Metric state changed to ok

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:broken_heart: Metric state changed to alerting

Percentage Low Disk {Queue=windows.10.amd64.open.rt} 0.3333333333333333

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:green_heart: Metric state changed to ok

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:broken_heart: Metric state changed to alerting

Percentage Low Disk {Queue=windows.10.amd64.open.rt} 0.25

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:green_heart: Metric state changed to ok

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:broken_heart: Metric state changed to alerting

Percentage Low Disk {Queue=windows.10.amd64.open.rt} 0.30303030303030304

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:broken_heart: Metric state changed to alerting

Percentage Low Disk {Queue=windows.10.amd64.open.rt} 0.20833333333333334

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:green_heart: Metric state changed to ok

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:broken_heart: Metric state changed to alerting

Percentage Low Disk {Queue=windows.10.amd64.open.rt} 0.24390243902439024

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:green_heart: Metric state changed to ok

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:broken_heart: Metric state changed to alerting

Percentage Low Disk {Queue=windows.10.amd64.open.rt} 0.47619047619047616

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 11 months ago

:green_heart: Metric state changed to ok

Metric Graph

Go to rule

ilyas1974 commented 11 months ago

Closing this alert as the work to correct this is being done as part of https://github.com/dotnet/dnceng/issues/846

Production - [Alerting] High Number of Machines With Low Disk Space in Some Queue(s) #677