dotnet / dnceng

.NET Engineering Services
MIT License
24 stars 18 forks source link

Production - [Alerting] Autoscale: Minutes to scale-up from zero machine alert #2495

Closed dotnet-eng-status[bot] closed 5 months ago

dotnet-eng-status[bot] commented 5 months ago

:broken_heart: Metric state changed to alerting

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes, there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Metric Graph

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change Grafana-Automated-Alert-Id-54aa0d7e647e46ff9e880bf6ae532b99
dotnet-eng-status[bot] commented 5 months ago

:green_heart: Metric state changed to ok

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes, there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Metric Graph

Go to rule

riarenas commented 5 months ago

ooh, not arm! looking

riarenas commented 5 months ago

The queue doesn't currently have any broken machines to investigate, but we did see a spike in VMs getting deleted by the VM cleaner around this time, which points to a transient problem.