dotnet / arcade

Tools that provide common build infrastructure for multiple .NET Foundation projects.
MIT License

Production - [Alerting] Autoscale: Minutes to scale-up from zero machine alert #12592

Closed dotnet-eng-status[bot] closed 1 year ago

dotnet-eng-status[bot] commented 1 year ago

:broken_heart: Metric state changed to alerting

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes, there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Metric Graph

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-54aa0d7e647e46ff9e880bf6ae532b99

dotnet-eng-status[bot] commented 1 year ago

:green_heart: Metric state changed to ok

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes, there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 1 year ago

:broken_heart: Metric state changed to alerting

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes, there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Metric Graph

Go to rule

dotnet-eng-status[bot] commented 1 year ago

:green_heart: Metric state changed to ok

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes, there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Metric Graph

Go to rule

MattGal commented 1 year ago

This is the usual problem that plagues these two "reunion"-named queues. Lots of VMs in the project reunion subscription are in this state:

...
    {
      "code": "PowerState/deallocated",
      "level": "Info",
      "displayStatus": "VM deallocated"
    }
...
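
For anyone repeating this check, a minimal sketch of counting VMs stuck in that state could look like the following. It assumes each VM's instance view has already been fetched and parsed into a dict with a `statuses` list whose entries look like the one above. The helper names `is_deallocated` and `count_deallocated` are hypothetical and are not part of any Azure SDK.

```python
# Status code taken from the instance-view excerpt above.
DEALLOCATED = "PowerState/deallocated"


def is_deallocated(instance_view: dict) -> bool:
    """Return True if any status entry reports the deallocated power state.

    Assumes the status entries sit under a "statuses" key (an assumption
    about the surrounding JSON, which is elided in the excerpt).
    """
    statuses = instance_view.get("statuses", [])
    return any(status.get("code") == DEALLOCATED for status in statuses)


def count_deallocated(instance_views: list) -> int:
    """Count how many of the given instance views are deallocated."""
    return sum(1 for view in instance_views if is_deallocated(view))


if __name__ == "__main__":
    # Illustrative data only; the first entry mirrors the excerpt above.
    views = [
        {"statuses": [{"code": "PowerState/deallocated",
                       "level": "Info",
                       "displayStatus": "VM deallocated"}]},
        {"statuses": [{"code": "PowerState/running",
                       "level": "Info",
                       "displayStatus": "VM running"}]},
    ]
    print(f"{count_deallocated(views)} of {len(views)} VMs are deallocated")
```

A queue whose VMs are all deallocated and cannot be reallocated has no machines to pick up work, which is the condition the alert describes.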

Checking the logs in the back end, it's also the usual problem:

Allocation failed. Please note that allocation for this subscription is constrained to a set of clusters, which may be out of capacity. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance. To remove the cluster constraint, please contact the subscription administrator or Microsoft Support.
...

There's nothing we can do about this other than encourage the owners of the subscription to change its settings. Closing this issue.