dotnet / dnceng

.NET Engineering Services

Production - [Alerting] Servicing jobs in R&D queues alert #1772

Closed dotnet-eng-status[bot] closed 9 months ago

dotnet-eng-status[bot] commented 10 months ago

:broken_heart: Metric state changed to alerting

One or more servicing jobs were executed in an R&D queue. FR is expected to investigate why the jobs weren't redirected. The most common reasons are:

  • The job was sent to an on-prem queue, i.e. one that has osx, arm64 or perf in its name
    • We don't have physical hardware for servicing work, so on-prem queues should be excluded from this effort. To fix the alert, update the query and add the queue name to the third line, where the on-prem queues are listed.
  • The job was sent to a queue that doesn't have a corresponding servicing queue
    • We need to create the missing queue in the helix machines repo

Next steps:

  1. Go to https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/historical/backend-status?orgId=1&viewPanel=72
  2. Investigate every job in the table and decide whether we need to update the alert to exclude the job or create a servicing queue for it
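
The triage described above can be sketched roughly as follows. This is only an illustration in Python, not part of the alert tooling: the function name `triage` and its return strings are hypothetical, the name markers come from the alert text, and the `.svc` suffix is the servicing-queue convention used elsewhere in this thread.

```python
# Hypothetical triage helper; the markers come from the alert text above.
ON_PREM_MARKERS = ("osx", "arm64", "perf")


def triage(queue_name: str) -> str:
    """Suggest a follow-up for a servicing job that ran in `queue_name`."""
    name = queue_name.lower()
    if name.endswith(".svc"):
        # Already a servicing queue, nothing to do.
        return "ok"
    if any(marker in name for marker in ON_PREM_MARKERS):
        # On-prem by name: add the queue to the exclusion list in the query.
        return "exclude"
    # Otherwise the matching servicing queue is presumably missing.
    return f"create {queue_name}.svc"


print(triage("Windows.11.Amd64.Pixel.Perf"))
print(triage("windows.11.amd64.android.open"))
```

Note that the name heuristic is only a first guess: a queue can be on-prem without having osx, arm64 or perf in its name, so a "create" suggestion still needs a human check.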

For more context go here

Metric Graph

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-5aa74f27ef6445ce9d3d8d3d382e7e35

Release Note Description

Updated saved queries to avoid https://github.com/dotnet/dnceng/issues/1772 in the future

dougbu commented 10 months ago

Jobs
| where JobId == "23753888" or JobId == "23753887"

shows

| JobId | Source | Type | Queued | Started | Finished | QueueName | Repository | Branch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 23,753,887 | pr/public/dotnet/runtime/refs/pull/96634/merge | test/functional/cli/innerloop/ | 2024-01-08T19:00:25.011Z | 2024-01-08T19:00:25.042Z | 2024-01-08T19:03:47.778Z | windows.11.amd64.android.open | dotnet/runtime | refs/pull/96634/merge |
| 23,753,888 | pr/public/dotnet/runtime/refs/pull/96634/merge | test/functional/cli/innerloop/ | 2024-01-08T19:00:26.105Z | 2024-01-08T19:00:26.137Z | 2024-01-08T19:05:48.122Z | windows.11.amd64.android.open | dotnet/runtime | refs/pull/96634/merge |

dougbu commented 10 months ago

windows.11.amd64.android.open is an on-prem queue but I'm not sure how the query found the jobs. the closest queue filter would be QueueName =~ 'windows.10.amd64.android.open'. Is that close enough @garath❔

in any case, I'll look tomorrow to see if these jobs were for an unusual servicing PR or some such. also possible their "innerloop" wasn't supposed to be triggered in servicing. probably time to create windows.11.amd64.android.open.svc if not.

dotnet-eng-status[bot] commented 10 months ago

:green_heart: Metric state changed to ok

garath commented 10 months ago

> windows.11.amd64.android.open is an on-prem queue but I'm not sure how the query found the jobs. the closest queue filter would be QueueName =~ 'windows.10.amd64.android.open'. Is that close enough @garath❔

Sorry, I don't understand the question. Please rephrase?

dotnet-eng-status[bot] commented 10 months ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 10 months ago

:green_heart: Metric state changed to ok

Description and instructions for this alert

Helix job from a release branch is running in a non-svc queue.

Metric Graph

Go to rule

dougbu commented 10 months ago

> > windows.11.amd64.android.open is an on-prem queue but I'm not sure how the query found the jobs. the closest queue filter would be QueueName =~ 'windows.10.amd64.android.open'. Is that close enough @garath❔
>
> Sorry, I don't understand the question. Please rephrase?

I think I answered my own question… the query for this alert is

let UntrackedQueues = Jobs
| project QueueName = tolower(QueueName)
| where QueueName contains "osx" or QueueName contains "perf" or QueueName contains 'arm'
    or QueueName contains "arcade" or QueueName contains "xaml" or QueueName contains "appcompat"
    or QueueName contains "iot" or QueueName contains '.reunion'
    or QueueName =~ 'windows.10.amd64.android.open'
    or QueueName contains '.s390x.' or QueueName contains 'ppc64le.experimental'
| distinct QueueName;
Jobs
| where $__timeFilter(Queued)
| where tolower(QueueName) !in (UntrackedQueues)
| extend TargetBranch=parse_json(Properties)["System.PullRequest.TargetBranch"]
| where (Branch contains "/release/" or Branch startswith "release/" or TargetBranch startswith "release/" or TargetBranch contains "/release/")
    and QueueName !endswith ".svc"
| project JobId, Queued, Repository, Branch, TargetBranch, QueueName

I also found docs for =~ mentioning the operator is the case-insensitive version of ==. surprised we don't use it more often. (contains is case-insensitive)

given that, windows.11.amd64.android.open is not an untracked queue even though it is an on-premise queue (it's the public version of Windows.11.Amd64.Pixel.Perf). I think the right thing to do here is add QueueName =~ 'windows.11.amd64.android.open' to the query. Agreed❔
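
To make the reasoning above easier to check, here is a rough Python model of the alert query. It is a sketch, not the production code: the function names are made up for illustration, and it assumes only that Kusto `contains`, `startswith` and `!endswith` behave as case-insensitive string tests and that `=~` is case-insensitive equality. The target branch `release/8.0` in the usage lines is a hypothetical value, since the thread does not show the PR's actual target branch.

```python
# Filter terms copied from the alert query quoted above.
UNTRACKED_SUBSTRINGS = [
    "osx", "perf", "arm", "arcade", "xaml", "appcompat", "iot",
    ".reunion", ".s390x.", "ppc64le.experimental",
]
UNTRACKED_EXACT = ["windows.10.amd64.android.open"]


def is_untracked(queue_name: str) -> bool:
    """Model of the UntrackedQueues filter (contains / =~)."""
    name = queue_name.lower()
    return any(s in name for s in UNTRACKED_SUBSTRINGS) or name in UNTRACKED_EXACT


def is_release(ref: str) -> bool:
    """Model of: contains "/release/" or startswith "release/"."""
    ref = ref.lower()
    return "/release/" in ref or ref.startswith("release/")


def would_alert(queue_name: str, branch: str, target_branch: str = "") -> bool:
    """True if the modeled query would list a job with these values."""
    if is_untracked(queue_name):
        return False
    servicing = is_release(branch) or is_release(target_branch)
    return servicing and not queue_name.lower().endswith(".svc")


# "release/8.0" is a hypothetical target branch used only for illustration.
print(is_untracked("windows.10.amd64.android.open"))  # True
print(is_untracked("windows.11.amd64.android.open"))  # False
print(would_alert("windows.11.amd64.android.open",
                  "refs/pull/96634/merge", "release/8.0"))  # True
```

In this model the windows.11 queue matches none of the substring terms and differs from the single exact-match entry, which is consistent with the jobs showing up in the alert.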


PS.

  1. I don't understand QueueName contains "arcade" in the query b/c windows.10.amd64[.open].arcade are the last queues that match and they aren't on-premise queues. Is that b/c we just decided not to create .svc variants❔
  2. QueueName contains "xaml" and QueueName contains '.reunion' are historical only and can be removed.
  3. QueueName contains 'arm' should be removed. these days, the non-"perf", non-"osx" queues w/ "arm" in their names are all in scale sets. removing this will likely have a ripple effect — leading us to create more .svc queues. this part can wait for @ilyas1974 to return and confirm.
dotnet-eng-status[bot] commented 9 months ago

:broken_heart: Metric state changed to alerting

dougbu commented 9 months ago

updated the list and graph queries (could those be shared somehow❔) to add QueueName =~ 'windows.11.amd64.android.open' and remove QueueName contains "xaml" and QueueName contains '.reunion'. still have an open question about (1) and a hoped-for confirmation from @ilyas1974 about (3); didn't take action on either.

the graph and list have returned to normal i.e., No data and this alert should be ok soon.
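
The effect of that change can be sketched in Python as a before/after comparison of the filter terms. This is an illustrative model only (the variable and function names are not from the dashboards), assuming `contains` is a case-insensitive substring match and `=~` is case-insensitive equality.

```python
# Terms from the query as originally quoted in this thread.
before_substrings = {
    "osx", "perf", "arm", "arcade", "xaml", "appcompat", "iot",
    ".reunion", ".s390x.", "ppc64le.experimental",
}
before_exact = {"windows.10.amd64.android.open"}

# The update described above: drop two historical terms, add one exact match.
after_substrings = before_substrings - {"xaml", ".reunion"}
after_exact = before_exact | {"windows.11.amd64.android.open"}


def is_untracked(queue_name, substrings, exact):
    """Model of the UntrackedQueues filter for a given set of terms."""
    name = queue_name.lower()
    return any(s in name for s in substrings) or name in exact


queue = "windows.11.amd64.android.open"
print(is_untracked(queue, before_substrings, before_exact))  # False
print(is_untracked(queue, after_substrings, after_exact))    # True
```

Once the queue counts as untracked, jobs sent to it drop out of the list, which matches the "No data" result reported above.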

dougbu commented 9 months ago

fix in !36576

dotnet-eng-status[bot] commented 9 months ago

:green_heart: Metric state changed to ok

dougbu commented 9 months ago

note: PR above is for longer-term protection; the production dashboards would otherwise get reset during the next rollout

dougbu commented 9 months ago

confirmed the rollout made the right prod changes today :grin: