Closed dotnet-eng-status[bot] closed 9 months ago
Jobs
| where JobId == "23753888" or JobId == "23753887"
shows

| JobId | Source | Type | Queued | Started | Finished | QueueName | Repository | Branch |
|---|---|---|---|---|---|---|---|---|
| 23,753,887 | pr/public/dotnet/runtime/refs/pull/96634/merge | test/functional/cli/innerloop/ | 2024-01-08T19:00:25.011Z | 2024-01-08T19:00:25.042Z | 2024-01-08T19:03:47.778Z | windows.11.amd64.android.open | dotnet/runtime | refs/pull/96634/merge |
| 23,753,888 | pr/public/dotnet/runtime/refs/pull/96634/merge | test/functional/cli/innerloop/ | 2024-01-08T19:00:26.105Z | 2024-01-08T19:00:26.137Z | 2024-01-08T19:05:48.122Z | windows.11.amd64.android.open | dotnet/runtime | refs/pull/96634/merge |
windows.11.amd64.android.open is an on-prem queue but I'm not sure how the query found the jobs. the closest queue filter would be QueueName =~ 'windows.10.amd64.android.open'. Is that close enough @garath❔
in any case, I'll look tomorrow to see if these jobs were for an unusual servicing PR or some such. also possible their "innerloop" wasn't supposed to be triggered in servicing. probably time to create windows.11.amd64.android.open.svc if not.
:green_heart: Metric state changed to ok
One or more servicing jobs were executed in an R&D queue. The expectation is that FR investigates why the jobs weren't redirected. The most common reasons are:
- The job was sent to an on-prem queue. An on-prem queue is one that has osx, arm64 or perf within the name.
  - We don't have physical hardware for servicing work, so on-prem queues should be excluded from this effort. To fix the alert, we need to update the query and add the queue name to the third line, where we list on-prem queues.
- The job was sent to a queue that doesn't have a corresponding servicing queue.
  - We need to create the missing queue in the helix machines repo.
Next steps:
- Go to https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/historical/backend-status?orgId=1&viewPanel=72
- Investigate every job in the table and decide if we need to update the alert to exclude the job or if we need to create a servicing queue for it.
For more context go here
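The triage described in the alert text can be sketched roughly as follows. This is only an illustration in Python, not part of the alert tooling: the `triage` function, the `Action` names, and the `known_queues` argument are hypothetical, and the `.svc` naming simply follows the "append `.svc`" pattern used in this thread.

```python
from enum import Enum
from typing import Iterable

# Substrings the alert text gives as markers of an on-prem queue.
ON_PREM_MARKERS = ("osx", "arm64", "perf")


class Action(Enum):
    EXCLUDE_FROM_ALERT = "add the queue to the on-prem list in the alert query"
    CREATE_SVC_QUEUE = "create the missing .svc queue"
    INVESTIGATE = "a .svc queue exists; find out why the job was not redirected"


def triage(queue_name: str, known_queues: Iterable[str]) -> Action:
    """Pick a follow-up for a servicing job that ran in a non-.svc queue."""
    name = queue_name.lower()
    # First reason in the alert text: the queue name looks on-prem.
    if any(marker in name for marker in ON_PREM_MARKERS):
        return Action.EXCLUDE_FROM_ALERT
    # Second reason: there is no corresponding servicing queue.
    if name + ".svc" not in {q.lower() for q in known_queues}:
        return Action.CREATE_SVC_QUEUE
    return Action.INVESTIGATE
```

The name check is only a first pass. windows.11.amd64.android.open is on-prem but contains none of the three markers, so a sketch like this would suggest creating a .svc queue for it, and a person still has to confirm what hardware backs the queue.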
windows.11.amd64.android.open is an on-prem queue but I'm not sure how the query found the jobs. the closest queue filter would be QueueName =~ 'windows.10.amd64.android.open'. Is that close enough @garath❔
Sorry, I don't understand the question. Please rephrase?
:broken_heart: Metric state changed to alerting
:green_heart: Metric state changed to ok
Description and instructions for this alert
Helix job from a release branch is running in a non-svc queue.
I think I answered my own question… the query for this alert is
let UntrackedQueues = Jobs
| project QueueName = tolower(QueueName)
| where QueueName contains "osx" or QueueName contains "perf" or QueueName contains 'arm' or QueueName contains "arcade" or QueueName contains "xaml" or QueueName contains "appcompat" or QueueName contains "iot" or QueueName contains '.reunion' or QueueName =~ 'windows.10.amd64.android.open' or QueueName contains '.s390x.' or QueueName contains 'ppc64le.experimental'
| distinct QueueName;
Jobs
| where $__timeFilter(Queued)
| where tolower(QueueName) !in (UntrackedQueues)
| extend TargetBranch=parse_json(Properties)["System.PullRequest.TargetBranch"]
| where (Branch contains "/release/" or Branch startswith "release/" or TargetBranch startswith "release/" or TargetBranch contains "/release/") and QueueName !endswith ".svc"
| project JobId, Queued, Repository, Branch, TargetBranch, QueueName
I also found docs for =~ mentioning the operator is the case-insensitive version of ==. surprised we don't use it more often. (contains is case-insensitive)
given that, windows.11.amd64.android.open is not an untracked queue though it is an on-premise queue; it's the public version of Windows.11.Amd64.Pixel.Perf. I think the right thing to do here is add QueueName =~ 'windows.11.amd64.android.open' to the query. Agreed❔
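To make the reasoning above concrete, here is a small Python model of the alert query's filters. It is a sketch, not the query itself: the function names are made up, and the target branch `release/8.0` used in the usage note is an assumed value, since the job table does not show TargetBranch.

```python
# Filters copied from the UntrackedQueues clause of the alert query.
# "contains" is a case-insensitive substring match; "=~" is a
# case-insensitive equality test.
CONTAINS = ["osx", "perf", "arm", "arcade", "xaml", "appcompat", "iot",
            ".reunion", ".s390x.", "ppc64le.experimental"]
EQUALS = ["windows.10.amd64.android.open"]


def is_untracked(queue_name, contains=CONTAINS, equals=EQUALS):
    """True when the queue matches one of the exclusion filters."""
    name = queue_name.lower()
    return any(s in name for s in contains) or name in equals


def is_release_branch(*branches):
    """True when any given branch looks like a release branch."""
    return any(b.lower().startswith("release/") or "/release/" in b.lower()
               for b in branches if b)


def would_alert(queue_name, branch, target_branch=None, equals=EQUALS):
    """Model of the final filter: tracked queue, release branch, not .svc."""
    return (not is_untracked(queue_name, equals=equals)
            and is_release_branch(branch, target_branch)
            and not queue_name.lower().endswith(".svc"))
```

With the filters as written, `would_alert("windows.11.amd64.android.open", "refs/pull/96634/merge", "release/8.0")` is true because the exact-match filter only names the windows.10 queue. Passing `equals=EQUALS + ["windows.11.amd64.android.open"]` makes it false, which is the effect of the proposed change.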
PS.
1. QueueName contains "arcade" is in the query b/c windows.10.amd64[.open].arcade are the last queues that match and they aren't on-premise queues. Is that b/c we just decided not to create .svc variants❔
2. QueueName contains "xaml" and QueueName contains '.reunion' are historical only and can be removed.
3. QueueName contains 'arm' should be removed. these days, the non-"perf", non-"osx" queues w/ "arm" in their names are all in scale sets. removing this will likely have a ripple effect, leading us to create more .svc queues. this part can wait for @ilyas1974 to return and confirm.

:broken_heart: Metric state changed to alerting
updated the list and graph queries (could those be shared somehow❔) to add QueueName =~ 'windows.11.amd64.android.open' and remove QueueName contains "xaml" and QueueName contains '.reunion'. still have an open question about (1) and a hoped-for confirmation from @ilyas1974 about (3); didn't take action on either.
the graph and list have returned to normal i.e., No data and this alert should be ok soon.
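On the question of sharing the filter between the list and graph queries, one possible approach is to render the exclusion clause from a single list and paste it into both. The Python below is a hedged sketch of that idea only; `kql_string` and `untracked_where_clause` are hypothetical helpers and nothing like them is known to exist in the dashboard tooling.

```python
def kql_string(value):
    """Render a single-quoted KQL string literal, escaping \\ and '."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def untracked_where_clause(contains, equals):
    """Build one '| where ...' line from substring and exact-match filters."""
    terms = [f"QueueName contains {kql_string(s)}" for s in contains]
    terms += [f"QueueName =~ {kql_string(s)}" for s in equals]
    return "| where " + " or ".join(terms)


# Filter lists as they stand after the update described above.
CONTAINS = ["osx", "perf", "arm", "arcade", "appcompat", "iot",
            ".s390x.", "ppc64le.experimental"]
EQUALS = ["windows.10.amd64.android.open", "windows.11.amd64.android.open"]

clause = untracked_where_clause(CONTAINS, EQUALS)
```

Both queries would then embed the same `clause`, so a filter added for one cannot be forgotten in the other.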
:green_heart: Metric state changed to ok
note: PR above is for longer-term protection; the production dashboards would otherwise get reset during the next rollout
confirmed the rollout made the right prod changes today :grin:
:broken_heart: Metric state changed to alerting
Go to rule
@dotnet/dnceng, please investigate
Automation information below, do not change
Grafana-Automated-Alert-Id-5aa74f27ef6445ce9d3d8d3d382e7e35
Release Note Description
Updated saved queries to avoid https://github.com/dotnet/dnceng/issues/1772 in the future