Closed dotnet-eng-status[bot] closed 1 month ago
:green_heart: Metric state changed to ok
One or more servicing jobs were executed in an R&D queue. The expectation is that FR investigates why the jobs weren't redirected. The most common reasons are:
- The job was sent to an on-prem queue. An on-prem queue is one that has osx, arm64 or perf in its name.
  - We don't have physical hardware for servicing work, so on-prem queues should be excluded from this effort. To fix the alert, update the query and add the queue name to the third line, where the on-prem queues are listed.
- The job was sent to a queue that doesn't have a corresponding servicing queue.
  - We need to create the missing queue in the helix machines repo.
Next steps:
- Go to https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/historical/backend-status?orgId=1&viewPanel=72
- Investigate every job in the table and decide whether we need to update the alert to exclude the job or create a servicing queue for it.
For more context go here
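The triage rule described in this alert can be sketched in a few lines of Python. This is only an illustration, not the alert's actual query: the function names, the returned labels, the case-insensitive match, and the example queue names are all assumptions. Only the three substrings (osx, arm64, perf) come from the alert text.

```python
# Substrings that mark a queue as on-prem, per the alert description.
ON_PREM_MARKERS = ("osx", "arm64", "perf")


def is_on_prem(queue_name: str) -> bool:
    """Return True if the queue name contains any on-prem marker.

    Matching is case-insensitive here; that is an assumption, since the
    alert text does not say how the real query compares names.
    """
    name = queue_name.lower()
    return any(marker in name for marker in ON_PREM_MARKERS)


def triage(queue_name: str, has_servicing_queue: bool) -> str:
    """Suggest a follow-up action for a servicing job seen in an R&D queue.

    `has_servicing_queue` is supplied by the caller, because how a
    servicing queue maps to an R&D queue is not described in the alert.
    The returned labels are hypothetical names for the two fixes above.
    """
    if is_on_prem(queue_name):
        # No physical hardware for servicing work: exclude from the alert.
        return "exclude-from-alert"
    if not has_servicing_queue:
        # The matching servicing queue is missing and needs to be created.
        return "create-servicing-queue"
    # Neither common reason applies, so the redirect needs a closer look.
    return "investigate-redirect"


if __name__ == "__main__":
    # Hypothetical queue names, for illustration only.
    for queue, has_servicing in [
        ("example.osx.queue", False),
        ("example.linux.queue", False),
        ("example.windows.queue", True),
    ]:
        print(queue, "->", triage(queue, has_servicing))
```

A plain substring match is the simplest reading of "has osx, arm64 or perf within the name"; it would also flag any unrelated name that happens to contain one of those strings, so a real check may need to be stricter.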
:broken_heart: Metric state changed to alerting
:green_heart: Metric state changed to ok
:broken_heart: Metric state changed to alerting
:green_heart: Metric state changed to ok
There are two sets of jobs that we are detecting as being sent to the wrong queue:
To fix this second scenario we need to either:
I don't have a good suggestion for what to do with this issue. If we need to keep testing in these android queues, this will just open again.
@ilyas1974 for thoughts.
This hasn't flipped in over a week, so I'm going to close this iteration. Keep in mind this might show up again, especially the android queue usage.
:broken_heart: Metric state changed to alerting
Go to rule
@dotnet/dnceng, @dotnet/prodconsvcs, please investigate
Automation information below, do not change
Grafana-Automated-Alert-Id-5aa74f27ef6445ce9d3d8d3d382e7e35