dotnet / dnceng

.NET Engineering Services

MIT License

Production - [Alerting] Servicing jobs in R&D queues alert #3876

Closed dotnet-eng-status[bot] closed 1 month ago

dotnet-eng-status[bot] commented 2 months ago

:broken_heart: Metric state changed to alerting

One or more servicing jobs were executed in an R&D queue. The expectation is that FR investigates why the jobs weren't redirected. The most common reasons are:

The job was sent to an on-prem queue. An on-prem queue is one that has osx, arm64, or perf in its name.

We don't have physical hardware for servicing work, so on-prem queues should be excluded from this effort. To fix the alert, we need to update the query and add the queue name to the third line, where the on-prem queues are listed.

The job was sent to a queue that doesn't have a corresponding servicing queue.

We need to create the missing queue in the helix machines repo.
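The two cases above can be sketched as a small triage helper. This is only an illustration, not the actual alert query: the function names, the returned labels, the case-insensitive match, and the sample queue names are all assumptions. Only the osx, arm64, and perf markers come from the alert text.

```python
# Markers that identify an on-prem queue, per the alert description.
ON_PREM_MARKERS = ("osx", "arm64", "perf")


def is_on_prem_queue(queue_name: str) -> bool:
    """Return True if the queue name contains an on-prem marker.

    Case-insensitive matching is an assumption of this sketch.
    """
    lowered = queue_name.lower()
    return any(marker in lowered for marker in ON_PREM_MARKERS)


def triage(queue_name: str, queues_with_servicing_counterpart: set[str]) -> str:
    """Suggest a follow-up for a servicing job that ran in an R&D queue.

    `queues_with_servicing_counterpart` is a hypothetical input: the set of
    R&D queue names known to have a matching servicing queue.
    """
    if is_on_prem_queue(queue_name):
        # No physical hardware for servicing: exclude the queue in the alert query.
        return "exclude-from-alert"
    if queue_name not in queues_with_servicing_counterpart:
        # No servicing queue exists yet: it has to be created.
        return "create-servicing-queue"
    # A servicing queue exists, so the missing redirect needs a closer look.
    return "investigate-redirection"


if __name__ == "__main__":
    known = {"ubuntu.2204.amd64.open"}  # hypothetical queue name
    for name in ("osx.13.amd64.open", "ubuntu.2204.amd64.open", "some.new.queue"):
        print(name, "->", triage(name, known))
```

The third label covers jobs that match neither common reason and still need manual investigation.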

Next steps:

Go to https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/historical/backend-status?orgId=1&viewPanel=72

Investigate every job in the table and decide whether we need to update the alert to exclude the job or create a servicing queue for it.

For more context go here

ServicingJobs 5

Metric Graph

Go to rule

@dotnet/dnceng, @dotnet/prodconsvcs, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-5aa74f27ef6445ce9d3d8d3d382e7e35

dotnet-eng-status[bot] commented 2 months ago

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] commented 2 months ago

:broken_heart: Metric state changed to alerting

ServicingJobs 3

dotnet-eng-status[bot] commented 2 months ago

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] commented 2 months ago

:broken_heart: Metric state changed to alerting

ServicingJobs 6

dotnet-eng-status[bot] commented 2 months ago

:green_heart: Metric state changed to ok

riarenas commented 2 months ago

There are two sets of jobs that we are detecting as being sent to the wrong queue:

dotnet-monitor, for the branch refs/heads/test/release/token-updates. This is an edge case in that it seems to be a test branch, so I'm OK with those jobs being sent to R&D. If we see this for a real release branch we can chat with the repo owners.
dotnet/runtime is sending jobs to every android queue in the release/9.0 branch. As we only have a single android queue in the servicing subscription, these are being flagged here.
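For anyone repeating this triage, grouping the flagged jobs by repository and branch is one way to arrive at a summary like the one above. This is a rough sketch only: the record shape (`repo`, `branch`, `queue` keys) and the queue names are hypothetical and do not reflect the real schema behind the Grafana table.

```python
from collections import defaultdict


def group_flagged_jobs(jobs: list[dict]) -> dict[tuple[str, str], list[str]]:
    """Group flagged jobs by (repo, branch) and list the distinct queues used."""
    grouped: dict[tuple[str, str], set[str]] = defaultdict(set)
    for job in jobs:
        grouped[(job["repo"], job["branch"])].add(job["queue"])
    # Sort the queue names so the summary is stable between runs.
    return {key: sorted(queues) for key, queues in grouped.items()}


if __name__ == "__main__":
    # Hypothetical rows; the queue names are made up for illustration.
    flagged = [
        {"repo": "dotnet/runtime", "branch": "release/9.0", "queue": "android.queue.b"},
        {"repo": "dotnet/runtime", "branch": "release/9.0", "queue": "android.queue.a"},
        {"repo": "dotnet/runtime", "branch": "release/9.0", "queue": "android.queue.a"},
    ]
    for (repo, branch), queues in group_flagged_jobs(flagged).items():
        print(f"{repo} @ {branch}: {', '.join(queues)}")
```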

To fix this second scenario, we need to do one of the following:

Add all the missing queues to the svc subscription
Stop sending tests to those queues for release branches
Continue working on fixing up this space: https://github.com/dotnet/dnceng/issues/3593

I don't have a good suggestion for what to do with this issue. If we need to keep testing in these android queues, this will just open again.

@ilyas1974 for thoughts.

riarenas commented 1 month ago

This hasn't flipped in over a week, so I'm going to close this iteration. Keep in mind this might show up again, especially the android queue usage.