dotnet / dnceng

.NET Engineering Services
MIT License

Production - [Alerting] Servicing jobs in R&D queues alert #3876

Closed dotnet-eng-status[bot] closed 1 month ago

dotnet-eng-status[bot] commented 2 months ago

:broken_heart: Metric state changed to alerting

One or more servicing jobs were executed in an R&D queue. The expectation is that FR investigates why the jobs weren't redirected. The most common reasons are:

  • The job was sent to an on-prem queue. An on-prem queue is one that has osx, arm64, or perf within its name.
    • We don't have physical hardware for servicing work, so on-prem queues should be excluded from this effort. To fix the alert, we need to update the query and add the queue name to the third line, where we list on-prem queues.
  • The job was sent to a queue that doesn't have a corresponding servicing queue.
    • We need to create the missing queue in the helix machines repo.

Next steps:

  1. Go to https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/historical/backend-status?orgId=1&viewPanel=72
  2. Investigate every job in the table and decide whether we need to update the alert to exclude the job or create a servicing queue for it
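
The decision in step 2 can be sketched roughly as below. This is only an illustration of the rule described in this alert, not the actual alert query: the function names, the returned labels, the case-insensitive matching, and the `queues_with_servicing` set are all assumptions made for the example.

```python
# Substrings that mark an on-prem queue, per the alert description above.
ON_PREM_MARKERS = ("osx", "arm64", "perf")


def is_on_prem(queue_name: str) -> bool:
    """Return True if the queue name contains an on-prem marker.

    Matching is case-insensitive here; that is an assumption of this sketch.
    """
    name = queue_name.lower()
    return any(marker in name for marker in ON_PREM_MARKERS)


def triage(queue_name: str, queues_with_servicing: set[str]) -> str:
    """Suggest a follow-up for a servicing job that ran in an R&D queue.

    `queues_with_servicing` is a hypothetical set of R&D queue names that
    already have a corresponding servicing queue.
    """
    if is_on_prem(queue_name):
        # No physical hardware for servicing work: exclude the queue in the alert query.
        return "exclude-from-alert"
    known = {q.lower() for q in queues_with_servicing}
    if queue_name.lower() not in known:
        # No servicing counterpart exists yet: one needs to be created.
        return "create-servicing-queue"
    # A servicing queue exists, so the job should have been redirected.
    return "investigate-redirect"
```

For example, with made-up queue names, `triage("windows.11.arm64.open", set())` suggests excluding the queue from the alert, while a non-on-prem queue that is missing from `queues_with_servicing` suggests creating a servicing queue.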

For more context go here

Metric Graph

Go to rule

@dotnet/dnceng, @dotnet/prodconsvcs, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-5aa74f27ef6445ce9d3d8d3d382e7e35

dotnet-eng-status[bot] commented 2 months ago

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] commented 2 months ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 2 months ago

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] commented 2 months ago

:broken_heart: Metric state changed to alerting

dotnet-eng-status[bot] commented 2 months ago

:green_heart: Metric state changed to ok

riarenas commented 2 months ago

There are two sets of jobs that we are detecting as being sent to the wrong queue:

To fix this second scenario we need to either:

I don't have a good suggestion for what to do with this issue. If we need to keep testing in these Android queues, this will just open again.

@ilyas1974 for thoughts.

riarenas commented 1 month ago

This hasn't flipped in over a week, so I'm going to close this iteration. Keep in mind this might show up again, especially with the Android queue usage.