actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

`gha_job_execution_duration_seconds_sum` reports wrong value in some cases #3731

Open · hpedrorodrigues opened this issue 2 months ago

hpedrorodrigues commented 2 months ago

Controller Version

0.9.3

Deployment Method

Helm

To Reproduce

1. Install `gha-runner-scale-set-controller` using the Helm chart via FluxCD
2. Install a few `gha-runner-scale-set`s using the Helm chart via FluxCD
3. Run a few workflows that use these runner sets, canceling some of them either manually or via `concurrency.group` (see the sketch below)
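
For illustration, cancellation via `concurrency.group` behaves roughly like this (a minimal sketch, not the reporter's actual workflow; the workflow name, trigger, and job name are placeholders, with `cp-small` borrowed from the runner set values further down):

```yaml
# Minimal illustration: any newer run in the same concurrency group cancels the
# in-progress one, producing jobs with job_result="canceled".
name: concurrency-cancel-example
on:
  push:
    branches: [master]
concurrency:
  group: deploy-${{ github.ref }}
  cancel-in-progress: true
jobs:
  build:
    runs-on: cp-small   # runner scale set name from the HelmRelease below
    steps:
      - run: echo "work that may be canceled by a newer commit"
```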

Describe the bug

In a few cases (I don't know the exact reason yet), the listener reports the metric `gha_job_execution_duration_seconds_sum` with a clearly wrong value.

Example:

gha_job_execution_duration_seconds_sum{enterprise="",event_name="repository_dispatch",job_name="create-gh-deployment",job_result="canceled",job_workflow_ref="[redacted]/.github/workflows/gh-deployment.yml@refs/heads/master",organization="[redacted]",repository="[redacted]",runner_id="0",runner_name=""} 1.27722295721e+11

Looking at the repository, all runs take less than 60 seconds to finish. The others are canceled before they even start because the branch has received a new commit. For scale, 1.27722295721e+11 seconds is on the order of 4,000 years, so the reported value cannot be a real job duration.

[Two screenshots attached, taken 2024-09-05 at 14:26]

Describe the expected behavior

I'm not sure whether this is caused only by canceled runs, but I'd expect the listener to report 0 for such runs.

Additional Context

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: arc-controller
  namespace: arc
spec:
  chart:
    spec:
      chart: gha-runner-scale-set-controller
      sourceRef:
        name: arc
        kind: HelmRepository
        namespace: flux-system
      version: '>=0.9.3'
  interval: 1m
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace
  values:
    replicaCount: 1
    image:
      repository: [redacted]
    serviceAccount:
      create: true
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 200m
        memory: 200Mi
    metrics:
      controllerManagerAddr: ':8080'
      listenerAddr: ':8080'
      listenerEndpoint: '/metrics'
    flags:
      logFormat: 'json'
      watchSingleNamespace: 'arc'
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cp-small-runner-set
  namespace: arc
spec:
  chart:
    spec:
      chart: gha-runner-scale-set
      sourceRef:
        name: arc
        kind: HelmRepository
        namespace: flux-system
      version: '>=0.9.3'
  interval: 1m
  values:
    githubConfigUrl: [redacted]
    githubConfigSecret: gh-app-secret
    maxRunners: 10
    minRunners: 0
    runnerGroup: default
    runnerScaleSetName: cp-small
    containerMode:
      type: dind
    template:
      metadata:
        annotations:
          cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
      spec:
        nodeSelector:
          spot: 'false'
          dedicated-for: github-actions
        tolerations:
          - effect: NoSchedule
            key: dedicated-for
            value: github-actions-2x
        containers:
          - name: runner
            image: arc-default-runner
            command: ['/home/runner/run.sh']
            resources:
              requests:
                cpu: 2
                memory: 4Gi
              limits:
                cpu: 2
                memory: 4Gi
        terminationGracePeriodSeconds: 600
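
For reference, the listener serves these metrics on the `listenerAddr`/`listenerEndpoint` configured above; a scrape configuration along these lines is one way to get them into Prometheus (a sketch assuming the Prometheus Operator is installed; the selector labels and port name are placeholders and must match the listener pods in your cluster):

```yaml
# Hypothetical PodMonitor for the ARC listener pods. Selector labels and the
# port name are placeholders; adjust them to what your listener pods expose.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: arc-listener
  namespace: arc
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: gha-runner-scale-set   # placeholder label
  podMetricsEndpoints:
    - port: metrics   # placeholder port name for listenerAddr ':8080'
      path: /metrics  # matches listenerEndpoint above
```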

Controller Logs

N/A

Runner Pod Logs

N/A
github-actions[bot] commented 2 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

Lucas-Hughes commented 2 months ago

I get the same result from canceled runs or when the runner pods fail.

I implemented a bit of a hacky fix by putting a threshold in Grafana to ignore values above a certain limit, but I agree that it should be 0 for those runs.
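
A Prometheus-side equivalent of that Grafana workaround might look like the sketch below (not the commenter's actual configuration; the one-hour cutoff is arbitrary):

```yaml
# Hypothetical recording rule that drops obviously bogus samples (anything above
# one hour of cumulative duration) so dashboards and alerts built on this metric
# are not skewed by the inflated values from canceled jobs.
groups:
  - name: arc-job-duration-workaround
    rules:
      - record: arc:gha_job_execution_duration_seconds_sum:sane
        expr: gha_job_execution_duration_seconds_sum < 3600
```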

laserpedro commented 2 weeks ago

I get the same result, and like @Lucas-Hughes said, it seems to happen when jobs are cancelled. That's too bad, since this metric is super valuable: we can create alerts to detect slower-than-usual GitHub jobs.
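
The kind of alert described above could look roughly like this once the values are trustworthy (a sketch only: it assumes the metric is a standard Prometheus histogram with a matching `_count` series, and the 10-minute threshold, rule names, and labels are invented for illustration):

```yaml
# Hypothetical alert on average job duration, computed from the histogram's
# _sum and _count series; fires when jobs average more than 10 minutes over
# the last 30 minutes.
groups:
  - name: arc-slow-jobs
    rules:
      - alert: GitHubJobsSlowerThanUsual
        expr: |
          rate(gha_job_execution_duration_seconds_sum[30m])
            / rate(gha_job_execution_duration_seconds_count[30m])
          > 600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "GitHub Actions jobs are running slower than usual"
```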