actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0
4.4k stars 1.04k forks source link

Workflow Job run duration metrics are always "+Inf" #3522

Open jeremydonahue opened 1 month ago

jeremydonahue commented 1 month ago

Checks

Controller Version

v0.27.6

Helm Chart Version

0.23.7

CertManager Version

v1.10.1

Deployment Method

Helm

cert-manager installation

Not applicable to this issue but is installed correctly.

Checks

Resource Definitions

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: arc-system
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: actions-runner-controller
    app.kubernetes.io/version: 0.27.6
    core.zr.org/team: core
    helm.sh/chart: actions-runner-controller-0.23.7
  name: arc-system-actions-runner-controller-actions-metrics-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: arc-system-actions-metrics-server
      app.kubernetes.io/name: actions-runner-controller
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: arc-system-actions-metrics-server
        app.kubernetes.io/name: actions-runner-controller
    spec:
      containers:
      - args:
        - --metrics-addr=127.0.0.1:8080
        - --log-format=json
        command:
        - /actions-metrics-server
        env:
        - name: GITHUB_WEBHOOK_SECRET_TOKEN
          valueFrom:
            secretKeyRef:
              key: github_webhook_secret_token
              name: app-secrets
              optional: true
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              key: github_token
              name: app-secrets
              optional: true
        - name: GITHUB_APP_ID
          valueFrom:
            secretKeyRef:
              key: github_app_id
              name: app-secrets
              optional: true
        - name: GITHUB_APP_INSTALLATION_ID
          valueFrom:
            secretKeyRef:
              key: github_app_installation_id
              name: app-secrets
              optional: true
        - name: GITHUB_APP_PRIVATE_KEY
          valueFrom:
            secretKeyRef:
              key: github_app_private_key
              name: app-secrets
              optional: true
        - name: GITHUB_BASICAUTH_PASSWORD
          valueFrom:
            secretKeyRef:
              key: github_basicauth_password
              name: app-secrets
              optional: true
        image: summerwind/actions-runner-controller:v0.27.6
        name: actions-metrics-server
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 1
            memory: 1Gi
      - args:
        - --secure-listen-address=0.0.0.0:8443
        - --upstream=http://127.0.0.1:8080/
        - --logtostderr=true
        - --v=10
        image: quay.io/brancz/kube-rbac-proxy:v0.13.1
        name: kube-rbac-proxy
        ports:
        - containerPort: 8443
          name: metrics-port
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 1
            memory: 1Gi
      serviceAccountName: arc-system-actions-runner-controller-actions-metrics-server
      terminationGracePeriodSeconds: 10

To Reproduce

1. Enable actions metrics server
2. Configure webhook to send events to metrics server
3. View metrics

Describe the bug

Workflow Job run duration metrics are always in the +Inf bucket due to a bug in the code used to determine the run time. It is relying on the following code to get the queued and started time:

    var (
        queuedTime    time.Time
        startedTime   time.Time
        completedTime time.Time
    )

...

            if strings.HasPrefix(line, "Waiting for a runner to pick up this job...") {
                queuedTime, _ = time.Parse(time.RFC3339, timestamp)
                continue
            }

            if strings.HasPrefix(line, "Job is about to start running on the runner:") {
                startedTime, _ = time.Parse(time.RFC3339, timestamp)
                continue
            }

            // Last line in the log will count as the completed time
            completedTime, _ = time.Parse(time.RFC3339, timestamp)

And then it subtracts the started time from the completion time to calculate the run duration of the job. However, the "Waiting for a runner to pick up this job..." and "Job is about to start running on the runner:" lines only exist in the job output until the job is completed, at which point Github removes from from the output. When those lines aren't present, completedTime is correct but startedTime is the zero value for time.Time, which means you end up with a run duration of eg. 2562047h47m16.854775807s. See https://go.dev/play/p/fPfsc6Wp-gm for example.

Describe the expected behavior

The job duration metric should be correct.

Whole Controller Logs

I know it says not to omit this, but they are not relevant here and they contain repository information I'd rather not be public. There are no errors in the logs. Sorry. Let me know if you really want these and I'll find a private way to get them to you.

Whole Runner Pod Logs

N/A, see above.

Additional Context

Here are the job logs from the same job before and after the job is complete:

Before job is complete:

2024-05-15T20:02:25.0874171Z Requested labels: self-hosted, cicd
2024-05-15T20:02:25.0874499Z Job defined at: <omitted>
2024-05-15T20:02:25.0874598Z Waiting for a runner to pick up this job...
2024-05-15T20:02:25.2431948Z Job is about to start running on the runner: runner-cicd-tdrg7-2fcbn (organization)

After job is complete:

2024-05-15T20:02:29.8079838Z Current runner version: '2.316.1'
2024-05-15T20:02:29.8086381Z Runner name: 'runner-cicd-tdrg7-2fcbn'
2024-05-15T20:02:29.8087297Z Runner group name: 'Default'
2024-05-15T20:02:29.8088200Z Machine name: 'runner-cicd-tdrg7-2fcbn'
2024-05-15T20:02:29.8091270Z ##[group]GITHUB_TOKEN Permissions
2024-05-15T20:02:29.8093123Z Contents: read
2024-05-15T20:02:29.8093671Z Metadata: read
2024-05-15T20:02:29.8094172Z Packages: read
2024-05-15T20:02:29.8094649Z ##[endgroup]
2024-05-15T20:02:29.8098048Z Secret source: Actions
2024-05-15T20:02:29.8098758Z Prepare workflow directory
2024-05-15T20:02:29.9653636Z Prepare all required actions
2024-05-15T20:02:29.9843545Z Getting action download info
...

~I'll be submitting a PR to fix this shortly~ PR which fixes the issue: https://github.com/actions/actions-runner-controller/pull/3532

github-actions[bot] commented 1 month ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.