actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0
4.66k stars 1.1k forks source link

Add webhook payload detail to info logs #2060

Open chrisforrette opened 1 year ago

chrisforrette commented 1 year ago

Hello 👋 thanks for building this, it has been pretty awesome and helpful!

What would you like added?

I have this configured with the webhook service running and the horizontal autoscaler to run ephemeral runner pods. By default, I ship all service logs to Datadog and filter out the "debug" level, so for the webhook service, I just see logs like this over and over:

2022-11-30T22:50:00Z    INFO    controllers.webhookbasedautoscaler  scaled gha-runner-autoscaler-small by -1

And I would love to see the webhook payload details on this log. I would think that the configuration here: https://github.com/actions-runner-controller/actions-runner-controller/blob/master/controllers/horizontal_runner_autoscaler_webhook.go#L180

...would include those attributes with the log line here: https://github.com/actions-runner-controller/actions-runner-controller/blob/master/controllers/horizontal_runner_autoscaler_webhook.go#L296-L298

...but it doesn't.

I also see log.V(1) and log.V(2) calls above and I've never seen that style of logger before so I'm not sure what those mean.

In any case, this may me a bug, or it may be by design? If it isn't necessarily intentional, I'd be happy to take a stab at a pull request.

Additionally, it would be amazing to apply those same attributes to the pod handling the action somehow, but that seems like a whole other can of worms.

Why is this needed?

I'm finding it really difficult to connect the chain of requests from a triggered Github Action to a pod handling that action, and having these attributes on a successfully routed event would be helpful to connect the dots.

Additional context

In case it may be of any help, I've installed via the Helm chart with the following configuration:

resources:
  requests:
    cpu: "0.5"
    memory: "1Gi"
  limits:
    cpu: "0.5"
    memory: "1Gi"

podLabels:
  tags.datadoghq.com/env: internal
  tags.datadoghq.com/service: actions-runner-controller

logFormat: "json"

githubWebhookServer:
  enabled: true
  secret:
    enabled: true
    create: true
    name: "gha-webhook-secret"
    github_webhook_secret_token: "..."
  service:
    ports:
      - port: 443
        targetPort: http
        protocol: TCP
        name: https
  resources:
    requests:
      cpu: "0.5"
      memory: "1Gi"
    limits:
      cpu: "0.5"
      memory: "1Gi"
  podLabels:
    tags.datadoghq.com/env: internal
    tags.datadoghq.com/service: actions-runner-webhook
  logFormat: "json"
github-actions[bot] commented 1 year ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

Moser-ss commented 1 year ago

This request is interesting, but I doubt it will work in the way you exposed it. When ARC scales up, the number of runners doesn't define which job will process by the runner. For example, a GitHub Action Workflow with five jobs is triggered, it will send five workflow_job events, and ARC will scale the number of runners by 5. After the runner registry in GitHub, it will start pooling GitHub to get a job, so the job that triggers the creation of the runner will not be a job processed by that runner.

But if you need to understand which jobs are being processed in each runner, you can check the runner.statusUpdateHook features that update the status.Message and status.Status with the data related to happens inside of the runner

Also, if you want, check this PR where adds the workflow name and the repository

mumoshu commented 1 year ago

FWIW

I also see log.V(1) and log.V(2) calls above and I've never seen that style of logger before so I'm not sure what those mean.

I often see this log-level-as-number pattern in K8s related open-source projects. In ARC, numbers map to log levels like the following:

Trace: log.V(2).Info = --log-level=-2 Debug: log.V(1).Info = --log-level=debug or --log-level=-1 Info: log.Info = --log-level=info or --log-level=0 Error = log.Error = --log-level=error or --log-level=2

mumoshu commented 1 year ago

@chrisforrette I may be missing something, but what versions of ARC and the chart are you using?

You seem to have declared the use of the json log format in your values.yaml and I think you're saying that ARC is printing 2022-11-30T22:50:00Z INFO controllers.webhookbasedautoscaler scaled gha-runner-autoscaler-small by -1 on scale, right? But the log message doesn't seem JSON-formatted.

It would be great if you could share your HRA and RunnerDeployment spec to make everything clear.

chrisforrette commented 1 year ago

@Moser-ss

For example, a GitHub Action Workflow with five jobs is triggered, it will send five workflow_job events, and ARC will scale the number of runners by 5. After the runner registry in GitHub, it will start pooling GitHub to get a job, so the job that triggers the creation of the runner will not be a job processed by that runner.

This is a super helpful explanation! Makes sense why tracing is tricky—I didn't realize there was polling on the ARC side even with the webhook-based scaling. Thanks for this!

@mumoshu Thanks for explaining the log levels!

I too am confused by the log formatting. Here's the version we have installed, which I think is the latest:

helm list -n actions-runner-system
NAME                        NAMESPACE               REVISION    UPDATED                                 STATUS      CHART                               APP VERSION
actions-runner-controller   actions-runner-system   7           2022-12-16 11:26:41.869764 -0500 -0500  deployed    actions-runner-controller-0.21.1    0.26.0 

I thought maybe the config might be malformed but it looks OK...

Here's a describe for an HRA:

Name:         gha-runner-autoscaler-small
Namespace:    actions-runner-system
Labels:       <none>
Annotations:  <none>
API Version:  actions.summerwind.dev/v1alpha1
Kind:         HorizontalRunnerAutoscaler
Metadata:
  Creation Timestamp:  2022-12-14T17:17:13Z
  Generation:          892
  Managed Fields:
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:maxReplicas:
        f:metrics:
        f:minReplicas:
        f:scaleDownDelaySecondsAfterScaleOut:
        f:scaleTargetRef:
          f:kind:
          f:name:
        f:scaleUpTriggers:
    Manager:      Terraform
    Operation:    Apply
    Time:         2022-12-16T15:50:29Z
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:desiredReplicas:
        f:lastSuccessfulScaleOutTime:
    Manager:         manager
    Operation:       Update
    Subresource:     status
    Time:            2022-12-15T16:47:41Z
  Resource Version:  4721762
  UID:               d3e73d1a-5cac-40c2-992a-c027170fae2f
Spec:
  Max Replicas:  100
  Metrics:
    Scale Down Factor:                       0.5
    Scale Down Threshold:                    0.25
    Scale Up Factor:                         2
    Scale Up Threshold:                      0.75
    Type:                                    PercentageRunnersBusy
  Min Replicas:                              1
  Scale Down Delay Seconds After Scale Out:  120
  Scale Target Ref:
    Kind:  RunnerDeployment
    Name:  gha-runner-small
  Scale Up Triggers:
    Duration:  30m0s
    Github Event:
      Workflow Job:
Status:
  Desired Replicas:                5
  Last Successful Scale Out Time:  2022-12-22T18:57:35Z
Events:                            <none>

And here's a describe for the related RunnerDeployment:

Name:         gha-runner-small
Namespace:    actions-runner-system
Labels:       <none>
Annotations:  <none>
API Version:  actions.summerwind.dev/v1alpha1
Kind:         RunnerDeployment
Metadata:
  Creation Timestamp:  2022-12-14T17:17:12Z
  Generation:          2613
  Managed Fields:
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:template:
          f:metadata:
            f:annotations:
              f:ad.datadoghq.com/tags:
            f:labels:
              f:tags.datadoghq.com/env:
              f:tags.datadoghq.com/service:
          f:spec:
            f:dockerEnabled:
            f:env:
            f:ephemeral:
            f:image:
            f:labels:
            f:organization:
            f:resources:
              f:limits:
                f:cpu:
                f:ephemeral-storage:
                f:memory:
              f:requests:
                f:cpu:
                f:ephemeral-storage:
                f:memory:
            f:securityContext:
              f:fsGroup:
            f:serviceAccountName:
            f:volumeSizeLimit:
    Manager:      Terraform
    Operation:    Apply
    Time:         2022-12-16T15:50:27Z
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:availableReplicas:
        f:desiredReplicas:
        f:readyReplicas:
        f:replicas:
        f:updatedReplicas:
    Manager:      manager
    Operation:    Update
    Subresource:  status
    Time:         2022-12-14T17:17:12Z
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:effectiveTime:
        f:replicas:
    Manager:         manager
    Operation:       Update
    Time:            2022-12-15T16:47:40Z
  Resource Version:  4722768
Spec:
  Effective Time:  2022-12-22T18:55:44Z
  Replicas:        1
  Selector:        <nil>
  Template:
    Metadata:
      Annotations:
        ad.datadoghq.com/tags:  {"github_workflow_labels":["small"]}
      Labels:
        tags.datadoghq.com/env:      internal
        tags.datadoghq.com/service:  github-actions-runners
    Spec:
      Docker Enabled:  true
      Dockerd Container Resources:
      Env:
        Name:     DISABLE_RUNNER_UPDATE
        Value:    true
      Ephemeral:  true
      Image:      summerwind/actions-runner:latest
      Labels:
        small
      Organization:  ***
      Resources:
        Limits:
          Cpu:                  1250m
          Ephemeral - Storage:  40Gi
          Memory:               2250Mi
        Requests:
          Cpu:                  1
          Ephemeral - Storage:  40Gi
          Memory:               2Gi
      Security Context:
        Fs Group:            1000
      Service Account Name:  ci-default
      Volume Size Limit:     40Gi
Status:
  Available Replicas:  1
  Desired Replicas:    1
  Ready Replicas:      1
  Replicas:            1
  Updated Replicas:    1
Events:                <none>

Let me know if any other information might be helpful. Thanks again!

Moser-ss commented 1 year ago

@chrisforrette Just for clarity, it is not ARC that is doing the pooling. ARC creates pods that have the GitHub Runner inside. Then the runner behaves independently if it runs inside a container, VM, or bare metal. It will connect to GitHub and start polling events from GitHub like it is described in the documentation

I didn't realize there was polling on the ARC side even with the webhook-based scaling