chrisforrette opened this issue 1 year ago
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
This request is interesting, but I doubt it will work the way you described. When ARC scales up, the number of runners doesn't determine which job each runner will process. For example, when a GitHub Actions workflow with five jobs is triggered, it sends five workflow_job events, and ARC scales the number of runners up by 5. After a runner registers with GitHub, it starts polling GitHub for a job, so the job that triggered the creation of a runner will not necessarily be the job that runner processes.
But if you need to know which jobs are being processed by each runner, you can check the runner.statusUpdateHook feature, which updates status.Message and status.Status with data about what happens inside the runner. Also, if you want, check this PR, which adds the workflow name and the repository.
FWIW
I also see log.V(1) and log.V(2) calls above and I've never seen that style of logger before so I'm not sure what those mean.
I often see this log-level-as-number pattern in K8s-related open-source projects. In ARC, the numbers map to log levels as follows:

- Trace: `log.V(2).Info` = `--log-level=-2`
- Debug: `log.V(1).Info` = `--log-level=debug` or `--log-level=-1`
- Info: `log.Info` = `--log-level=info` or `--log-level=0`
- Error: `log.Error` = `--log-level=error` or `--log-level=2`
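If it helps, the gist of the logr V-level convention can be sketched in a few lines of plain Go. This is a simplified emulation for illustration only; ARC actually uses go-logr backed by zap, not this code:

```go
package main

import "fmt"

// Simplified emulation of the logr V-level convention: a message logged at
// V(n) is emitted only when the configured verbosity is >= n. The flag
// mapping in the comment mirrors ARC's --log-level values.
type Logger struct {
	verbosity int // e.g. --log-level=debug -> 1, --log-level=-2 -> 2 (trace)
	level     int // set by V(n); a plain Info call logs at level 0
}

// V returns a logger whose messages require at least verbosity n.
func (l Logger) V(n int) Logger { l.level += n; return l }

// Enabled reports whether a message at this logger's level would be emitted.
func (l Logger) Enabled() bool { return l.level <= l.verbosity }

func (l Logger) Info(msg string) {
	if l.Enabled() {
		fmt.Printf("V(%d): %s\n", l.level, msg)
	}
}

func main() {
	log := Logger{verbosity: 1} // "debug" verbosity
	log.Info("info message")       // emitted
	log.V(1).Info("debug message") // emitted
	log.V(2).Info("trace message") // suppressed at debug verbosity
}
```

So with the default info verbosity, the `log.V(1)` and `log.V(2)` calls you saw simply produce no output.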
@chrisforrette I may be missing something, but what versions of ARC and the chart are you using?
You seem to have declared the use of the json log format in your values.yaml, and I think you're saying that ARC is printing `2022-11-30T22:50:00Z INFO controllers.webhookbasedautoscaler scaled gha-runner-autoscaler-small by -1 on scale`, right? But that log message doesn't look JSON-formatted.
It would be great if you could share your HRA and RunnerDeployment spec to make everything clear.
@Moser-ss
For example, when a GitHub Actions workflow with five jobs is triggered, it sends five workflow_job events, and ARC scales the number of runners up by 5. After a runner registers with GitHub, it starts polling GitHub for a job, so the job that triggered the creation of a runner will not necessarily be the job that runner processes.
This is a super helpful explanation! It makes sense why tracing is tricky; I didn't realize there was polling on the ARC side even with the webhook-based scaling. Thanks for this!
@mumoshu Thanks for explaining the log levels!
I too am confused by the log formatting. Here's the version we have installed, which I think is the latest:
```
$ helm list -n actions-runner-system
NAME                       NAMESPACE              REVISION  UPDATED                                 STATUS    CHART                             APP VERSION
actions-runner-controller  actions-runner-system  7         2022-12-16 11:26:41.869764 -0500 -0500  deployed  actions-runner-controller-0.21.1  0.26.0
```
I thought maybe the config might be malformed but it looks OK...
Here's a `describe` for an HRA:
```
Name:         gha-runner-autoscaler-small
Namespace:    actions-runner-system
Labels:       <none>
Annotations:  <none>
API Version:  actions.summerwind.dev/v1alpha1
Kind:         HorizontalRunnerAutoscaler
Metadata:
  Creation Timestamp:  2022-12-14T17:17:13Z
  Generation:          892
  Managed Fields:
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:maxReplicas:
        f:metrics:
        f:minReplicas:
        f:scaleDownDelaySecondsAfterScaleOut:
        f:scaleTargetRef:
          f:kind:
          f:name:
        f:scaleUpTriggers:
    Manager:      Terraform
    Operation:    Apply
    Time:         2022-12-16T15:50:29Z
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:desiredReplicas:
        f:lastSuccessfulScaleOutTime:
    Manager:      manager
    Operation:    Update
    Subresource:  status
    Time:         2022-12-15T16:47:41Z
  Resource Version:  4721762
  UID:               d3e73d1a-5cac-40c2-992a-c027170fae2f
Spec:
  Max Replicas:  100
  Metrics:
    Scale Down Factor:     0.5
    Scale Down Threshold:  0.25
    Scale Up Factor:       2
    Scale Up Threshold:    0.75
    Type:                  PercentageRunnersBusy
  Min Replicas:  1
  Scale Down Delay Seconds After Scale Out:  120
  Scale Target Ref:
    Kind:  RunnerDeployment
    Name:  gha-runner-small
  Scale Up Triggers:
    Duration:  30m0s
    Github Event:
      Workflow Job:
Status:
  Desired Replicas:                5
  Last Successful Scale Out Time:  2022-12-22T18:57:35Z
Events:  <none>
```
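For readability, here's the same HRA reconstructed as apply-able YAML (best-effort from the `describe` output; we actually manage it via Terraform, so take the exact formatting as an approximation):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: gha-runner-autoscaler-small
  namespace: actions-runner-system
spec:
  minReplicas: 1
  maxReplicas: 100
  scaleDownDelaySecondsAfterScaleOut: 120
  scaleTargetRef:
    kind: RunnerDeployment
    name: gha-runner-small
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: "0.75"
    scaleDownThreshold: "0.25"
    scaleUpFactor: "2"
    scaleDownFactor: "0.5"
  scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
    duration: "30m0s"
```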
And here's a `describe` for the related `RunnerDeployment`:
```
Name:         gha-runner-small
Namespace:    actions-runner-system
Labels:       <none>
Annotations:  <none>
API Version:  actions.summerwind.dev/v1alpha1
Kind:         RunnerDeployment
Metadata:
  Creation Timestamp:  2022-12-14T17:17:12Z
  Generation:          2613
  Managed Fields:
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:template:
          f:metadata:
            f:annotations:
              f:ad.datadoghq.com/tags:
            f:labels:
              f:tags.datadoghq.com/env:
              f:tags.datadoghq.com/service:
          f:spec:
            f:dockerEnabled:
            f:env:
            f:ephemeral:
            f:image:
            f:labels:
            f:organization:
            f:resources:
              f:limits:
                f:cpu:
                f:ephemeral-storage:
                f:memory:
              f:requests:
                f:cpu:
                f:ephemeral-storage:
                f:memory:
            f:securityContext:
              f:fsGroup:
            f:serviceAccountName:
            f:volumeSizeLimit:
    Manager:      Terraform
    Operation:    Apply
    Time:         2022-12-16T15:50:27Z
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:availableReplicas:
        f:desiredReplicas:
        f:readyReplicas:
        f:replicas:
        f:updatedReplicas:
    Manager:      manager
    Operation:    Update
    Subresource:  status
    Time:         2022-12-14T17:17:12Z
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:effectiveTime:
        f:replicas:
    Manager:      manager
    Operation:    Update
    Time:         2022-12-15T16:47:40Z
  Resource Version:  4722768
Spec:
  Effective Time:  2022-12-22T18:55:44Z
  Replicas:        1
  Selector:        <nil>
  Template:
    Metadata:
      Annotations:
        ad.datadoghq.com/tags:  {"github_workflow_labels":["small"]}
      Labels:
        tags.datadoghq.com/env:      internal
        tags.datadoghq.com/service:  github-actions-runners
    Spec:
      Docker Enabled:  true
      Dockerd Container Resources:
      Env:
        Name:   DISABLE_RUNNER_UPDATE
        Value:  true
      Ephemeral:  true
      Image:      summerwind/actions-runner:latest
      Labels:
        small
      Organization:  ***
      Resources:
        Limits:
          Cpu:                  1250m
          Ephemeral - Storage:  40Gi
          Memory:               2250Mi
        Requests:
          Cpu:                  1
          Ephemeral - Storage:  40Gi
          Memory:               2Gi
      Security Context:
        Fs Group:  1000
      Service Account Name:  ci-default
      Volume Size Limit:     40Gi
Status:
  Available Replicas:  1
  Desired Replicas:    1
  Ready Replicas:      1
  Replicas:            1
  Updated Replicas:    1
Events:  <none>
```
Let me know if any other information might be helpful. Thanks again!
@chrisforrette Just for clarity, it is not ARC that is doing the polling. ARC creates pods that have the GitHub runner inside. The runner then behaves the same whether it runs inside a container, a VM, or on bare metal: it connects to GitHub and starts polling for events, as described in the documentation
I didn't realize there was polling on the ARC side even with the webhook-based scaling
Hello 👋 thanks for building this, it has been pretty awesome and helpful!
What would you like added?
I have this configured with the webhook service running and the horizontal autoscaler to run ephemeral runner pods. By default, I ship all service logs to Datadog and filter out the "debug" level, so for the webhook service, I just see logs like this over and over:
And I would love to see the webhook payload details on this log. I would think that the configuration here: https://github.com/actions-runner-controller/actions-runner-controller/blob/master/controllers/horizontal_runner_autoscaler_webhook.go#L180
...would include those attributes with the log line here: https://github.com/actions-runner-controller/actions-runner-controller/blob/master/controllers/horizontal_runner_autoscaler_webhook.go#L296-L298
...but it doesn't.
I also see `log.V(1)` and `log.V(2)` calls above, and I've never seen that style of logger before, so I'm not sure what those mean. In any case, this may be a bug, or it may be by design? If it isn't intentional, I'd be happy to take a stab at a pull request.
Additionally, it would be amazing to apply those same attributes to the pod handling the action somehow, but that seems like a whole other can of worms.
Why is this needed?
I'm finding it really difficult to connect the chain of requests from a triggered GitHub Action to the pod handling that action, and having these attributes on a successfully routed event would help connect the dots.
Additional context
In case it may be of any help, I've installed via the Helm chart with the following configuration: