actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

gha-runner-scale-set-controller fails to launch ephemeral runners due to a missing finalizer's setup #2485

Closed pearljago closed 1 year ago

pearljago commented 1 year ago

Checks

Controller Version

gha-runner-scale-set-controller-0.3.0

Helm Chart Version

0.3.0

CertManager Version

No response

Deployment Method

Helm

cert-manager installation

I followed the instructions at this link: https://github.com/actions/actions-runner-controller/tree/master/docs/preview/gha-runner-scale-set-controller

I know this is still in the "beta" stage; I just want to report a misconfiguration.

With this approach, cert-manager is not required.

Resource Definitions

apiVersion: actions.github.com/v1alpha1
kind: AutoscalingListener
metadata:
  resourceVersion: '49274'
  name: arc-runner-set-6cd58d58-listener
  uid: d248ca30-ef5e-4673-a077-057d6bbd2c90
  creationTimestamp: '2023-04-04T16:35:25Z'
  generation: 1
  namespace: arc-systems
  finalizers:
    - autoscalinglistener.actions.github.com/finalizer
  labels:
    auto-scaling-runner-set-name: arc-runner-set
    auto-scaling-runner-set-namespace: arc-systems
    runner-spec-hash: 799b5c9579
spec:
  autoscalingRunnerSetName: arc-runner-set
  autoscalingRunnerSetNamespace: arc-systems
  ephemeralRunnerSetName: arc-runner-set-wl4mp
  githubConfigSecret: arc-runner-set-gha-runner-scale-set-github-secret
  githubConfigUrl: 'https://github.com/santander-group-europe'
  image: 'ghcr.io/actions/gha-runner-scale-set-controller:0.3.0'
  maxRunners: 2147483647
  runnerScaleSetId: 3
---
kind: Deployment
apiVersion: apps/v1
metadata:
  annotations:
    deployment.kubernetes.io/revision: '1'
    meta.helm.sh/release-name: arc
    meta.helm.sh/release-namespace: arc-systems
  resourceVersion: '48036'
  name: arc-gha-runner-scale-set-controller
  generation: 1
  managedFields:
    - manager: helm
      operation: Update
      apiVersion: apps/v1
    - manager: kube-controller-manager
      operation: Update
      apiVersion: apps/v1
  namespace: arc-systems
  labels:
    app.kubernetes.io/instance: arc
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: gha-runner-scale-set-controller
    app.kubernetes.io/part-of: gha-runner-scale-set-controller
    app.kubernetes.io/version: 0.3.0
    helm.sh/chart: gha-runner-scale-set-controller-0.3.0
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: arc
      app.kubernetes.io/name: gha-runner-scale-set-controller
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: controller-manager
        app.kubernetes.io/instance: arc
        app.kubernetes.io/name: gha-runner-scale-set-controller
        app.kubernetes.io/part-of: actions-runner-controller
        app.kubernetes.io/version: 0.3.0
      annotations:
        kubectl.kubernetes.io/default-container: manager
    spec:
      restartPolicy: Always
      serviceAccountName: arc-gha-runner-scale-set-controller
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 10
      securityContext: {}
      containers:
        - resources: {}
          terminationMessagePath: /dev/termination-log
          name: manager
          command:
            - /manager
          env:
            - name: CONTROLLER_MANAGER_POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: CONTROLLER_MANAGER_POD_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: tmp
              mountPath: /tmp
          terminationMessagePolicy: File
          image: 'ghcr.io/actions/gha-runner-scale-set-controller:0.3.0'
          args:
            - '--auto-scaling-runner-set-only'
            - '--log-level=debug'
      serviceAccount: arc-gha-runner-scale-set-controller

To Reproduce

This was done on an OpenShift cluster:
CRC version: 2.15.0+72256c3c
OpenShift version: 4.12.5
Podman version: 4.3.1

1. Install the runner controller using Helm.

NAMESPACE="arc-systems"
helm install arc \
    --namespace "${NAMESPACE}" \
    --create-namespace \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
    --version 0.3.0

2. Deploy the autoscaling runner set using GitHub App authentication.

INSTALLATION_NAME="arc-runner-set"
NAMESPACE="arc-systems"
GITHUB_CONFIG_URL="https://github.com/my_organization"
GITHUB_APP_ID="xxxxxx"
GITHUB_APP_INSTALLATION_ID="xxxxxxx"
GITHUB_APP_PRIVATE_KEY=$(cat app.pem)
helm install arc-runner-set \
    --namespace "${NAMESPACE}" \
    --set githubConfigUrl="${GITHUB_CONFIG_URL}" \
    --set githubConfigSecret.github_app_id="${GITHUB_APP_ID}" \
    --set githubConfigSecret.github_app_installation_id="${GITHUB_APP_INSTALLATION_ID}" \
    --set githubConfigSecret.github_app_private_key="${GITHUB_APP_PRIVATE_KEY}" \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set --version 0.3.0

3. Run the following workflow to test ARC autoscaling.

name: Test scale workflow
on:
    workflow_dispatch:
jobs:
  test:
    runs-on: arc-runner-set
    steps:
    - name: Hello world
      run: echo "Hello world"

Describe the bug

After completing the previous steps, when I run the workflow (step 3), the ARC controller fails to create the ephemeral runner, so no runner pod is started. The controller log shows the following:

2023-04-05T08:48:43Z ERROR Reconciler error {"controller": "ephemeralrunnerset", "controllerGroup": "actions.github.com", "controllerKind": "EphemeralRunnerSet", "EphemeralRunnerSet": {"name":"arc-runner-set-96d29","namespace":"arc-systems"}, "namespace": "arc-systems", "name": "arc-runner-set-96d29", "reconcileID": "1d4e094d-06fa-475a-9f99-c42e54d96c5a", "error": "ephemeralrunners.actions.github.com \"arc-runner-set-96d29-runner-vjm58\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , "}
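For context, the error concerns the owner reference on the EphemeralRunner being created. Below is a rough sketch of what that metadata presumably looks like, assuming the controller sets a controller-style owner reference pointing at the EphemeralRunnerSet. This is not a dump from the cluster: the uid is a placeholder and the exact fields are an assumption; only the names are taken from the log line above.

```yaml
# Sketch only: assumed shape of the EphemeralRunner metadata, not actual cluster output.
apiVersion: actions.github.com/v1alpha1
kind: EphemeralRunner
metadata:
  name: arc-runner-set-96d29-runner-vjm58
  namespace: arc-systems
  ownerReferences:
    - apiVersion: actions.github.com/v1alpha1
      kind: EphemeralRunnerSet
      name: arc-runner-set-96d29
      uid: 00000000-0000-0000-0000-000000000000  # placeholder, real uid not captured
      controller: true
      blockOwnerDeletion: true  # the field named in the error message
```

When the API server enforces owner-reference permissions, setting blockOwnerDeletion requires the creating identity to have update permission on the finalizers subresource of the owner, here the EphemeralRunnerSet.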

After digging a bit, I found that the role definition arc-gha-runner-scale-set-controller-manager-role is missing a rule. The role needs to grant access to the finalizers of "ephemeralrunnersets".
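The exact snippet is not preserved in this copy of the issue. A rule of roughly the following shape would match the description, assuming the resource belongs to the actions.github.com API group (as in the apiVersion of the resources above) and that the update verb is sufficient. Treat it as an illustration rather than the actual patch.

```yaml
# Hypothetical extra entry under "rules:" in
# arc-gha-runner-scale-set-controller-manager-role.
# Assumptions: API group actions.github.com; "update" alone is enough.
- apiGroups:
    - actions.github.com
  resources:
    - ephemeralrunnersets/finalizers
  verbs:
    - update
```

Other verbs such as patch may also be wanted, depending on how the controller writes finalizers; that is not something this report confirms.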

After adding this configuration to the role, the error no longer appears and ARC behaves as expected.

Describe the expected behavior

After a workflow with the appropriate self-hosted label is executed, a new ephemeral runner should be launched in the namespace where the autoscaling runner controller and the listener pods are running.

Whole Controller Logs

2023-04-05T08:48:43Z    INFO    EphemeralRunnerSet  Ephemeral runner counts {"ephemeralrunnerset": "arc-systems/arc-runner-set-96d29", "pending": 0, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2023-04-05T08:48:43Z    INFO    EphemeralRunnerSet  Scaling comparison  {"ephemeralrunnerset": "arc-systems/arc-runner-set-96d29", "current": 0, "desired": 1}
2023-04-05T08:48:43Z    INFO    EphemeralRunnerSet  Creating new ephemeral runners (scale up)   {"ephemeralrunnerset": "arc-systems/arc-runner-set-96d29", "count": 1}
2023-04-05T08:48:43Z    INFO    EphemeralRunnerSet  Creating new ephemeral runner   {"ephemeralrunnerset": "arc-systems/arc-runner-set-96d29", "progress": 1, "total": 1}
2023-04-05T08:48:43Z    ERROR   EphemeralRunnerSet  failed to make ephemeral runner {"ephemeralrunnerset": "arc-systems/arc-runner-set-96d29", "error": "ephemeralrunners.actions.github.com \"arc-runner-set-96d29-runner-vjm58\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
github.com/actions/actions-runner-controller/controllers/actions%2egithub%2ecom.(*EphemeralRunnerSetReconciler).createEphemeralRunners
    github.com/actions/actions-runner-controller/controllers/actions.github.com/ephemeralrunnerset_controller.go:327
github.com/actions/actions-runner-controller/controllers/actions%2egithub%2ecom.(*EphemeralRunnerSetReconciler).Reconcile
    github.com/actions/actions-runner-controller/controllers/actions.github.com/ephemeralrunnerset_controller.go:189
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:235
2023-04-05T08:48:43Z    ERROR   EphemeralRunnerSet  failed to make ephemeral runner {"ephemeralrunnerset": "arc-systems/arc-runner-set-96d29", "error": "ephemeralrunners.actions.github.com \"arc-runner-set-96d29-runner-vjm58\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
github.com/actions/actions-runner-controller/controllers/actions%2egithub%2ecom.(*EphemeralRunnerSetReconciler).Reconcile
    github.com/actions/actions-runner-controller/controllers/actions.github.com/ephemeralrunnerset_controller.go:190
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:235

Whole Runner Pod Logs

The runner pod fails to start, so I am including the listener logs instead:

2023-04-05T08:46:34Z    INFO    getting Actions tenant URL and JWT  {"registrationURL": "https://api.github.com/actions/runner-registration"}
2023-04-05T08:46:36Z    INFO    auto_scaler current runner scale set statistics.    {"statistics": "{\"totalAvailableJobs\":0,\"totalAcquiredJobs\":0,\"totalAssignedJobs\":0,\"totalRunningJobs\":0,\"totalRegisteredRunners\":0,\"totalBusyRunners\":0,\"totalIdleRunners\":0}"}
2023-04-05T08:46:36Z    INFO    service waiting for message...
2023-04-05T08:48:35Z    INFO    service process message.    {"messageId": 1, "messageType": "RunnerScaleSetJobMessages"}
2023-04-05T08:48:35Z    INFO    service current runner scale set statistics.    {"available jobs": 1, "acquired jobs": 0, "assigned jobs": 0, "running jobs": 0, "registered runners": 0, "busy runners": 0, "idle runners": 0}
2023-04-05T08:48:35Z    INFO    service process batched runner scale set job messages.  {"messageId": 1, "batchSize": 1}
2023-04-05T08:48:35Z    INFO    service job available message received. {"RequestId": 17182}
2023-04-05T08:48:35Z    INFO    auto_scaler acquiring jobs. {"request count": 1, "requestIds": "[17182]"}
2023-04-05T08:48:35Z    INFO    auto_scaler acquired jobs.  {"requested": 1, "acquired": 1}
2023-04-05T08:48:35Z    INFO    auto_scaler deleted message.    {"messageId": 1}
2023-04-05T08:48:35Z    INFO    service waiting for message...
2023-04-05T08:48:42Z    INFO    service process message.    {"messageId": 2, "messageType": "RunnerScaleSetJobMessages"}
2023-04-05T08:48:42Z    INFO    service current runner scale set statistics.    {"available jobs": 0, "acquired jobs": 0, "assigned jobs": 1, "running jobs": 0, "registered runners": 0, "busy runners": 0, "idle runners": 0}
2023-04-05T08:48:42Z    INFO    service process batched runner scale set job messages.  {"messageId": 2, "batchSize": 1}
2023-04-05T08:48:42Z    INFO    service job assigned message received.  {"RequestId": 17182}
2023-04-05T08:48:42Z    INFO    auto_scaler acquiring jobs. {"request count": 0, "requestIds": "[]"}
2023-04-05T08:48:42Z    INFO    service try scale runner request up/down base on assigned job count {"assigned job": 1, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 0}
2023-04-05T08:48:42Z    INFO    KubernetesManager   Created merge patch json for EphemeralRunnerSet update  {"json": "{\"spec\":{\"replicas\":1}}"}
2023-04-05T08:48:42Z    INFO    KubernetesManager   Ephemeral runner set scaled.    {"namespace": "arc-systems", "name": "arc-runner-set-96d29", "replicas": 1}
2023-04-05T08:48:43Z    INFO    auto_scaler deleted message.    {"messageId": 2}
2023-04-05T08:48:43Z    INFO    service waiting for message...

Additional Context

No response

nikola-jokic commented 1 year ago

Hey @pearljago,

Thank you for reporting this! We will investigate this issue and get back to you :relaxed:

pearljago commented 1 year ago

Thanks @nikola-jokic

I forgot to mention that this only happens on an OpenShift cluster. I tried the same steps on a minikube cluster and it works as expected.