actions/runner-container-hooks: Runner Container Hooks for GitHub Actions

Workflow pods are still not scaling out. Are they working well for others? #121

Closed dongho-jung closed 8 months ago

dongho-jung commented 8 months ago

Despite the addition of ACTIONS_RUNNER_USE_KUBE_SCHEDULER in #111, which allows runner pods and workflow pods to be scheduled on separate nodes, both pods still reference the same volume claim, so the workflow pod gets stuck in the ContainerCreating state because of a multi-attach error on the PVC.

As a result, scaling out when a job is assigned is still not feasible. Am I approaching this incorrectly? How have others solved this issue?

If there's no standard way, I plan to fork it and add an option like useHostVolume so it can work without a PVC.
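For illustration, here is a minimal sketch of the kind of volume spec such a hypothetical useHostVolume option could substitute for the PVC-backed work volume; the path and names below are assumptions, not something the hook supports today:

# hypothetical replacement for the PVC-backed "work" volume
volumes:
  - name: work
    hostPath:
      path: /var/lib/actions-runner/_work   # assumed node-local directory
      type: DirectoryOrCreate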

dongho-jung commented 8 months ago

I tried using a hostPath volume by forking and building it myself. It launches fine, but there's a significant issue: the runner pod writes the script that executes the actual step, and since the runner and job pods have separate volumes, the job pod can't find the step script (named in the format ${uuid4()}.sh). As a result, I am considering using NFS. If anyone knows a better solution, please let me know.
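If the NFS route works out, a rough sketch of a shared work volume could look like the following; the server address, export path, namespace, and object names are placeholders, and the point is simply that a ReadWriteMany volume lets the runner and job pods see the same _work directory:

# sketch only: NFS-backed PV/PVC mountable ReadWriteMany by runner and job pods
apiVersion: v1
kind: PersistentVolume
metadata:
  name: runner-work-nfs
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteMany"]
  nfs:
    server: 10.0.0.10            # assumed NFS server
    path: /exports/runner-work   # assumed export
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: runner-work-nfs
  namespace: arc-runners         # assumed runner namespace
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""           # bind to the static PV above
  resources:
    requests:
      storage: 10Gi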

dongho-jung commented 8 months ago

I've just come back to using dind mode and am doing image caching with kube-fledged. It's much better. I wonder if there is anyone using k8s mode who has autoscaling working?

Firstly, without ACTIONS_RUNNER_USE_KUBE_SCHEDULER, node scale-out won't work at all, and even with ACTIONS_RUNNER_USE_KUBE_SCHEDULER, wouldn't there be problems because the runner and workflow pods don't share the same volume?

Well, I'm not looking at k8s mode anymore, but I'm still curious...
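For anyone wanting to try the same dind + kube-fledged setup, a rough sketch is below; switching containerMode to dind is standard gha-runner-scale-set configuration, while the ImageCache resource assumes kube-fledged's kubefledged.io/v1alpha2 API and uses an example image list:

# gha-runner-scale-set values: run jobs in Docker-in-Docker instead of k8s mode
containerMode:
  type: dind
---
# kube-fledged ImageCache to pre-pull images onto runner nodes
# (API group/version assumed; adjust to the kube-fledged release you install)
apiVersion: kubefledged.io/v1alpha2
kind: ImageCache
metadata:
  name: runner-image-cache
  namespace: kube-fledged
spec:
  cacheSpec:
  - images:
    - ghcr.io/actions/actions-runner:latest   # example image to cache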

israel-morales commented 8 months ago

We were able to get k8s mode to work reliably by ensuring that each runner pod, together with the workflow pod that shares its ReadWriteOnce work volume, is scheduled on its own node.

We accomplished this with anti-affinity rules; without them, we ran into resource issues as described here.

The con of this setup is that it's inefficient: we end up scheduling a new node for every workflow run.

ACTIONS_RUNNER_USE_KUBE_SCHEDULER wasn't an option for us, due to excessive costs associated with RWX in GKE.
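For context, turning that flag on would mean giving the work volume a ReadWriteMany access mode, which on GKE generally means a Filestore-backed storage class; a minimal sketch of what the claim template would become is below (the storage class name is assumed, and basic Filestore instances have a large minimum size, which is where the cost comes from):

volumes:
  - name: work
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteMany"]
          storageClassName: standard-rwx   # assumed GKE Filestore CSI class
          resources:
            requests:
              storage: 1Ti                 # Filestore minimum sizes are large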

Here is our code for reference, hope it helps:

values.yaml

# https://github.com/actions/actions-runner-controller/blob/gha-runner-scale-set-0.6.1/charts/gha-runner-scale-set/values.yaml

runnerScaleSetName: gke-runner-default
githubConfigUrl: https://github.com/org
githubConfigSecret: github-runner-app-secret
minRunners: 0
maxRunners: 10

## ref to parent gh controller
controllerServiceAccount:
  namespace: gh-actions-controller
  name: gha-runner-scale-set-controller-gha-rs-controller

containerMode:
  type: kubernetes

template:
  spec:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: actions.github.com/scale-set-name
              operator: In
              values:
              - gke-runner-default
          topologyKey: kubernetes.io/hostname
    securityContext:
      fsGroup: 1001
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      env:
        - name: ACTIONS_RUNNER_CONTAINER_HOOKS
          value: /home/runner/k8s/index.js
        - name: ACTIONS_RUNNER_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER   # Flag to allow direct runner tasks
          value: "false"
        - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
          value: /home/runner/pod-templates/default.yaml
        - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER     # Flag enables separate scheduling for worker pods
          value: "false"
      # These resources define the RUNNER pod
      resources:
        limits:
          cpu: 1000m
          memory: 2000Mi
        requests:
          cpu: 1000m
          memory: 256Mi
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: pod-templates
          mountPath: /home/runner/pod-templates
          readOnly: true
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: [ "ReadWriteOnce" ]
              resources:
                requests:
                  storage: 8Gi
      - name: pod-templates
        configMap:
          name: pod-templates

pod-template.yaml

# pod templates to apply to gh action job containers
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-templates
  namespace: gh-actions-runners
data:
  default.yaml: |
    ---
    apiVersion: v1
    kind: PodTemplate
    metadata:
      name: runner-pod-template
      labels:
        app: runner-pod-template
    spec:
      securityContext:
        fsGroup: 123  # provides access to /home/runner/_work directory in ephemeral volume
      containers:
      - name: $job
        resources:
          requests:
            cpu: 250m
            memory: 256Mi

hbenzekri commented 8 months ago

How were you able to test the behavior when setting ACTIONS_RUNNER_USE_KUBE_SCHEDULER to true? I thought actions/runner hadn't yet released a new image that includes runner-container-hooks 0.5.0.

dongho-jung commented 8 months ago

How were you able to test the behavior when setting ACTIONS_RUNNER_USE_KUBE_SCHEDULER to true? I thought actions/runner hadn't yet released a new image that includes runner-container-hooks 0.5.0.

I forked it, built it, and published it manually.

nikola-jokic commented 8 months ago

Hey everyone,

Sorry for the late response. The issue is that the 0.5.0 release is not yet part of the runner. We are waiting for the next runner release to include this hook version, so we can avoid re-publishing the runner image with the same runner version but a different hook version. We should improve our release cycle so we don't go for long periods with a runner that lacks the latest version of the hook...