actions / runner-container-hooks

Runner Container Hooks for GitHub Actions

Add option to use the kubernetes scheduler for workflow pods #111

Closed Wielewout closed 10 months ago

Wielewout commented 10 months ago

As described in https://github.com/actions/runner-container-hooks/issues/112, workflow pods don't go through the kubernetes scheduler, which results in failed "initialize containers" steps when the already selected node is out of resources. With the following changes I was able to get this working.

This PR adds two new environment variables: ACTIONS_RUNNER_USE_KUBE_SCHEDULER and ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS.

When set to true, the former skips setting the node name and lets the kubernetes scheduler handle the workflow pod. This only works on a single-node cluster (important for the newly added test) or when ReadWriteMany volumes are used.

The latter can be used to increase the timeout from the default 10 minutes to, for example, 6 hours. This is necessary for scenarios where the cluster runs out of resources and we want to handle it more gracefully: when using the kubernetes scheduler, the pod remains pending until resources are available somewhere in the cluster.
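For illustration, a minimal sketch of how the two variables could be set on the runner container, assuming ARC's gha-runner-scale-set Helm chart in kubernetes container mode (only the two variable names come from this PR; the rest is illustrative):

```yaml
# Hypothetical values.yaml fragment; adjust to your own scale set.
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER
            value: "true"
          - name: ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS
            value: "21600"  # 6 hours instead of the default 10 minutes
```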

A quick fix was also added to skip the name override warning, which always appears when using the pod template.

Wielewout commented 10 months ago

@nikola-jokic addressed the comments.

I had the idea yesterday that this could maybe be taken further with pod affinity (in a separate PR?). ReadWriteOnce volumes will still fail early, as the workflow pod always gets the node name assigned. But we could go through the scheduler and still land on the same node by using pod affinity to attract the workflow pod to the runner pod.

The remaining problem is that a unique label would have to be added to the runner pod (e.g. actions.github.com/runner-name=<ephemeral-runner-name-also-pod-name>) by actions-runner-controller when an ephemeral runner creates its pod. The hook can access the name via the environment (ACTIONS_RUNNER_POD_NAME) and inject it into the affinity expression (which then needs to go under requiredDuringSchedulingIgnoredDuringExecution for ReadWriteOnce volumes). The kube scheduler can then use the label to schedule the workflow pod on the same node as the runner, or keep the pod pending until resources are available. A rough sketch is below.
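A minimal sketch of the affinity the hook could inject into the workflow pod, assuming ARC adds the proposed label (the label key and value here are hypothetical):

```yaml
# Sketch only: actions.github.com/runner-name does not exist yet and would have
# to be set by actions-runner-controller on the runner pod, as proposed above.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            actions.github.com/runner-name: my-runner-abc12-runner-xyz34
        topologyKey: kubernetes.io/hostname
```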

Let me know if this is something I should look into a bit further :smile:

guillaumevillemont commented 10 months ago

FTR I've been testing this PR for the last couple of days and, aside from some fiddling with EFS permissions in RWX mode, it works just fine and solves my issue. Can't wait to see this merged, thanks a lot :pray:

If you're looking for some examples to document, here is my EFS config on EKS, which seems to be bulletproof:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-dynamic-provisioning
  annotations:
    storageclass.kubernetes.io/is-default-class: 'false'
provisioner: efs.csi.aws.com
parameters:
  basePath: /dynamic_provisioning
  directoryPerms: '777'
  ensureUniqueDirectory: 'true'
  fileSystemId: fs-xxxxxxxxxx
  uid: '0'
  gid: '0'
  provisioningMode: efs-ap
  reuseAccessPoint: 'false'
  subPathPattern: ${.PVC.namespace}/${.PVC.name}
reclaimPolicy: Delete
volumeBindingMode: Immediate
```

Setting uid/gid to "0" is required by the workflow pod (actions/checkout@v3, for example, expects the filesystem to be owned by its user, i.e. root), and directoryPerms is 777 so the runner pod can write to _work while running with uid 1001 and gid 123.
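For completeness, a sketch of how this StorageClass could be referenced from the runner scale set's work volume, assuming ARC's gha-runner-scale-set chart (the storage size is illustrative):

```yaml
containerMode:
  type: kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]
    storageClassName: efs-dynamic-provisioning
    resources:
      requests:
        storage: 1Gi
```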

israel-morales commented 10 months ago

Excellent contrib. I will wait for the merge/release before testing on GKE.

This solution is much more elegant than what I would have suggested, which was to attach the workflow pod to the runner as a sidecar. The sidecar approach would also address the volume mount dependency and ensure both containers are properly scheduled by the kube scheduler as a single pod, respecting container-level requests/limits. However, I'm sure there are pitfalls with the sidecar logic.

This PR essentially decouples the runner from the workflow container and, IMO, should be the default configuration, with the caveat of requiring ReadWriteMany volumes as a prerequisite.

Looking forward to this release. Thanks!

johnjeffers commented 10 months ago

I've recently come to use Actions after being on GitLab for a few years. The way GitLab handles self-hosted runners in k8s is simple: you specify whatever Docker image you like in the pipeline code, and that's the one it uses for the runner. There's no concept of runner+worker, no second pod at all. If you want a service, like postgres, redis, or even DinD, it's a sidecar in that pod. So there's no need for ephemeral storage, and no worries about scheduling, because everything happens in a single pod.

I hope that doesn't come off as me coming in here and complaining "why can't you be like GitLab"; that's not my intention at all. I would just like to understand why things are this way with Actions runners. The runner+worker pattern makes configuring and using self-hosted k8s runners a lot more complicated than I was expecting.

nikola-jokic commented 10 months ago

Hey @johnjeffers,

Thank you for raising the question! The GitHub runner is structured as a listener and a worker. The RunnerListener picks up the job, evaluates it, and then passes it down to the worker. That way, the runner can be ready to receive a job before the job is known to it (while the runner is idle). What I understood from your message would be equivalent to pulling the listener's responsibilities into the controller and then creating a runner worker with sidecar containers. Since the listener is part of the runner, the same runner can be configured to run docker or hook commands. Having the listener accept the job allows us to pre-emptively scale up or down (the min-runners field in ARC), so as soon as a job is acquired, a runner is ready to pick it up. It is a trade-off with volumes, but the runner is designed to work on any kind of machine, whether inside a virtual machine or inside k8s.

nikola-jokic commented 10 months ago

@Wielewout as for the suggestion here, it does sound good to me, but it will slightly complicate the hook. The hook was meant to be a simple starting point. However, the suggestion is great in my opinion. So I will propose it to the team and see if there are any objections. Again, thank you for these awesome contributions! And thank you, @guillaumevillemont, for raising this issue and testing it out!

israel-morales commented 9 months ago

@Wielewout @nikola-jokic regarding the suggestion here, is there a way to do this with the existing implementation?

I'm afraid RWX is not an option for us, as the minimum size for a single dynamic GCP filestore csi persistent volume is 1TiB 😞.

Can the actions.github.com/scale-set-name and pod-template-hash labels be used somehow? Thanks!

```
> k get po --show-labels
NAME                                    READY   STATUS    RESTARTS   AGE   LABELS
gke-runner-default-rp68p-runner-vw49d   1/1     Running   0          36m   actions-ephemeral-runner=True,actions.github.com/organization=companyb,actions.github.com/scale-set-name=gke-runner-default,actions.github.com/scale-set-namespace=gh-actions-runners,app.kubernetes.io/component=runner,app.kubernetes.io/part-of=gha-runner-scale-set,app.kubernetes.io/version=0.7.0,pod-template-hash=74647967fb
```

nikola-jokic commented 9 months ago

With the current implementation of the hook, it is not possible. We set the node name explicitly. I think having this feature would require a trivial change on the ARC side to set a label on the pod and potentially an env variable. Then, with the hook extension, I suppose it could be done, but I have to say it would be tricky. I will bring this solution up to the team.

isker commented 4 months ago

> I would just like to understand why things are this way with Actions runners. The runner+worker pattern makes configuring and using self-hosted k8s runners a lot more complicated than I was expecting.

I'm in the middle of trying to switch a self-hosted k8s CI setup from GitLab to GitHub myself. While I certainly don't know why ARC is architected exactly the way it is, I can say that the requirements imposed by the feature set of GitHub Actions demand a more complex implementation.

The biggest requirement that I can see is that a single job can involve multiple containerized actions executed in series (in the job's steps), each working on the same filesystem (which belongs to the job). And with the kubernetes hooks in this repo, this is done with PersistentVolumes. As far as I know, GitLab has nothing like this.
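For example, a single job like the following (an illustrative workflow; the runner label is hypothetical) runs steps in different container images, all working against the same job workspace:

```yaml
jobs:
  build:
    runs-on: my-arc-scale-set       # hypothetical self-hosted scale set
    container: node:20              # job container created by the k8s hook
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test     # runs in the job container
      - name: Package in a different image
        uses: docker://alpine:3.20  # container action sharing the same workspace
        with:
          args: tar czf dist.tgz dist
```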

I've seen many more usages of PersistentVolumes in the k8s hooks implementation, but that one seems to be the most onerous, from an implementer's perspective. I am dreaming of an implementation of the hooks that simply doesn't support features like that one, so that PVs could be ditched 🌞. It seems that the maintainers have tried to find alternatives to using PVs, to no avail.

As it is, I agree with you that it seems pretty hard to achieve the same kinds of outcomes with ARC's k8s mode as you could with GitLab's k8s runner, with respect to things like autoscaling and scheduling.