actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Question: Is it possible to use fargate with the runners? #631

Open frbk opened 3 years ago

frbk commented 3 years ago

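I have been trying to get runners deployed on Fargate and wasn't able to find any info. So far I have encountered a couple of issues:

- The registration-only pod does not inherit labels from spec:template, which causes it to be stuck in limbo. I was able to apply those labels using argocd.
- When both the runner and registration-only pods come up, they seem to crash with an "Error: Client creation failed. authentication failed:" error.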

Here is an example of my config for fargate:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: 4-10-fargate
  namespace: github
spec:
  template:
    metadata:
      labels:
        fargate: "true"
        eks.amazonaws.com/fargate-profile: "github"
    spec:
      repository: <some/repo>
      labels:
        - 4-10-fargate
      resources:
        requests:
          cpu: "4.0"
          memory: "10Gi"
          ephemeral-storage: "5Gi"
      dockerEnabled: false
      image: summerwind/actions-runner-controller

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: 4-10-fargate
  namespace: github
spec:
  scaleTargetRef:
    name: 4-10-fargate
  minReplicas: 0
  maxReplicas: 64
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - <some/repo>

Please let me know if you have any suggestions.

mumoshu commented 3 years ago

@frbk Hey!

> registration-only pod does not inherit labels from spec:template which causes it to be stuck in the limbo. I was able to apply those labels using argocd.

This is working as intended but might be affecting your use case, because Fargate requires your pods to have certain labels so that it can discover which pods to deploy onto it. Perhaps we need to fix how actions-runner-controller creates a registration-only runner pod, so that it doesn't rely on empty labels. Or perhaps you can wait for GitHub to add some API and system changes so that we can scale from/to zero without having a registration-only runner. https://github.com/actions-runner-controller/actions-runner-controller/issues/470#issuecomment-841428853
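
The label requirement can be illustrated with a small model of Fargate profile selection. This is a minimal sketch, not AWS code: it assumes a profile selector is a namespace plus optional labels, and that a pod matches when its namespace is equal and it carries every label in the selector (wildcard matching is ignored). The `Selector` class, the function names, and the example profile contents are hypothetical; the actual profile used in this thread was not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Selector:
    """Simplified Fargate profile selector: a namespace plus optional labels."""
    namespace: str
    labels: dict = field(default_factory=dict)


def matches(selector, pod_namespace, pod_labels):
    """True if the pod is in the selector's namespace and has all its labels."""
    if selector.namespace != pod_namespace:
        return False
    return all(pod_labels.get(k) == v for k, v in selector.labels.items())


def schedulable_on_fargate(selectors, pod_namespace, pod_labels):
    """A pod is picked up if it matches at least one selector in the profile."""
    return any(matches(s, pod_namespace, pod_labels) for s in selectors)


# Hypothetical profile: selects pods in "github" labelled fargate=true.
profile = [Selector(namespace="github", labels={"fargate": "true"})]

# Labels copied from the RunnerDeployment template in this thread.
runner_labels = {"fargate": "true", "eks.amazonaws.com/fargate-profile": "github"}

# A registration-only pod created with empty labels.
registration_only_labels = {}

print(schedulable_on_fargate(profile, "github", runner_labels))
print(schedulable_on_fargate(profile, "github", registration_only_labels))
```

Under these assumptions the pod with empty labels never matches the selector, which is consistent with the registration-only pod staying unscheduled.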

> When both runner and registration-only pods come up they seem to crash with this error:

The error says that you're trying to deploy it as a GitHub app and the private key you've provided was invalid. Check the content of the K8s secret that contains the private key.

mumoshu commented 3 years ago

And most importantly, does Fargate support deploying privileged containers today? In a standard setup, your runner pods and containers need to be privileged to work, especially for docker-in-docker. I thought there was some way to run dind without privileges, but I think you would need to set privileged: false on your runner spec and figure out the other settings to make it work on Fargate.

frbk commented 3 years ago

Hey @mumoshu. Thanks for the reply. Is privileged: false part of the helm chart? Also, I am reusing the same token that I use without Fargate. I deployed two types of runners, a Fargate one and a normal one that just uses machines, and the normal one worked fine with that token, but I will investigate. Fargate doesn't work with privileged, sadly. For my use case I don't need it, because I am trying to run a bunch of rspec tests in the runner with some services and was planning on adding those services as sidecars.

frbk commented 3 years ago

I kinda assumed that I could replicate what GitLab CI is doing.

mumoshu commented 3 years ago

> For my use case I don't need it because I am trying to run a bunch of rspec tests in the runner with some services and was planning on adding those services as sidecars.

@frbk Ah, gotcha! Then it should theoretically work if you set dockerEnabled: false https://github.com/actions-runner-controller/actions-runner-controller/blob/dc5f90025cdf5382d8d1b347483dacf0f3d3757b/api/v1alpha1/runner_types.go#L100-L101

But the private key issue would still be a blocker. BTW, to be extra clear: which pod showed the Error: Client creation failed. authentication failed: log? The actions-runner-controller pod, or a runner pod?

mumoshu commented 3 years ago

> privileged: false part of the helm chart?

Nope. It's computed depending on the runner spec provided by you. https://github.com/actions-runner-controller/actions-runner-controller/blob/dc5f90025cdf5382d8d1b347483dacf0f3d3757b/controllers/runner_controller.go#L705

frbk commented 3 years ago

I get this error on the runner pod. actions-runner-controller is good. It seems that the secret is not being mounted when I use fargate. I am going to try mounting it in RunnerDeployment and see if that works.

frbk commented 3 years ago

I have done a bit more investigating and these are the findings. It looks like the runner pod is not mounting secrets when running on Fargate. I was able to solve this by mounting these secrets in the RunnerDeployment, which now looks like this:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: 4-10-fargate
  namespace: github
spec:
  template:
    metadata:
      labels:
        fargate: "true"
        eks.amazonaws.com/fargate-profile: "github"
    spec:
      serviceAccountName: "actions-runner-controller"
      repository: <some/repo>
      labels:
        - 4-10-fargate
      resources:
        requests:
          cpu: "4.0"
          memory: "10Gi"
          ephemeral-storage: "5Gi"
      dockerEnabled: false
      image: summerwind/actions-runner-controller
      env:
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: controller-manager
              key: github_token
              optional: true
        - name: GITHUB_APP_ID
          valueFrom:
            secretKeyRef:
              name: controller-manager
              key: github_app_id
              optional: true
        - name: GITHUB_APP_INSTALLATION_ID
          valueFrom:
            secretKeyRef:
              name: controller-manager
              key: github_app_installation_id
              optional: true
        - name: GITHUB_APP_PRIVATE_KEY
          value: /etc/actions-runner-controller/github_app_private_key
      volumeMounts:
        - name: controller-manager
          mountPath: "/etc/actions-runner-controller"
          readOnly: true
        - mountPath: /tmp/k8s-webhook-server/serving-certs
          name: cert
          readOnly: true
      volumes:
        - name: controller-manager
          secret:
            secretName: controller-manager
        - name: cert
          secret:
            defaultMode: 420
            secretName: webhook-server-cert

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: 4-10-fargate
  namespace: github
spec:
  scaleTargetRef:
    name: 4-10-fargate
  minReplicas: 0
  maxReplicas: 64
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - <some/repo>

However, this doesn't seem to work because the runner gets stuck on authentication. It also looks like the runner gets converted into a manager. Here is an example of the log:

2021-06-16T15:34:37.876Z    INFO    controller-runtime.metrics  metrics server is starting to listen    {"addr": ":8080"}
2021-06-16T15:34:37.877Z    INFO    actions-runner-controller   Initializing actions-runner-controller  {"github-api-cahce-duration": "9m50s", "sync-period": "10m0s", "runner-image": "summerwind/actions-runner:latest", "docker-image": "docker:dind", "common-runnner-labels": null, "watch-namespace": ""}
2021-06-16T15:34:37.877Z    INFO    controller-runtime.builder  Registering a mutating webhook  {"GVK": "actions.summerwind.dev/v1alpha1, Kind=Runner", "path": "/mutate-actions-summerwind-dev-v1alpha1-runner"}
2021-06-16T15:34:37.877Z    INFO    controller-runtime.webhook  registering webhook {"path": "/mutate-actions-summerwind-dev-v1alpha1-runner"}
2021-06-16T15:34:37.877Z    INFO    controller-runtime.builder  Registering a validating webhook    {"GVK": "actions.summerwind.dev/v1alpha1, Kind=Runner", "path": "/validate-actions-summerwind-dev-v1alpha1-runner"}
2021-06-16T15:34:37.877Z    INFO    controller-runtime.webhook  registering webhook {"path": "/validate-actions-summerwind-dev-v1alpha1-runner"}
2021-06-16T15:34:37.877Z    INFO    controller-runtime.builder  Registering a mutating webhook  {"GVK": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "path": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment"}
2021-06-16T15:34:37.877Z    INFO    controller-runtime.webhook  registering webhook {"path": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment"}
2021-06-16T15:34:37.877Z    INFO    controller-runtime.builder  Registering a validating webhook    {"GVK": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "path": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment"}
2021-06-16T15:34:37.877Z    INFO    controller-runtime.webhook  registering webhook {"path": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment"}
2021-06-16T15:34:37.877Z    INFO    controller-runtime.builder  Registering a mutating webhook  {"GVK": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "path": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset"}
2021-06-16T15:34:37.877Z    INFO    controller-runtime.webhook  registering webhook {"path": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset"}
2021-06-16T15:34:37.877Z    INFO    controller-runtime.builder  Registering a validating webhook    {"GVK": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "path": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset"}
2021-06-16T15:34:37.877Z    INFO    controller-runtime.webhook  registering webhook {"path": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset"}
2021-06-16T15:34:37.877Z    INFO    actions-runner-controller   starting manager
2021-06-16T15:34:37.877Z    INFO    controller-runtime.manager  starting metrics server {"path": "/metrics"}
2021-06-16T15:34:37.977Z    INFO    controller-runtime.webhook.webhooks starting webhook server
2021-06-16T15:34:37.977Z    INFO    controller-runtime.controller   Starting EventSource    {"controller": "runnerreplicaset-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:37.978Z    INFO    controller-runtime.controller   Starting EventSource    {"controller": "runnerreplicaset-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:37.978Z    INFO    controller-runtime.controller   Starting EventSource    {"controller": "horizontalrunnerautoscaler-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:37.978Z    INFO    controller-runtime.controller   Starting EventSource    {"controller": "runner-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:37.979Z    INFO    controller-runtime.controller   Starting EventSource    {"controller": "runnerdeployment-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:37.978Z    INFO    controller-runtime.certwatcher  Updated current TLS certificate
2021-06-16T15:34:37.979Z    INFO    controller-runtime.webhook  serving webhook server  {"host": "", "port": 9443}
2021-06-16T15:34:37.979Z    INFO    controller-runtime.certwatcher  Starting certificate watcher
2021-06-16T15:34:38.078Z    INFO    controller-runtime.controller   Starting Controller {"controller": "runnerreplicaset-controller"}
2021-06-16T15:34:38.078Z    INFO    controller-runtime.controller   Starting Controller {"controller": "horizontalrunnerautoscaler-controller"}
2021-06-16T15:34:38.079Z    INFO    controller-runtime.controller   Starting EventSource    {"controller": "runner-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:38.079Z    INFO    controller-runtime.controller   Starting EventSource    {"controller": "runnerdeployment-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:38.079Z    INFO    controller-runtime.controller   Starting Controller {"controller": "runnerdeployment-controller"}
2021-06-16T15:34:38.179Z    INFO    controller-runtime.controller   Starting workers    {"controller": "horizontalrunnerautoscaler-controller", "worker count": 1}
2021-06-16T15:34:38.179Z    INFO    controller-runtime.controller   Starting workers    {"controller": "runnerreplicaset-controller", "worker count": 1}
2021-06-16T15:34:38.179Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 1    {"horizontalrunnerautoscaler": "github/4-10-fargate", "suggested": 1, "reserved": 0, "min": 1, "cached": 1, "max": 64}
2021-06-16T15:34:38.179Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "horizontalrunnerautoscaler-controller", "request": "github/4-10-fargate"}
2021-06-16T15:34:38.179Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runnerreplicaset-controller", "request": "github/4-10-fargate-mfrgk"}
2021-06-16T15:34:38.279Z    INFO    controller-runtime.controller   Starting Controller {"controller": "runner-controller"}
2021-06-16T15:34:38.279Z    INFO    controller-runtime.controller   Starting workers    {"controller": "runnerdeployment-controller", "worker count": 1}
2021-06-16T15:34:38.280Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runnerdeployment-controller", "request": "github/4-10-fargate"}
2021-06-16T15:34:38.379Z    INFO    controller-runtime.controller   Starting workers    {"controller": "runner-controller", "worker count": 1}
2021-06-16T15:34:38.380Z    INFO    actions-runner-controller.runner    Skipped registration check because it's deferred until 2021-06-16 15:35:29 +0000 UTC. Retrying in 49.619892818s at latest   {"runner": "github/4-10-fargate-mfrgk-9sprx", "lastRegistrationCheckTime": "2021-06-16 15:34:29 +0000 UTC", "registrationCheckInterval": "1m0s"}
2021-06-16T15:35:28.125Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runnerreplicaset-controller", "request": "github/4-10-fargate-mfrgk"}
2021-06-16T15:35:28.276Z    DEBUG   actions-runner-controller.runner    Runner pod exists but we failed to check if runner is busy. Apparently it still needs more time.    {"runner": "github/4-10-fargate-mfrgk-9sprx", "runnerName": "4-10-fargate-mfrgk-9sprx"}
2021-06-16T15:35:28.276Z    DEBUG   actions-runner-controller.runner    Rechecking the runner registration in 1m10.468889844s   {"runner": "github/4-10-fargate-mfrgk-9sprx"}
2021-06-16T15:35:28.288Z    INFO    actions-runner-controller.runner    Skipped registration check because it's deferred until 2021-06-16 15:36:28 +0000 UTC. Retrying in 58.711814172s at latest   {"runner": "github/4-10-fargate-mfrgk-9sprx", "lastRegistrationCheckTime": "2021-06-16 15:35:28 +0000 UTC", "registrationCheckInterval": "1m0s"}
2021-06-16T15:36:27.136Z    DEBUG   actions-runner-controller.runner    Runner pod exists but we failed to check if runner is busy. Apparently it still needs more time.    {"runner": "github/4-10-fargate-mfrgk-9sprx", "runnerName": "4-10-fargate-mfrgk-9sprx"}
2021-06-16T15:36:27.136Z    DEBUG   actions-runner-controller.runner    Rechecking the runner registration in 1m10.283034151s   {"runner": "github/4-10-fargate-mfrgk-9sprx"}
2021-06-16T15:36:27.139Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runnerreplicaset-controller", "request": "github/4-10-fargate-mfrgk"}
2021-06-16T15:36:27.149Z    INFO    actions-runner-controller.runner    Skipped registration check because it's deferred until 2021-06-16 15:37:27 +0000 UTC. Retrying in 58.850736308s at latest   {"runner": "github/4-10-fargate-mfrgk-9sprx", "lastRegistrationCheckTime": "2021-06-16 15:36:27 +0000 UTC", "registrationCheckInterval": "1m0s"}

mumoshu commented 3 years ago

@frbk Thanks. At a glance, the image: summerwind/actions-runner-controller you've written in the RunnerDeployment spec is indeed wrong, as you are basically saying "use this controller image to run this runner", which results in what you see. Or are you saying that Fargate is somehow setting image: summerwind/actions-runner-controller?

mumoshu commented 3 years ago

FYI, you can use summerwind/actions-runner images https://hub.docker.com/r/summerwind/actions-runner/tags?page=1&ordering=last_updated

frbk commented 3 years ago

OMG! Thanks for pointing out that I was using the wrong image. I am going to update it and redeploy. Will update you shortly!

mumoshu commented 3 years ago

@frbk Thanks for confirming! To be extra sure, let me point out that you should omit env entries like GITHUB_TOKEN. The necessary envs are configured by the controller, so you shouldn't need to set them yourself. Please share your latest RunnerDeployment YAML and I can verify whether it's good or bad!
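
The two mistakes discussed so far (using the controller image for the runner, and setting controller-side env vars on the runner) can be caught with a small check before applying the manifest. This is only a sketch under stated assumptions: `lint_runner_spec`, `image_name`, and `CONTROLLER_SIDE_ENV` are hypothetical names, not part of actions-runner-controller, and the spec is assumed to be already parsed into a plain dict.

```python
# Env names taken from the RunnerDeployment posted earlier in this thread.
CONTROLLER_SIDE_ENV = {
    "GITHUB_TOKEN",
    "GITHUB_APP_ID",
    "GITHUB_APP_INSTALLATION_ID",
    "GITHUB_APP_PRIVATE_KEY",
}


def image_name(image):
    """Return the last path component of an image reference, without tag or digest."""
    return image.rsplit("/", 1)[-1].split("@")[0].split(":")[0]


def lint_runner_spec(spec):
    """Return warnings for a RunnerDeployment spec.template.spec given as a dict."""
    warnings = []
    if image_name(spec.get("image", "")) == "actions-runner-controller":
        warnings.append(
            "image is the controller image; use a runner image such as "
            "summerwind/actions-runner"
        )
    for env in spec.get("env", []):
        if env.get("name") in CONTROLLER_SIDE_ENV:
            warnings.append(f"env {env['name']} should be omitted from the runner spec")
    return warnings


broken = {
    "image": "summerwind/actions-runner-controller",
    "env": [{"name": "GITHUB_TOKEN"}],
}
fixed = {"image": "summerwind/actions-runner", "dockerEnabled": False}

for warning in lint_runner_spec(broken):
    print(warning)
print(lint_runner_spec(fixed))
```

The image check compares only the final path component, so a tagged reference such as summerwind/actions-runner-controller:latest would also be flagged.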

frbk commented 3 years ago

@mumoshu Here is an updated config which seems to work on Fargate:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: 4-10-fargate
  namespace: github
spec:
  template:
    metadata:
      labels:
        fargate: "true"
        eks.amazonaws.com/fargate-profile: "github"
    spec:
      repository: <some/repo>
      labels:
        - 4-10-fargate
      resources:
        requests:
          cpu: "4.0"
          memory: "10Gi"
          ephemeral-storage: "5Gi"
      dockerEnabled: false
      image: summerwind/actions-runner
      sidecarContainers:
        - name: mysql
          image: mysql:latest
          env:
            - name: MYSQL_USER
              value: root
            - name: MYSQL_ALLOW_EMPTY_PASSWORD
              value: "true"
        - name: elasticsearch
          image: elasticsearch:latest
        - name: redis
          image: redis:latest
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: 4-10-fargate
  namespace: github
spec:
  scaleTargetRef:
    name: 4-10-fargate
  minReplicas: 0
  maxReplicas: 64
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - <some/repo>

I only had to manually update the config for the registration-only pod to include labels, as I mentioned before.

mumoshu commented 3 years ago

@frbk Awesome! Thanks a lot for sharing your experience!

> I only had to manually update config for registration-only pod to include labels as I mentioned before.

I was thinking about this a bit. It can possibly be automated by just removing this line from the actions-runner-controller code:

https://github.com/actions-runner-controller/actions-runner-controller/blob/f2e2060ff8cbba6ab18e898e240ddf4afd65eb27/controllers/runnerreplicaset_controller.go#L162

It would be great if you could try removing the code, building and pushing a custom image by running DOCKER_USER=$YOUR_DOCKERHUB_ACCOUNT_NAME make docker-build docker-push, and redeploying your controller to see if it resolves your issue 🙏

FYI, you can find definitions for docker-build and docker-push targets at https://github.com/actions-runner-controller/actions-runner-controller/blob/f2e2060ff8cbba6ab18e898e240ddf4afd65eb27/Makefile#L120-L122 and https://github.com/actions-runner-controller/actions-runner-controller/blob/f2e2060ff8cbba6ab18e898e240ddf4afd65eb27/Makefile#L137-L139.

frbk commented 3 years ago

Thanks for the info! Will give this a shot.

frbk commented 3 years ago

@mumoshu I tried your suggestion, and removing runnerForScaleFromToZero.ObjectMeta.Labels = nil seemed to work! :tada:

mumoshu commented 3 years ago

@frbk Awesome! Just to be sure, did scaling both to and from zero work, and do the replica numbers shown in kubectl get runnerdeployment seem correct?

frbk commented 3 years ago

Looks like it @mumoshu . Example of nothing running on the ci:

NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
4-10-fargate         0         0         0            0           7m27s

Executed one job on the ci:

NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
4-10-fargate         1         1         1            0           18m

For testing purposes I set maxReplicas to 1.
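
The replica counts above are consistent with a simple clamping rule. The following is a minimal sketch, assuming (from the metric's name) that TotalNumberOfQueuedAndInProgressWorkflowRuns suggests one runner per queued or in-progress run and that the result is bounded by minReplicas and maxReplicas. The function name is hypothetical, and the real controller tracks more state than this (the log earlier in the thread also shows "reserved" and "cached" values).

```python
def desired_replicas(queued, in_progress, min_replicas, max_replicas):
    """Suggest one runner per pending run, bounded by the autoscaler limits."""
    suggested = queued + in_progress
    return max(min_replicas, min(suggested, max_replicas))


# minReplicas: 0, maxReplicas: 1, as in the test described above.
print(desired_replicas(queued=0, in_progress=0, min_replicas=0, max_replicas=1))
print(desired_replicas(queued=1, in_progress=0, min_replicas=0, max_replicas=1))

# With the original maxReplicas: 64, an 18-job pipeline fits without capping.
print(desired_replicas(queued=10, in_progress=8, min_replicas=0, max_replicas=64))
```

With minReplicas set to 0, an idle queue yields zero desired runners, which is the scale-to-zero case being verified here.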

frbk commented 3 years ago

Just finished running a pipeline with 18 jobs in it and it was able to scale up and down with no issues :tada:

mumoshu commented 3 years ago

@frbk Thanks a lot for confirming! Let me add this to our documentation with a big "thanks to @frbk" note, and also apply the patch https://github.com/actions-runner-controller/actions-runner-controller/issues/631#issuecomment-862959111 to our main branch so that you no longer need to use the fork just for the one-line change.

As this is an open-source and open-development project, I would also appreciate it very much if you could submit a pull request for either (or even both) of the changes yourself!

frbk commented 3 years ago

I'm going to open a PR covering everything we talked about in this issue. I was gathering some info for the documentation.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mumoshu commented 2 years ago

> @mumoshu Tried your suggestions and removing runnerForScaleFromToZero.ObjectMeta.Labels = nil seemed to work!

Probably this code change on runnerForScaleFromToZero isn't needed anymore. We no longer create registration-only runners for scale-from-to-zero in recent versions of ARC.

mumoshu commented 2 years ago

@frbk Hey! How have your Fargate-based runners been working since then?

frbk commented 2 years ago

Hey @mumoshu. I have since moved to another company, but the runners were working fine as long as you didn't need docker in docker. I will try to provide a bit more info later this week; I just need to go over my old notes. Also, I see you changed the implementation for scaling from zero. I will try this over the weekend and let you know.

frbk commented 2 years ago

Give me a couple more days. I had to set up a test EKS cluster and it took a bit longer than I was expecting. I will update after I try out the latest version of the controller.

frbk commented 2 years ago

Didn't forget about this. Schedule is a bit all over the place at the moment. 😭

mumoshu commented 2 years ago

@frbk Thanks! I'm looking forward to your report ☺️

NoamGoren commented 2 years ago

Hi @frbk @mumoshu, I'm looking to implement the runner on Fargate as well. Is there anything I should be aware of? Is the removal of runnerForScaleFromToZero.ObjectMeta.Labels = nil still needed?

mumoshu commented 2 years ago

@NoamGoren Honestly, I have never tried it myself so I'm afraid I have nothing to share with you! What I can say, FWIW, is that ARC does not rely on registration-only runners anymore. So there may be a chance that it would work without any modifications now.

jerry153fish commented 2 years ago

Hi @mumoshu, Fargate still does not support privileged containers. Is there a way around this to make docker work? If docker is disabled, the use cases are very limited.

pkordes commented 1 year ago

I thought the privileged containers were only necessary when using the docker sidecar? DinD has many implementations that do not require privileged?

gabegreenwood commented 1 year ago

@mumoshu We are trying to use ARC with fargate, and I've come up with a very simple working hello-world deployment config that works, but it requires dockerEnabled: false in order to run, and I'm not clear on what exactly that entails. What is the Docker in Docker implementation on ARC and why is it important? Will I be able to run docker on my pods at all with this configuration? Here's the config:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: gg-test-org-deployment-0
spec:
  replicas: 1
  template:
    metadata:
      labels:
        provider: fargate
    spec:
      organization: gg-test-org
      dockerEnabled: false

mumoshu commented 1 year ago

Without dind, you cannot use service containers, container-based actions, and container-based steps in GitHub Actions! However, dind requires privileged containers, which are not available in Fargate. Have you already tried Kubernetes container mode in ARC? Perhaps it has more possibility of success, although I have never tried it with Fargate.
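
To see whether a given workflow depends on the features listed above, one can scan it for the relevant keys. This is a rough sketch, assuming the workflow YAML has already been parsed into a dict; `docker_dependent_features` and the example workflow are hypothetical. It looks only at `services`, `container`, and `docker://` steps in the workflow file itself, so it cannot detect container-based actions, whose Docker requirement is declared in the action's own metadata.

```python
def docker_dependent_features(workflow):
    """List (job_id, feature) pairs that would need Docker on the runner."""
    findings = []
    for job_id, job in workflow.get("jobs", {}).items():
        if job.get("container"):
            findings.append((job_id, "job container"))
        if job.get("services"):
            findings.append((job_id, "service containers"))
        for step in job.get("steps", []):
            if str(step.get("uses", "")).startswith("docker://"):
                findings.append((job_id, "docker:// step"))
    return findings


# Hypothetical workflow, already parsed from YAML into a dict.
workflow = {
    "jobs": {
        "rspec": {
            "runs-on": ["self-hosted", "4-10-fargate"],
            "services": {"mysql": {"image": "mysql:latest"}},
            "steps": [{"run": "bundle exec rspec"}],
        },
        "hello": {
            "runs-on": ["self-hosted"],
            "steps": [{"run": "echo hello"}],
        },
    }
}

for job_id, feature in docker_dependent_features(workflow):
    print(f"{job_id}: uses {feature}")
```

In this example only the rspec job is flagged. Moving its services into sidecarContainers on the RunnerDeployment, as done earlier in this thread, is one way to drop that dependency.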

neptune19821220 commented 1 year ago

After setting dockerEnabled: false, we can use ARC on AWS EKS Fargate. Although the runner can handle fewer kinds of jobs without privileged permissions, it works well for simple jobs such as syncing objects between AWS partitions.