cloudbase / garm

GitHub Actions Runner Manager
Apache License 2.0

support for control plane highly available #260

Open pathcl opened 5 months ago

pathcl commented 5 months ago

Dear folks,

I'm reading through the garm codebase and spotted that there's support for MySQL. Is that enough to configure garm as a highly available control plane? My use case is on top of k8s.

gabriel-samfira commented 5 months ago

It used to have MySQL, but it became somewhat difficult to maintain, so we kind of dropped MySQL support.

But if you're trying to use it in k8s, I encourage you to use the garm-operator. The operator pretty much treats GARM as stateless and syncs the sqlite DB using the info it has stored in etcd.

The current push to move some things from the config to the DB is being done so that GARM can eventually scale out. Scaling out GARM is on the TODO list and we're working towards it, but even in its current state, it handles a large number of runners with ease.

pathcl commented 5 months ago

Great! Thanks for the quick reply. I tried the k8s operator, but as I understood it, it also requires a garm instance that runs outside the cluster or is otherwise reachable. Is this correct?

gabriel-samfira commented 5 months ago

You can have GARM run inside k8s without a problem. Have a look here:

https://github.com/mercedes-benz/garm-provider-k8s/blob/main/DEVELOPMENT.md

The instructions use tilt to bootstrap a local development environment along with garm, the operator and the k8s provider. You can use that as a starting point and expand to other providers you may need.

We need to add some proper docs in one place that give a nice walk-through for the various cases.

pathcl commented 5 months ago

> You can have GARM run inside k8s without a problem. Have a look here:
>
> https://github.com/mercedes-benz/garm-provider-k8s/blob/main/DEVELOPMENT.md
>
> The instructions use tilt to bootstrap a local development environment along with garm, the operator and the k8s provider. You can use that as a starting point and expand to other providers you may need.
>
> We need to add some proper docs in one place that give a nice walk-through for the various cases.

Thanks for sharing that! Would you say that guide is all that's needed to get started? I can improve the docs once I get familiar with it.

gabriel-samfira commented 5 months ago

That should bring you up and running with a fully functional GARM on k8s + operator. I usually run it as stand-alone, but I did manage to get it running using that guide.

@bavarianbidi may be able to chime in with more details. His wonderful team develops the k8s integration (operator and provider)

pathcl commented 5 months ago

> That should bring you up and running with a fully functional GARM on k8s + operator. I usually run it as stand-alone, but I did manage to get it running using that guide.
>
> @bavarianbidi may be able to chime in with more details. His wonderful team develops the k8s integration (operator and provider)

Are you using any specific commit? I can't get garm deployed.

garm-provider-k8s $ make tilt-up
hack/scripts/kind-with-registry.sh
No kind clusters found.
Creating cluster "garm" ...
 ✓ Ensuring node image (kindest/node:v1.28.7) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-garm"
You can now use your cluster with:

kubectl cluster-info --context kind-garm

Thanks for using kind! 😊
configmap/local-registry-hosting created
tilt up
Tilt started on http://localhost:10350/
v0.33.16, built 2024-06-07

(space) to open the browser
(s) to stream logs (--stream=true)
(t) to open legacy terminal mode (--legacy=true)
(ctrl-c) to exit

garm-provider-k8s $ git remote -v
origin  git@github.com:mercedes-benz/garm-provider-k8s (fetch)
origin  git@github.com:mercedes-benz/garm-provider-k8s (push)

garm-provider-k8s $ git rev-parse HEAD
b45a9889943b80d5d6e8222ab6c22a5f59e02157

garm-provider-k8s $ kubectl get pods -A
NAMESPACE            NAME                                         READY   STATUS    RESTARTS   AGE
kube-system          coredns-5dd5756b68-72csd                     1/1     Running   0          10s
kube-system          coredns-5dd5756b68-mmlqp                     1/1     Running   0          10s
kube-system          etcd-garm-control-plane                      1/1     Running   0          25s
kube-system          kindnet-5m6t5                                1/1     Running   0          11s
kube-system          kube-apiserver-garm-control-plane            1/1     Running   0          27s
kube-system          kube-controller-manager-garm-control-plane   1/1     Running   0          25s
kube-system          kube-proxy-r5ffl                             1/1     Running   0          11s
kube-system          kube-scheduler-garm-control-plane            1/1     Running   0          25s
local-path-storage   local-path-provisioner-7577fdbbfb-9q9ks      1/1     Running   0          10s

Garm should be deployed according to step 3) in https://github.com/mercedes-benz/garm-provider-k8s/blob/main/DEVELOPMENT.md#getting-started 🤔

gabriel-samfira commented 5 months ago

I used main but you may need to add description = "garm credentials" here:

https://github.com/mercedes-benz/garm-provider-k8s/blob/main/hack/local-development/kubernetes/configmap-envsubst.yaml#L47

But other than that, I just installed docker, kubectl, tilt and go, and went through the steps.

gabriel-samfira commented 5 months ago

you can also edit the existing config map:

kubectl -n garm-server edit configmap garm-configuration

and add it. Then remove the failing containers.

At the end you should have something like:

root@garm-deleteme:~# kubectl get pods -A
NAMESPACE              NAME                                                READY   STATUS    RESTARTS   AGE
cert-manager           cert-manager-5bd57786d4-jmwdj                       1/1     Running   0          58m
cert-manager           cert-manager-cainjector-57657d5754-89fwt            1/1     Running   0          58m
cert-manager           cert-manager-webhook-7d9f8748d4-npk9b               1/1     Running   0          58m
garm-operator-system   garm-operator-controller-manager-69fbd5c478-ctlqt   1/1     Running   0          47m
garm-server            garm-server-5b84b7f66-r7mxp                         1/1     Running   0          48m
kube-system            coredns-5dd5756b68-g8k87                            1/1     Running   0          58m
kube-system            coredns-5dd5756b68-wzxwj                            1/1     Running   0          58m
kube-system            etcd-garm-control-plane                             1/1     Running   0          59m
kube-system            kindnet-7r7mh                                       1/1     Running   0          58m
kube-system            kube-apiserver-garm-control-plane                   1/1     Running   0          59m
kube-system            kube-controller-manager-garm-control-plane          1/1     Running   0          59m
kube-system            kube-proxy-jz67s                                    1/1     Running   0          58m
kube-system            kube-scheduler-garm-control-plane                   1/1     Running   0          59m
local-path-storage     local-path-provisioner-7577fdbbfb-9bpx4             1/1     Running   0          58m

pathcl commented 5 months ago

> you can also edit the existing config map:
>
> kubectl -n garm-server edit configmap garm-configuration
>
> and add it. Then remove the failing containers.
>
> At the end you should have something like:
>
> root@garm-deleteme:~# kubectl get pods -A
> NAMESPACE              NAME                                                READY   STATUS    RESTARTS   AGE
> cert-manager           cert-manager-5bd57786d4-jmwdj                       1/1     Running   0          58m
> cert-manager           cert-manager-cainjector-57657d5754-89fwt            1/1     Running   0          58m
> cert-manager           cert-manager-webhook-7d9f8748d4-npk9b               1/1     Running   0          58m
> garm-operator-system   garm-operator-controller-manager-69fbd5c478-ctlqt   1/1     Running   0          47m
> garm-server            garm-server-5b84b7f66-r7mxp                         1/1     Running   0          48m
> kube-system            coredns-5dd5756b68-g8k87                            1/1     Running   0          58m
> kube-system            coredns-5dd5756b68-wzxwj                            1/1     Running   0          58m
> kube-system            etcd-garm-control-plane                             1/1     Running   0          59m
> kube-system            kindnet-7r7mh                                       1/1     Running   0          58m
> kube-system            kube-apiserver-garm-control-plane                   1/1     Running   0          59m
> kube-system            kube-controller-manager-garm-control-plane          1/1     Running   0          59m
> kube-system            kube-proxy-jz67s                                    1/1     Running   0          58m
> kube-system            kube-scheduler-garm-control-plane                   1/1     Running   0          59m
> local-path-storage     local-path-provisioner-7577fdbbfb-9bpx4             1/1     Running   0          58m

Thanks for the tip about the configmap. I was able to fix that part. But now I'm getting a different error:

$ kubectl get pool -n garm-operator-system -o yaml

  status:
    id: ""
    lastSyncError: referenced GitHubScopeRef Organization/my-org-here not ready yet
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

I created a PAT (classic token) but I'm not sure what's going on. I followed this: https://github.com/mercedes-benz/garm-operator/blob/main/DEVELOPMENT.md#%EF%B8%8F-bootstrap-garm-server-with-garm-provider-k8s-for-local-development. Did you use a GitHub App for authentication?

gabriel-samfira commented 5 months ago

I used PAT auth. Make sure that the PAT you're using has access to the org/repo/enterprise you're creating and that you enabled the required scopes when creating the PAT. See:

https://github.com/cloudbase/garm/blob/main/doc/github_credentials.md

gabriel-samfira commented 5 months ago

ahh. I think I know what's happening. The operator is not yet updated to take into account the recent changes to GARM regarding the URLs. Try adding:

webhook_url = "http://garm-server.garm-server.svc:9997/webhooks"

here:

https://github.com/mercedes-benz/garm-provider-k8s/blob/b45a9889943b80d5d6e8222ab6c22a5f59e02157/hack/local-development/kubernetes/configmap-envsubst.yaml#L12

If you can connect using garm-cli to the garm server, you can also update using garm-cli controller update.

gabriel-samfira commented 5 months ago

I think it would be best if you switch garm to v0.1.4. The main branch has a bunch of updates and the operator has not caught up yet.

You can set v0.1.4 here:

https://github.com/mercedes-benz/garm-provider-k8s/blob/b45a9889943b80d5d6e8222ab6c22a5f59e02157/hack/Dockerfile#L6

gabriel-samfira commented 5 months ago

also, to get webhooks from GitHub, you'll most likely need an ingress controller and a cluster IP set on the GARM server. Then you'll need to add your webhook in GitHub to point to your GARM webhook URL.

See: https://github.com/cloudbase/garm/blob/v0.1.4/doc/webhooks.md

pathcl commented 5 months ago

> also, to get webhooks from GitHub, you'll most likely need an ingress controller and a cluster IP set on the GARM server. Then you'll need to add your webhook in GitHub to point to your GARM webhook URL.

That's right, but would I need a webhook to have a pool of runners working? I'm not sure.

BTW, thanks to your help it worked! I'm noticing these are configured as ephemeral runners by default:

garm-provider-k8s $ kubectl get pods -n runner
NAME                READY   STATUS      RESTARTS   AGE
garm-d7appm3tvuks   0/1     Completed   0          5m28s
garm-evctriwqbovn   0/1     Completed   0          5m28s

Do you think we could have GitHub Apps supported? It looks like support is already there on the garm-server side, but we're missing some bits between the release of v0.1.5 and garm-provider-k8s.

gabriel-samfira commented 5 months ago

You don't need webhooks for pools to work, but you do need them to know when to spin up a runner and when to delete it. Otherwise you'll have huge delays between when a job is started and when a runner is spun up.

GitHub App support will probably be added once 0.1.5 is released, depending on how much time the nice folks from mercedes-benz have.

gabriel-samfira commented 5 months ago

GARM only spins up ephemeral runners. No persistent runners.

pathcl commented 5 months ago

> GARM only spins up ephemeral runners. No persistent runners.

In any case, the runners that were spawned didn't run anything. Log:

An error occurred: Not configured. Run config.(sh/cmd) to configure the runner.
Runner listener exit with terminated error, stop the service, no retry needed.
Exiting runner...

They did register with github.com, but they were not able to run any workflow 😒

gabriel-samfira commented 5 months ago

I think the summerwind image used by default can't handle JIT configs. You need to either disable JIT or build and use the "upstream" image.

The upstream image:

https://github.com/mercedes-benz/garm-provider-k8s/tree/main/runner/upstream

To disable JIT, add:

disable_jit_config = true

In the provider section of the config:

https://github.com/mercedes-benz/garm-provider-k8s/blob/b45a9889943b80d5d6e8222ab6c22a5f59e02157/hack/local-development/kubernetes/configmap-envsubst.yaml#L39

gabriel-samfira commented 5 months ago

Context for the image:

https://github.com/mercedes-benz/garm-provider-k8s/pull/8

gabriel-samfira commented 5 months ago

@pathcl you will most likely need to apply this patch as well:

https://github.com/mercedes-benz/garm-provider-k8s/pull/52

to build:

cd garm-provider-k8s/runner/upstream
docker build -t localhost:5000/runner-default:latest .
docker push localhost:5000/runner-default:latest

Then just apply the new image:

kubectl -n garm-operator-system patch image runner-default --type=merge --patch '{"spec": { "tag": "localhost:5000/runner-default:latest"}}'

And you should be fine with both JIT and registration token.

pathcl commented 5 months ago

> @pathcl you will most likely need to apply this patch as well:
>
> mercedes-benz/garm-provider-k8s#52
>
> to build:
>
> cd garm-provider-k8s/runner/upstream
> docker build -t localhost:5000/runner-default:latest .
> docker push localhost:5000/runner-default:latest
>
> Then just apply the new image:
>
> kubectl -n garm-operator-system patch image runner-default --type=merge --patch '{"spec": { "tag": "localhost:5000/runner-default:latest"}}'
>
> And you should be fine with both JIT and registration token.

Thanks! It worked, and now I can see idle runners. However, I don't see jobs being picked up. I used runs-on: [self-hosted, Linux, kubernetes] for labels. This is my pool def:

apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: Pool
metadata:
  labels:
    app.kubernetes.io/instance: pool-sample
    app.kubernetes.io/name: pool
    app.kubernetes.io/part-of: garm-operator
  name: k8s-pool
  namespace: garm-operator-system
spec:
  githubScopeRef:
    apiGroup: garm-operator.mercedes-benz.com
    kind: Organization
    name: labs
  enabled: true
  extraSpecs: "{}"
  flavor: medium
  githubRunnerGroup: ""
  imageName: runner-default
  maxRunners: 4
  minIdleRunners: 2
  osArch: amd64
  osType: linux
  providerName: kubernetes_external # this is the name defined in your garm server
  runnerBootstrapTimeout: 20
  runnerPrefix: ""
  tags:
    - linux
    - kubernetes
---

Did you have to change anything else?

gabriel-samfira commented 5 months ago

Try targeting just linux or kubernetes (or both) as tags in your workflows. Don't target self-hosted or Linux (with a capital letter).
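
A minimal sketch of a workflow that targets only the pool's tags, assuming the pool is tagged linux and kubernetes as in the Pool definition above. The file name, workflow name and job are placeholders chosen for illustration; the snippet just writes the workflow file from a shell:

```shell
# Hypothetical workflow file; the name and the job are placeholders.
mkdir -p .github/workflows
cat > .github/workflows/garm-smoke-test.yml <<'EOF'
name: garm-smoke-test
on: workflow_dispatch
jobs:
  hello:
    # Only the pool's own tags, spelled exactly as in the Pool spec.
    runs-on: [linux, kubernetes]
    steps:
      - run: echo "hello from a GARM runner"
EOF
```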

gabriel-samfira commented 5 months ago

FYI, until you set up the webhook endpoint, GARM won't be able to autoscale. You'll still get some cleanup/min-idle-runners. But it will be only when GARM consolidates instead of reacting right away.

pathcl commented 5 months ago

> FYI, until you set up the webhook endpoint, GARM won't be able to autoscale. You'll still get some cleanup/min-idle-runners. But it will be only when GARM consolidates instead of reacting right away.

I was finally able to run a workflow! Thanks so much. Do we have docs for configuring the webhook endpoint? At this point I only see two things in my setup

pathcl commented 5 months ago

> FYI, until you set up the webhook endpoint, GARM won't be able to autoscale. You'll still get some cleanup/min-idle-runners. But it will be only when GARM consolidates instead of reacting right away.

I'm expecting these runners to be ephemeral, but it seems idle runners are not being recreated once they've been used. Shouldn't we always have some runners waiting for jobs?

gabriel-samfira commented 5 months ago

GARM doesn't know that the runner has finished running a job if webhooks don't work. They will eventually be reaped by the consolidation loop that looks in github and locally and kills used runners. Then the same consolidation loop will create missing runners based on min-idle-runners.

If you set up your webhooks, this will happen automatically, right away.
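
As a rough illustration of the min-idle-runners part, here is a simplified model of how many runners a consolidation pass would add. This is not GARM's actual code: the function name, its arguments and the cap at maxRunners are assumptions made for the sketch.

```shell
# runners_to_create MIN_IDLE MAX_RUNNERS IDLE TOTAL
# Prints how many runners one pass would add in this simplified model.
# (Hypothetical helper; not taken from the GARM source.)
runners_to_create() {
    min_idle=$1; max_runners=$2; idle=$3; total=$4
    missing=$((min_idle - idle))      # idle runners still needed
    room=$((max_runners - total))     # headroom under the pool cap
    if [ "$missing" -gt "$room" ]; then missing=$room; fi
    if [ "$missing" -lt 0 ]; then missing=0; fi
    echo "$missing"
}

# Pool from this thread: minIdleRunners=2, maxRunners=4.
runners_to_create 2 4 2 2   # finished runners are still counted as idle
runners_to_create 2 4 0 0   # after the used runners were reaped
```

With the pool from this thread, the first call prints 0 because the two finished runners are still believed to be idle, and the second prints 2 once they have been reaped.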

gabriel-samfira commented 5 months ago

There are 2 ways to set up webhooks:

in both cases, your webhook endpoint must be accessible by GitHub.

gabriel-samfira commented 5 months ago

You can access the GARM API directly by running the following steps:

Get the GARM admin password:

grep 'garm-password=' ~/garm-provider-k8s/hack/local-development/kubernetes/garm-operator-all.yaml | sed 's/.*=//g'

Exec into the garm-server pod

kubectl -n garm-server exec -it garm-server-5b84b7f66-rxxxp -- sh

Replace the pod name with your own. Then, log into the GARM server using the GARM CLI:

garm-cli profile add --name garm --password <your_garm_password> --url http://garm-server.garm-server.svc:9997/ --username admin

then you can view info about your controller, install webhooks, etc:

garm-cli controller-info show

Make sure that the Controller Webhook URL is accessible by GitHub. If you're on v0.1.4, you will need to edit the config map to set the webhook_url. You will most likely need an ingress controller and to expose that URL to the internet via a reverse proxy or port forwarding.

if your webhook url is already accessible by GitHub and your PAT allows webhook management, you can run

garm-cli org webhook install <org_id>

gabriel-samfira commented 5 months ago

There is an explanation about the URLs here: https://github.com/cloudbase/garm/blob/main/doc/using_garm.md#controller-operations

gabriel-samfira commented 5 months ago

If you're using kind, you'll most likely want to expose the service using a NodePort or LoadBalancer type. Then set up something like ngrok to create a tunnel to the node IP/port. If you're using a production k8s with a proper load balancer, once you expose the deployment, you'll most likely want to use the external IP/port as a base URL for all 3:

* callback_url - needs to be accessible by runners (regardless of provider)

* metadata_url - needs to be accessible by runners (regardless of provider)

* webhook_url - needs to be accessible by GitHub

This will allow you to use the same GARM instance with multiple providers like Azure, GCP, OpenStack, OCI, etc.
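
A small sketch that derives the three settings (callback_url, metadata_url, webhook_url) from one base URL. The /webhooks path matches the webhook_url shown earlier in this thread; the /api/v1/callbacks and /api/v1/metadata paths and the garm.example.com host are assumptions based on GARM's sample config, so check them against the docs for your version. garm_urls is a hypothetical helper name.

```shell
# garm_urls BASE_URL
# Prints config lines for the three URLs, all built on one base URL.
# (Hypothetical helper; the callback/metadata paths are assumptions.)
garm_urls() {
    base="${1%/}"   # strip one trailing slash, if present
    printf 'callback_url = "%s/api/v1/callbacks"\n' "$base"
    printf 'metadata_url = "%s/api/v1/metadata"\n' "$base"
    printf 'webhook_url = "%s/webhooks"\n' "$base"
}

# Placeholder host; replace with your external IP/port or tunnel URL.
garm_urls "https://garm.example.com/"
```

The output can be pasted into the GARM config (or the config map used in the local development setup) once the base URL is reachable by both the runners and GitHub.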

pathcl commented 5 months ago

> If you're using kind, you'll most likely want to expose the service using a NodePort or LoadBalancer type. Then set up something like ngrok to create a tunnel to the node IP/port. If you're using a production k8s with a proper load balancer, once you expose the deployment, you'll most likely want to use the external IP/port as a base URL for all 3:
>
> * callback_url - needs to be accessible by runners (regardless of provider)
>
> * metadata_url - needs to be accessible by runners (regardless of provider)
>
> * webhook_url - needs to be accessible by GitHub
>
> This will allow you to use the same GARM instance with multiple providers like Azure, GCP, OpenStack, OCI, etc.

Thanks for the detailed explanation! By any chance, have you tried the garm k8s operator with a runner image in a private registry? I'm trying to figure out whether the Image CRD needs imagePullSecrets.

gabriel-samfira commented 5 months ago

I have not tried, but I see there is an issue open here:

https://github.com/mercedes-benz/garm-provider-k8s/issues/6

You might try to add a comment there with your use case.

bavarianbidi commented 4 months ago

sorry, didn't follow the entire conversation here 🙈

@pathcl are there any other questions open regarding the garm-operator in combination with garm?