cloudbase / garm-provider-gcp

Garm external provider for GCP
https://github.com/cloudbase/garm-provider-gcp
Apache License 2.0

What are permissions needed by provider in GCP / ability to use default application credentials #16

Closed gustaff-weldon closed 3 months ago

gustaff-weldon commented 4 months ago

Thanks for your work on GARM. I'm setting GARM up with the intention to use it with GCP compute runners.

The provider documentation is missing information about what permissions (or roles) are necessary for the provider to work.

Also, it currently states that a service account JSON key is needed. Can we use Application Default Credentials or Workload Identity Federation instead? Our organization policy disallows creating service account keys.

gabriel-samfira commented 3 months ago

Hi @gustaff-weldon

The provider needs to be able to create VMs in your GCP account. For testing purposes, I used roles/compute.admin, but any set of roles that allows you to create an instance should do.

In terms of authentication to the API, we should be able to use whatever the Go SDK permits. We went with service account keys for now, but other methods can be added. Do you plan to run GARM in GCP? I believe there is something similar to "managed identity" in Azure or "IAM roles" in EC2. I need to look it up.

gabriel-samfira commented 3 months ago

Ahh, so FindDefaultCredentials should do the trick. I will test it out and push a PR when it's done. Might take a while, though.

gustaff-weldon commented 3 months ago

hi @gabriel-samfira

to create VMs in your GCP account

OK, so it only creates VMs; no instance templates, node pools, etc. are necessary, I assume. Thanks!

Ahh, so FindDefaultCredentials should do the trick.

Yes, it would be great if GARM could use Application Default Credentials first and fall back to the SA JSON file. Thanks for having a look at this.

Do you plan to run GARM in GCP?

Yes, atm I'm evaluating and trying to set it up locally with ngrok, but ultimately I will either start GARM on Cloud Run, or maybe give the GKE provider a go and run it in k8s (we use Workload Identity there).

I will test it out and push a PR when it's done. Might take a while, though.

Thank you!

Quick question: am I right that an instance is spun up on the job-requested webhook and terminated when the job completes? (So runners are single-use, and do you pass --ephemeral to the GitHub client?)

gabriel-samfira commented 3 months ago

Quick question: am I right that an instance is spun up on the job-requested webhook and terminated when the job completes? (So runners are single-use, and do you pass --ephemeral to the GitHub client?)

Yes. GARM only ever spins up ephemeral runners. Persistent runners are not advised unless you absolutely trust the authors of a PR. Even then, they should be treated as adversarial, as their systems may be compromised without their knowledge. So we prefer to keep our tin foil hats on for this one and only support ephemeral runners.

You have the option to keep a "warmed up" pool of runners by setting --min-idle-runners to a value larger than zero, in which case GARM will attempt to always keep that number of runners in an idle state, unless you have reached your max-runners limit. Otherwise, GARM will spin up new runners on demand as workflow jobs come in with a queued state. Once a job transitions to completed, for whatever reason, the runner that was assigned to it is removed both from GARM and from the provider which ran the instance of that runner.

gustaff-weldon commented 3 months ago

Yes. GARM only ever spins up ephemeral runners.

Perfect, we prefer to use ephemeral runners as a security and cost-cutting measure. Glad to learn about the pre-warmed pool option; I will have a look if spinning up runners proves to be slow on GCP (I hope not).

gabriel-samfira commented 3 months ago

In my experience, if you have jobs that take a long time to run, an extra 5 minutes for the runner to spin up doesn't make much of a difference. A warm pool of idle runners only really makes sense if you have jobs that run quickly, or if keeping runners online incurs no extra cost, as with the LXD/Incus providers or the k8s provider: the LXD/Incus server or the k8s cluster is already running and incurring cost anyway. In that case, you can potentially keep min-idle-runners equal to max-runners and the cost would be the same. But this is something that each team can experiment with and decide on the best course of action.

Have a look at the "using GARM" guide. Keep in mind that the docs in main apply to the code in main; for stable releases, you need to switch to the proper tag to view the docs.

gustaff-weldon commented 3 months ago

We will definitely test.

gabriel-samfira commented 3 months ago

Just merged the default credentials PR. Give it a shot. It worked well in my tests, after I created a VM and gave it access to a service account.

gabriel-samfira commented 3 months ago

To use default credentials, leave the credentials_file option empty. If you want GARM to pass specific environment variables to the provider, edit the GARM config file and, in the provider section, add:

environment_variables = ["GOOGLE_APPLICATION_CREDENTIALS", "SOME_OTHER_VAR_YOU_WANT_TO_PASS"]

If you're running on GCP, you don't need to pass any variables. The VM will get its creds from metadata.

I updated the README. Give that a look.

gustaff-weldon commented 3 months ago

Thanks, atm I'm still starting this up the old way. Once I get it working I'm definitely going to try default credentials.

Is the provider invoked by the GARM manager on a per-request basis, meaning it could technically be a Python script that calls the gcloud SDK/CLI underneath and then exits?

I'm asking because, before granting permissions to the provider's service account, I started GARM with the GCP provider and got some errors that made me think the provider is kept running all the time:

time=2024-06-14T11:20:01.415Z level=ERROR msg="failed to delete instance from provider" error="provider binary /garm-gcp/garm-provider-gcp returned error: provider binary failed with stdout: ; stderr: panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x748be4]

goroutine 1 [running]:
cloud.google.com/go/compute/apiv1.(*Operation).Wait(0x0, {0xbc9578, 0x400003c380}, {0x0, 0x0, 0x0})
\t/garm-provider/vendor/cloud.google.com/go/compute/apiv1/operations.go:54 +0x74
github.com/cloudbase/garm-provider-gcp/internal/client.(*GcpCli).DeleteInstance(0x40001226f0, {0xbc9578, 0x400003c380}, {0x4000042011?, 0x40006155f8?})
\t/garm-provider/internal/client/gcp.go:217 +0x184
github.com/cloudbase/garm-provider-gcp/provider.(*GcpProvider).DeleteInstance(0x40006155f8?, {0xbc9578?, 0x400003c380?}, {0x4000042011?, 0x20?})
\t/garm-provider/provider/provider.go:86 +0x2c
github.com/cloudbase/garm-provider-common/execution.Run({_, _}, {_, _}, {{0x400003e00d, 0xe}, {0x4000040013, 0x24}, {0x0, 0x0}, ...})
\t/garm-provider/vendor/github.com/cloudbase/garm-provider-common/execution/execution.go:185 +0x67c
main.main()
\t/garm-provider/main.go:50 +0x1d0
: exit status 2
removing instance
github.com/cloudbase/garm/runner/pool.(*basePoolManager).deleteInstanceFromProvider
\t/build/garm/runner/pool/pool.go:1412
github.com/cloudbase/garm/runner/pool.(*basePoolManager).retryFailedInstancesForOnePool.func1
\t/build/garm/runner/pool/pool.go:1280
golang.org/x/sync/errgroup.(*Group).Go.func1
\t/build/garm/vendor/golang.org/x/sync/errgroup/errgroup.go:75
runtime.goexit
\t/usr/local/go/src/runtime/asm_arm64.s:1222" runner_name=garm-GBBk3KHIgjlg
gustaff-weldon commented 3 months ago

@gabriel-samfira I'm also getting errors after adding permissions (I used roles/compute.admin). I'm still using the service account JSON for testing atm.

It looks like the provider tries to create the instance (which does not work, as I do not see any in the Cloud Console), but I do not see any errors in the GARM output related to creating one:

time=2024-06-14T11:38:23.667Z level=INFO msg="creating instance in pool" runner_name=garm-FW3mfjIRAst1 pool_id=7f12713c-28ff-4bf5-ba2a-ee87c61f9f9d pool_mgr=mycompany/myrepo pool_type=repository

is immediately followed by:
time=2024-06-14T11:38:24.672Z level=ERROR msg="failed to cleanup instance" error="provider binary /garm-gcp/garm-provider-gcp returned error: provider binary failed with stdout: ; stderr: panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x748be4]

goroutine 1 [running]:
cloud.google.com/go/compute/apiv1.(*Operation).Wait(0x0, {0xbc9578, 0x4000276880}, {0x0, 0x0, 0x0})
\t/garm-provider/vendor/cloud.google.com/go/compute/apiv1/operations.go:54 +0x74
github.com/cloudbase/garm-provider-gcp/internal/client.(*GcpCli).DeleteInstance(0x400000c300, {0xbc9578, 0x4000276880}, {0x4000042011?, 0x400063f5f8?})
\t/garm-provider/internal/client/gcp.go:217 +0x184
github.com/cloudbase/garm-provider-gcp/provider.(*GcpProvider).DeleteInstance(0x400063f5f8?, {0xbc9578?, 0x4000276880?}, {0x4000042011?, 0x20?})
\t/garm-provider/provider/provider.go:86 +0x2c
github.com/cloudbase/garm-provider-common/execution.Run({_, _}, {_, _}, {{0x400003e00d, 0xe}, {0x4000040013, 0x24}, {0x0, 0x0}, ...})
\t/garm-provider/vendor/github.com/cloudbase/garm-provider-common/execution/execution.go:185 +0x67c
main.main()
\t/garm-provider/main.go:50 +0x1d0
: exit status 2" provider_id=garm-FW3mfjIRAst1
time=2024-06-14T11:38:24.672Z level=ERROR msg="failed to add instance to provider" error="provider binary /garm-gcp/garm-provider-gcp returned error: provider binary failed with stdout: ; stderr: failed to run command: failed to create instance in provider: error getting instance: failed to get instance: googleapi: Error 404: The resource 'projects/prj-zen-c-ci-qztw/zones/europe-west4-a/instances/garm-fw3mfjirast1' was not found
: exit status 1
creating instance
github.com/cloudbase/garm/runner/pool.(*basePoolManager).addInstanceToProvider
\t/build/garm/runner/pool/pool.go:930
github.com/cloudbase/garm/runner/pool.(*basePoolManager).addPendingInstances.func1
\t/build/garm/runner/pool/pool.go:1546
runtime.goexit
\t/usr/local/go/src/runtime/asm_arm64.s:1222" runner_name=garm-FW3mfjIRAst1
time=2024-06-14T11:38:24.682Z level=ERROR msg="failed to create instance in provider" error="provider binary /garm-gcp/garm-provider-gcp returned error: provider binary failed with stdout: ; stderr: failed to run command: failed to create instance in provider: error getting instance: failed to get instance: googleapi: Error 404: The resource 'projects/prj-zen-c-ci-qztw/zones/europe-west4-a/instances/garm-fw3mfjirast1' was not found
: exit status 1
creating instance
github.com/cloudbase/garm/runner/pool.(*basePoolManager).addInstanceToProvider
\t/build/garm/runner/pool/pool.go:930
github.com/cloudbase/garm/runner/pool.(*basePoolManager).addPendingInstances.func1
\t/build/garm/runner/pool/pool.go:1546
runtime.goexit
\t/usr/local/go/src/runtime/asm_arm64.s:1222" runner_name=garm-FW3mfjIRAst1
time=2024-06-14T11:38:28.984Z level=ERROR msg="failed to delete instance from provider" error="provider binary /garm-gcp/garm-provider-gcp returned error: provider binary failed with stdout: ; stderr: panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x748be4]

goroutine 1 [running]:
cloud.google.com/go/compute/apiv1.(*Operation).Wait(0x0, {0xbc9578, 0x400017c480}, {0x0, 0x0, 0x0})
\t/garm-provider/vendor/cloud.google.com/go/compute/apiv1/operations.go:54 +0x74
github.com/cloudbase/garm-provider-gcp/internal/client.(*GcpCli).DeleteInstance(0x4000686348, {0xbc9578, 0x400017c480}, {0x4000042011?, 0x40005bf5f8?})
\t/garm-provider/internal/client/gcp.go:217 +0x184
github.com/cloudbase/garm-provider-gcp/provider.(*GcpProvider).DeleteInstance(0x40005bf5f8?, {0xbc9578?, 0x400017c480?}, {0x4000042011?, 0x20?})
\t/garm-provider/provider/provider.go:86 +0x2c
github.com/cloudbase/garm-provider-common/execution.Run({_, _}, {_, _}, {{0x400003e00d, 0xe}, {0x4000040013, 0x24}, {0x0, 0x0}, ...})
\t/garm-provider/vendor/github.com/cloudbase/garm-provider-common/execution/execution.go:185 +0x67c
main.main()
\t/garm-provider/main.go:50 +0x1d0
: exit status 2
removing instance
github.com/cloudbase/garm/runner/pool.(*basePoolManager).deleteInstanceFromProvider
\t/build/garm/runner/pool/pool.go:1412
github.com/cloudbase/garm/runner/pool.(*basePoolManager).retryFailedInstancesForOnePool.func1
\t/build/garm/runner/pool/pool.go:1280
golang.org/x/sync/errgroup.(*Group).Go.func1
\t/build/garm/vendor/golang.org/x/sync/errgroup/errgroup.go:75
runtime.goexit

It looks to me like something goes wrong when adding the instance (which I cannot see from the logs); then the provider tries to get info about the instance, gets a 404, then tries to clean up and fails (as the instance does not exist).

It might be worth reporting a separate issue; happy to do that.

gabriel-samfira commented 3 months ago

The error you're getting is a bug. It should not die like that because of a 404. I have a PR here: https://github.com/cloudbase/garm-provider-gcp/pull/19 which I will merge as soon as the tests finish.

Is the provider invoked by the GARM manager on a per-request basis, meaning it could technically be a Python script that calls the gcloud SDK/CLI underneath and then exits?

The provider is just an executable that gets exec-ed with some environment variables set and, in the case of CreateInstance, some stdin.

See: https://github.com/cloudbase/garm/blob/main/doc/external_provider.md

It might be a bit outdated and sparse on info, but essentially the provider can be anything as long as it's an executable that respects the external provider interface, and you point GARM at it. It can be bash, Python, etc. It doesn't matter.

The providers written in Go all use this common scaffolding: https://github.com/cloudbase/garm-provider-common/blob/d0fe67934a5bcb773503553555274080ba60a852/execution/execution.go#L150-L204

This is the interface that external providers need to implement:

https://github.com/cloudbase/garm-provider-common/blob/d0fe67934a5bcb773503553555274080ba60a852/execution/interface.go#L26-L41

This is where GARM executes the provider:

https://github.com/cloudbase/garm/blob/4c7c9b0e1e4f7a62cd899d26786416c9618f08c3/runner/providers/external/external.go#L70

gabriel-samfira commented 3 months ago

When GARM fails to create an instance, the DeleteInstance command is executed to attempt a cleanup. In the above error, that cleanup failed with a nil pointer dereference. The PR that merged should take care of that error.

gabriel-samfira commented 3 months ago

Try building the latest main branch. I suspect that the real error is masked by the nil pointer bug. After you rebuild main, if it fails again, do a:

garm-cli runner show <runner name>

If the runner is in error state, you should see the provider error there.

gustaff-weldon commented 3 months ago

When GARM fails to create an instance, the DeleteInstance command is executed to attempt a cleanup.

Yeah, I figured as much. I'm trying to find out why it failed to create an instance; I cannot see anything useful in the GARM logs after this line:

time=2024-06-14T12:37:36.661Z level=INFO msg="creating instance in pool" runner_name=garm-AG8Emk3MrwFs pool_id=27d1e91d-d695-4364-94e1-199272c90996 pool_mgr=redactedorg/redactedrepo pool_type=repository

From the logs, it looks like it fails to create an instance because it cannot find it afterwards:

time=2024-06-14T12:37:37.517Z level=ERROR msg="failed to add instance to provider" error="provider binary /garm-gcp/garm-provider-gcp returned error: provider binary failed with stdout: ; stderr: failed to run command: failed to create instance in provider: error getting instance: failed to get instance: googleapi: Error 404: The resource 'projects/prj-redacted/zones/europe-west4-a/instances/garm-ag8emk3mrwfs' was not found\n: exit status 1\ncreating instance\ngithub.com/cloudbase/garm/runner/pool.(*basePoolManager).addInstanceToProvider\n\t/build/garm/runner/pool/pool.go:930\ngithub.com/cloudbase/garm/runner/pool.(*basePoolManager).addPendingInstances.func1\n\t/build/garm/runner/pool/pool.go:1546\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1222" runner_name=garm-AG8Emk3MrwFs
time=2024-06-14T12:37:37.528Z level=ERROR msg="failed to create instance in provider" error="provider binary /garm-gcp/garm-provider-gcp returned error: provider binary failed with stdout: ; stderr: failed to run command: failed to create instance in provider: error getting instance: failed to get instance: googleapi: Error 404: The resource 'projects/prj-redacted/zones/europe-west4-a/instances/garm-ag8emk3mrwfs' was not found\n: exit status 1\ncreating instance\ngithub.com/cloudbase/garm/runner/pool.(*basePoolManager).addInstanceToProvider\n\t/build/garm/runner/pool/pool.go:930\ngithub.com/cloudbase/garm/runner/pool.(*basePoolManager).addPendingInstances.func1\n\t/build/garm/runner/pool/pool.go:1546\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1222" runner_name=garm-AG8Emk3MrwFs

try building the latest main branch. I suspect that the real error is masked by the nil pointer bug.

I'm using the latest main. I can see the runner in the list as pending; the error in show is pretty much the same as above:

/ # garm-cli runner list -a
+----+-------------------+--------+---------------+--------------------------------------+
| NR | NAME              | STATUS | RUNNER STATUS | POOL ID                              |
+----+-------------------+--------+---------------+--------------------------------------+
|  1 | garm-AG8Emk3MrwFs | error  | pending       | 27d1e91d-d695-4364-94e1-199272c90996 |
+----+-------------------+--------+---------------+--------------------------------------+
/ # garm-cli runner show garm-AG8Emk3MrwFs
+-----------------+------------------------------------------------------------------------------------------------------+
| FIELD           | VALUE                                                                                                |
+-----------------+------------------------------------------------------------------------------------------------------+
| ID              | 8cdb36e8-996a-4e9c-a3d4-10d0281ee10a                                                                 |
| Provider ID     |                                                                                                      |
| Name            | garm-AG8Emk3MrwFs                                                                                    |
| OS Type         | linux                                                                                                |
| OS Architecture | amd64                                                                                                |
| OS Name         |                                                                                                      |
| OS Version      |                                                                                                      |
| Status          | error                                                                                                |
| Runner Status   | pending                                                                                              |
| Pool ID         | 27d1e91d-d695-4364-94e1-199272c90996                                                                 |
| Provider Fault  | creating instance: provider binary /garm-gcp/garm-provider-gcp returned error: provider binary faile |
|                 | d with stdout: ; stderr: failed to run command: failed to create instance in provider: error getting |
|                 |  instance: failed to get instance: googleapi: Error 404: The resource 'projects/prj-redacted/zo |
|                 | nes/europe-west4-a/instances/garm-ag8emk3mrwfs' was not found                                        |
|                 | : exit status 1                                                                                      |
+-----------------+------------------------------------------------------------------------------------------------------+

gustaff-weldon commented 3 months ago

My configs:

[[provider]]

name = "gcp_external"
description = "external gcp provider"
provider_type = "external"

  [provider.external]
  # config file passed to the executable via GARM_PROVIDER_CONFIG_FILE environment variable
  config_file = "/etc/garm/providers.d/gcp/config.toml"

  # Absolute path to an executable that implements the provider logic. This executable can be
  # anything (bash, a binary, python, etc). See documentation in this repo on how to write an
  # external provider.
  provider_executable = "/garm-gcp/garm-provider-gcp"

and provider:

project_id = "prj-redacted"
zone = "europe-west4-a"
network_id = "projects/prj-redacted/global/networks/vpc-c-shared"
subnetwork_id = "projects/prj-redacted/regions/europe-west4/subnetworks/sb-c-shared-europe-west4"
credentials_file = "/garm-gcp/gcp-provider-sa-key.json"
external_ip_access = true

I really appreciate your help. I like the idea behind GARM and would love to make it work. I have already tested creating an instance via the UI; I will try via the CLI, impersonating the service account the GARM provider uses, and see if it gives me any errors.

gabriel-samfira commented 3 months ago

Let's try the service account way:

Create a new service account:

gcloud iam service-accounts create garm-vm

Grant the needed roles:

gcloud projects add-iam-policy-binding prj-redacted \
    --member="serviceAccount:garm-vm@prj-redacted.iam.gserviceaccount.com" \
    --role=roles/compute.instanceAdmin.v1

gcloud projects add-iam-policy-binding prj-redacted \
    --member="serviceAccount:garm-vm@prj-redacted.iam.gserviceaccount.com" \
    --role=roles/iam.serviceAccountUser

gcloud projects add-iam-policy-binding prj-redacted \
    --member="serviceAccount:garm-vm@prj-redacted.iam.gserviceaccount.com" \
    --role=roles/iam.serviceAccountTokenCreator

gcloud iam service-accounts add-iam-policy-binding garm-vm@prj-redacted.iam.gserviceaccount.com  \
    --member="user:yourGCPUser@example.com" \
    --role=roles/iam.serviceAccountUser

Create a VM in GCP using this service account:

gcloud compute instances create garm-vm \
    --service-account=garm-vm@prj-redacted.iam.gserviceaccount.com \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --image=ubuntu-pro-2404-noble-amd64-v20240607 \
    --image-project=ubuntu-os-pro-cloud \
    --zone=europe-west1-c \
    --machine-type=e2-small

That VM should now have access to the project you're using and the GCP provider should work with a pool like:

garm-cli pool add --repo REPO_ID \
    --enabled true \
    --provider-name=gcp \
    --flavor=e2-small \
    --image=projects/debian-cloud/global/images/debian-11-bullseye-v20240110 \
    --min-idle-runners 1 --tags gcp,linux

gabriel-samfira commented 3 months ago

Apropos, if it's easier, you can also find me on slack.

gabriel-samfira commented 3 months ago

Gah, found it. The error was indeed masked in CreateInstance as well. Really sorry about the headache.

gabriel-samfira commented 3 months ago

See: https://github.com/cloudbase/garm-provider-gcp/pull/20

gustaff-weldon commented 3 months ago

Apropos, if it's easier, you can also find me on slack.

I might do that as well. I think I found what might be going wrong.

> gcloud compute instances create garm-vm \
    --project prj-redacted \
    --service-account=sa-garm-vm@prj-redacted.iam.gserviceaccount.com \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --image=ubuntu-pro-2404-noble-amd64-v20240607 \
    --image-project=ubuntu-os-pro-cloud \
    --zone=europe-west4-a \
    --network=projects/prj-redacted/global/networks/vpc-c-shared \
    --subnet=projects/prj-redacted/regions/europe-west4/subnetworks/sb-c-shared-europe-west4 \
    --machine-type=e2-small

ERROR: (gcloud.compute.instances.create) Could not fetch resource:
 - Constraint constraints/compute.vmExternalIpAccess violated for project 123345678989. Add instance projects/prj-redacted/zones/europe-west4-a/instances/garm-vm to the constraint to use external IP with it.

The instance cannot be created with an external IP address; apparently we have a policy that restricts this. I assumed a public IP was necessary for GitHub to be able to access the runners.

I will look into sorting out the constraint, but the bigger question is why the GARM logs did not show this while trying to create the actual instance.

gabriel-samfira commented 3 months ago

You can set external_ip_access to false in your provider config if that is the case. A public IP for the runner is not needed at all; it's more for debugging purposes, if you need to access the runner VM.

GARM itself needs to be accessible by GitHub and the runners, so either a public IP or ngrok will work.

gustaff-weldon commented 3 months ago

See: https://github.com/cloudbase/garm-provider-gcp/pull/20

@gabriel-samfira thanks for the above fix. It helped me uncover some other permission issues when creating the instance. Finally, I got GARM to spawn instances in our GCP project.

You can set external_ip_access to false in your provider config if that is the case. The public IP for the runner is not needed at all. It's more for debugging purposes if you need to access the runner VM.

I have allowed public ip (at least for now), as I suspect I might need some debugging access.

Atm, the runners are spawned:

Screenshot 2024-06-14 at 16 17 48

But they are registered as offline and do not pick up jobs:

Screenshot 2024-06-14 at 16 17 32

eg:

Requested labels: self-hosted, garm-e2-medium
Job defined at: orgredacted/reporedacted/.github/workflows/ci.yml@refs/heads/pla-2228-test-github-actions-runner-manager-garm
Waiting for a runner to pick up this job...

I have followed the quick start and my pool looks like this:

+--------------------------+--------------------------------------------------------------------+
| FIELD                    | VALUE                                                              |
+--------------------------+--------------------------------------------------------------------+
| ID                       | 27d1e91d-d695-4364-94e1-199272c90996                               |
| Provider Name            | gcp_external                                                       |
| Image                    | projects/ubuntu-os-cloud/global/images/ubuntu-2204-jammy-v20240614 |
| Flavor                   | e2-medium                                                          |
| OS Type                  | linux                                                              |
| OS Architecture          | amd64                                                              |
| Max Runners              | 5                                                                  |
| Min Idle Runners         | 0                                                                  |
| Runner Bootstrap Timeout | 20                                                                 |
| Tags                     | garm-e2-medium, Linux, self-hosted, x64                            |
| Belongs to               | redactedorg/redactedrepo                                           |
| Level                    | repo                                                               |
| Enabled                  | true                                                               |
| Runner Prefix            | garm                                                               |
| Extra specs              |                                                                    |
| GitHub Runner Group      |                                                                    |
| Instances                | garm-pZBcvM9HuaO3 (4089782a-6bf2-4234-96f7-530d69e62b49)           |
|                          | garm-AilCUrjiJynR (21452d8c-bed3-437d-ad39-296afd2b71bb)           |
+--------------------------+--------------------------------------------------------------------+

All runners show up as pending:

/ # garm-cli runner list -a
+----+-------------------+---------+---------------+--------------------------------------+
| NR | NAME              | STATUS  | RUNNER STATUS | POOL ID                              |
+----+-------------------+---------+---------------+--------------------------------------+
|  1 | garm-pZBcvM9HuaO3 | running | pending       | 27d1e91d-d695-4364-94e1-199272c90996 |
+----+-------------------+---------+---------------+--------------------------------------+
|  2 | garm-AilCUrjiJynR | running | pending       | 27d1e91d-d695-4364-94e1-199272c90996 |
+----+-------------------+---------+---------------+--------------------------------------+
|  3 | garm-Ginn45Wxi6YO | running | pending       | 27d1e91d-d695-4364-94e1-199272c90996 |
+----+-------------------+---------+---------------+--------------------------------------+
|  4 | garm-nPpAVEmdHX90 | running | pending       | 27d1e91d-d695-4364-94e1-199272c90996 |
+----+-------------------+---------+---------------+--------------------------------------+
/ #

And they do not show any errors via the CLI:

/ # garm-cli runner show garm-nPpAVEmdHX90
+-----------------+--------------------------------------+
| FIELD           | VALUE                                |
+-----------------+--------------------------------------+
| ID              | 7d6f737e-466c-410d-b20c-3236f572c44b |
| Provider ID     | garm-nppavemdhx90                    |
| Name            | garm-nPpAVEmdHX90                    |
| OS Type         | linux                                |
| OS Architecture | amd64                                |
| OS Name         |                                      |
| OS Version      |                                      |
| Status          | running                              |
| Runner Status   | pending                              |
| Pool ID         | 27d1e91d-d695-4364-94e1-199272c90996 |
+-----------------+--------------------------------------+

I will try SSH-ing into a runner to see if I can peek into the GitHub client logs, but tips would be welcome.

gabriel-samfira commented 3 months ago

The runners you see in GitHub are offline because GARM uses JIT runners. This means that GARM creates them in GitHub beforehand and saves their credentials. Those credentials are transferred to the instances that become the actual runners.

In most cases, instances never transition from pending to installing and then idle because the VMs/servers can't reach the GARM metadata_url and callback_url. Once they can, they are able to fetch their credentials, configure themselves, and come online.