Was garm restarted by any chance while the runners were being bootstrapped?
Running
lxc list | grep garm- | wc -l
counts 82 containers
This is concerning. Garm should clean up any runners from the provider if they are no longer on github. And should keep trying if it fails. I will investigate. Would it be possible to provide the full log (anonymized)? You could send it via email if you prefer. I would like to see what happened.
Would it be possible to provide the full log (anonymized)?
The full Garm log or the full syslog from the container or both?
Was garm restarted by any chance while the runners were being bootstrapped?
I don’t think it was, let me check if I can find something.
The full Garm log or the full syslog from the container or both?
Full garm log. Sometimes I forget to use my outer voice :smile: . There should not be any sensitive info in the log, but if you spot anything, feel free to redact.
2023/06/02 08:10:05 Runner instance for garm-xKyVF4h7UsVp is no longer on the provider, removing from github
This line is also interesting. The garm instance only shows up on github as a result of the runner actually starting up and running the self-hosted runner app. We can see that in the console log.
The line I quoted denotes that garm could no longer find the actual VM/container in LXD. It saw it on github, looked for it in the provider, and the provider (LXD in this case) returned a result indicating the VM/container didn't exist. Which it absolutely should. Will look at the code.
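For context, here is a minimal Go sketch of the decision being described (hypothetical names, not garm's actual code): a runner that GitHub still lists should only be deregistered when the provider explicitly reports the instance as missing, which is the branch that would emit the quoted log line.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInstanceNotFound stands in for the error a provider would return when
// the VM/container genuinely does not exist. All names here are hypothetical.
var ErrInstanceNotFound = errors.New("instance not found")

// getInstance fakes the LXD provider lookup; it pretends the container is
// gone so the example exercises the "remove from github" branch.
func getInstance(name string) error {
	return ErrInstanceNotFound
}

// checkRunner mirrors the decision described above: a runner that GitHub
// still reports, but that the provider says does not exist, gets removed
// from GitHub. Any other provider error should be treated as transient and
// must not deregister a healthy runner.
func checkRunner(name string) {
	err := getInstance(name)
	switch {
	case errors.Is(err, ErrInstanceNotFound):
		fmt.Printf("Runner instance for %s is no longer on the provider, removing from github\n", name)
	case err != nil:
		fmt.Printf("provider lookup for %s failed (%v); leaving the runner alone\n", name, err)
	default:
		// The instance exists in the provider; nothing to do.
	}
}

func main() {
	checkRunner("garm-xKyVF4h7UsVp")
}
```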
If you build garm using:
make build-static
Would you mind also running:
garm -version
We currently build only via go install. The repo checkout is at this commit:
commit 702937f63602c6691e977857fa64dcde25551de0 (HEAD -> main, origin/main, origin/HEAD)
Author: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Date: Thu Mar 30 09:00:46 2023 +0000
Add github runner group in pool show
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
I can try a static build and garm -version later, if that would help.
You could send it via email if you prefer.
Sent by mail, thanks :)
The line I quoted denotes that garm could no longer find the actual VM/container in LXD. It saw it on github, looked for it in the provider and the provider (lxd in this case) returned a result that indicated the VM/containerd didn't exist
I see in the code that there's a cache for the instances used. Could that be a problem somehow? Like an outdated cache, a concurrency issue or something like that?
The VM/container was created almost 5 minutes prior to that check. The cache is created every time cleanupOrphanedGithubRunners() is run, so that cache was created almost 5 minutes after the instance was created and should include all instances. Will check. The fact that you have a number of leftover VM/containers is also strange.
What version of LXD are you using? I want to have as much info as possible to try to reproduce this.
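To make the timing argument above concrete, here is a minimal Go sketch (hypothetical names, not garm's actual code) of a per-run instance cache: each cleanup pass lists the provider's instances fresh and keys them by name, so a container created minutes earlier should already be present.

```go
package main

import "fmt"

// Instance and Lister are hypothetical stand-ins, not garm's real types.
type Instance struct{ Name string }

type Lister interface {
	ListInstances() ([]Instance, error)
}

// fakeLXD mimics what the provider (and ultimately `lxc list`) would return.
type fakeLXD struct{ instances []Instance }

func (f fakeLXD) ListInstances() ([]Instance, error) { return f.instances, nil }

// buildInstanceCache is called at the start of every cleanup run, so an
// instance created ~5 minutes earlier should already be in the returned map.
func buildInstanceCache(l Lister) (map[string]Instance, error) {
	instances, err := l.ListInstances()
	if err != nil {
		return nil, fmt.Errorf("listing provider instances: %w", err)
	}
	cache := make(map[string]Instance, len(instances))
	for _, inst := range instances {
		cache[inst.Name] = inst // keyed by runner name, e.g. garm-xKyVF4h7UsVp
	}
	return cache, nil
}

func main() {
	cache, err := buildInstanceCache(fakeLXD{instances: []Instance{{Name: "garm-xKyVF4h7UsVp"}}})
	if err != nil {
		panic(err)
	}
	_, ok := cache["garm-xKyVF4h7UsVp"]
	fmt.Println("instance present in cache:", ok) // prints: true
}
```

If the listing were incomplete or stale, a healthy container would look "missing" and its runner would be removed from GitHub while the container keeps running, which matches the symptom reported above.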
I think I may have found the issue (embarrassing as it is):
https://github.com/cloudbase/garm/pull/92
Would you mind pulling the latest commit and running a:
make build-static
Running on Ubuntu 22.04 with LXD installed via snap (5.14), with a ZFS pool and shiftfs enabled. Runners have a profile with 2 CPUs, 7 GB RAM and 15 GB disk (as with the default GitHub Actions runners).
Pool config:
| ID | f52a63ee-7af9-4a19-b7f0-4f1f67de1579 |
| Provider Name | lxd_local |
| Image | gh-ubuntu22-20230507 |
| Flavor | runner |
| OS Type | linux |
| OS Architecture | amd64 |
| Max Runners | 20 |
| Min Idle Runners | |
| Runner Bootstrap Timeout | |
| Tags | self-hosted, x64, Linux, ubuntu-latest, ubuntu-22.04 |
| Belongs to | ......... |
| Level | org |
| Enabled | true |
| Runner Prefix | garm |
| Extra specs | |
| GitHub Runner Group | |
Would you mind pulling the latest commit and running a:
make build-static
Sure, thanks again!
Out of curiosity: any reason to prefer make build-static vs. just using go install?
Using make build-static will build against musl on alpine. That binary does not depend on glibc and is fully static. It can run on any Linux system, regardless of glibc version. We can't really compile a fully static binary against glibc if gethostbyname() is involved, and that can lead to segmentation faults if we build on a newer version of glibc and try to run the binary on a system with an older version that may be ABI incompatible.
Normally this is not an issue. In most cases you don't have to run your binary on an ancient version of Linux, but there are some environments out there that still run CentOS 6/7.
I see, thanks for the explanation. Right now we do not have docker installed on the host machine; that's why we did not use make build-static in the first place. I'll put this on my todo list.
Podman also works (in case you don't want a running daemon). You can build it on any machine. You can of course also use go install, or:
go build -mod vendor \
-o garm -tags osusergo,netgo,sqlite_omit_load_extension \
-ldflags "-s -w -X main.Version=$(git describe --always --dirty)" \
./cmd/garm
I removed the link mode external bit, as you don't need it, but the rest will give you a smaller binary and also add the version to the binary.
So, containers deleted, garm updated, let's see what happens :)
Ohh. It would have been nice to leave the containers and see if garm cleans them up 😄. No worries. Hopefully this bug is now gone.
Oh, I thought this would not work 😅
Barring any silly bugs in the provider (like the one just fixed), it should do its best to clean up both in github and in the provider.
If you manually delete a runner from lxd, after about 5 minutes garm runs a cleanup function that detects orphaned runners and removes them from github. The same is true if you manually remove a runner from github.
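To illustrate the cadence described above, here is a minimal Go sketch of a periodic reconciliation loop (the interval and names are illustrative, not garm's actual identifiers).

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runCleanupLoop sketches the behaviour described above: roughly every five
// minutes, reconcile GitHub's view with the provider's view and keep trying
// on failure rather than giving up.
func runCleanupLoop(ctx context.Context, cleanup func() error) {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := cleanup(); err != nil {
				// Log and retry on the next tick.
				fmt.Println("cleanup failed:", err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go runCleanupLoop(ctx, func() error {
		// Reconcile both directions: runners on GitHub with no backing
		// instance, and instances in the provider with no runner on GitHub.
		return nil
	})
	time.Sleep(100 * time.Millisecond) // stand-in for the service's lifetime
}
```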
Looking good after one day:
# lxc list | grep garm- | wc -l
20
Thanks again!
Awesome! Feel free to open a new issue if you spot any weirdness.
We are using LXD as a provider with garm and have max runners set to 20. After a few days we notice the following:
garm-cli pool show ...
lists 20 runner instances
lxc list | grep garm- | wc -l
counts 82 containers
When looking into the garm log I see the following for one of the "ghost containers":
But the container is still on the provider:
When looking into the syslog of the container, I see this:
So at first glance it looks like garm checked the runners on GitHub while the service inside the container was restarting? But garm should be able to detect that the container is still running on LXD. From the source code, garm checks against the provider, so I'm not sure what could fail here.