Auto scaling not taking into account whether a repo has access to a runner group.

Hdom commented 8 months ago

Hello, I have found that the current auto scaling mechanism is not taking into account whether a repo is registered to the runner group specified by an org pool.

Currently we have many repos that are not registered under the runner group specified for pools, when any of those repos have a queued up workflow job it triggers the auto scale because they are in the same org and the tags match.

Unfortunately most of our existing workflows are set to self-hosted tag so they are triggering autoscaling of pools that should not be (since they aren't in the runner group they wont be picked up). This is causing all of our pools to constantly pin runner amount to max and recreate runners non stop (cleaned up by scaleDown loop).

As a temporary workaround I have created a fork of garm and have implemented the following:

Added a runner group map with a selected repos "cache" list, in memory, inside of the organization object.
```
type organization struct {
...
rgs         map[string]*common.GithubRunnerGroup
}
```

type GithubRunnerGroup struct { ID int64 Name string Visibility string SelectedRepositories github.ListRepositories }


2. On org pool manager startup, schedule job to periodically (10 mins) refresh selected repos list for every enabled pool.
    - Loops over all pools and generates `common.GithubRunnerGroup` object with runner group and selected repos info from ghcli.
4. On HandleWorkflowJob verify repo is in runner group's selected repos list
    - Loops over potential pools (reduced by tags)
      - On public or private visibility, return true
      - On selected visibility, run logic
6. Add api interface to execute selected-repo list update on a specific pool list
    - Will trigger this programatically when a repo is registered to a runner group

gabriel-samfira commented 8 months ago

Hmm. This is an interesting case. The ability to schedule a runner based on runner group is currently missing in GARM.

For example, we completely ignore workflows such as the following.

runs-on:
  group: my-runner-group

We need to be able to schedule a pool based on group as well as labels:

runs-on:
  group: my-runner-group
  labels: ["large", "self-hosted"]

And currently GARM doesn't do anything with the group name. We can register runners in a runner group, but if a workflow comes in with a runs-on set to the runner group name, we never react.

Do your workflows explicitly target groups or do they just use self-hosted and nothing else?

This week I'll be traveling, but next week I will definitely attempt to fix this huge missing feature. We already record the runner group name in the pool definition. There is no reason we can't schedule the proper pool when a job comes in that targets a specific pool. So even if you do use self-hosted, as long as you specify the runner group in your workflows, only pools that have that runner group set will be matched.

As a side note, I highly advise to avoid using any of the default labels and use unique label sets if at all possible for a better scheduling experience.

Hdom commented 8 months ago

Yeah unfortunately we already have too many repos with self-hosted as the only label so runner group control is the only way we have to prevent breaking those repos. Currently working on getting the repos updated but we don't own the majority of them so I can't go and make the change directly.

Any thoughts on the "selected repos" feature of runner groups?

gabriel-samfira commented 8 months ago

I'm guessing the repos that use self-hosted don't really target a runner group. You're just trying to restrict who gets to use a certain set of runners by granting granular control to one repo or another. And in the case detailed above, garm spins up runners in the wrong place which never really gets used due to ACLs on the groups.

So there are actually 3 features to potentially implement here:

1) the ability to target a pool based on runner group as well as labels. 2) adding some sort of ACLs in GARM itself which allows us define an allow list or an exclude list of repos for which a pool should be eligible 3) a way to sync that allow/deny list based on the runner group defined on a pool and the repos that are allowed to use it.

We need to figure out what the impact is (in term of API calls to GitHub) of keeping that allow list in sync with repos that have been granted access to a runner group. We also need to take into account that runner groups can be created at the enterprise level and shared with orgs, which in turn can share them with repos. This last aspect may be inconsequential, but we still need to explore it.

Using the runner group as a source of truth for 2) seems correct, but the fact that those runner groups can be updated at any time to allow/deny public repos or that the runner group can be shared with more repos at any time, without sending out a webhook, means we need to poll the runner groups as well. This is a potential pain in the neck, especially if you have huge orgs with many repos and many runner groups.

But this is something that definitely needs to be explored more. I can definitely have a look at points 1) and 2) starting next week. For 3) I need some time to think, and hopefully access to an enterprise account.

gabriel-samfira commented 8 months ago

Yeah. For enterprises, we can only see orgs we've shared the runner group with. If we have an enterprise with many orgs, and those orgs have many repos, we'd need to first list all orgs with access to the runner group, then for each org we'd need to see which repo they share that runner group with and add those repos to the list. It can potentially generate a lot of API requests, and it can be really slow.

Hdom commented 8 months ago

Yeah, didn't know that enterprises had runner groups, definitely adds an additional dimension of complexity to the solution.

You are correct the self-hosted don't target a runner group, they just rely on there not being any "global" runner groups and the built in runner group allow list.

For organization runner groups the api impact is 1 api call to get the runer group ID from runner group name (unless you cache the runner group ID somewhere) and 1 api call to get the list of repositories, more or less 2 * # of pools * frequency of update which for my workaround is around 300 api calls per hour which is fine but we currently have a fairly small selection of pools. It also has the side effect that repos have to wait up to 10 minutes to get picked up by a runner if they were just added to the allow list. Thats why I added an api interface to trigger the update for a specific pool. I am planning to build some automation to replace the manual allow list creation and it will trigger the GARM Api to update allow list after saving the runner group.

gabriel-samfira commented 8 months ago

This is a mess. It seems that jobs in queued state do not set the runner group, even if the workflow explicitly sets:

runs-on:
  group: my-runner-group
  labels: ["large", "self-hosted"]

So until a job is picked up, we don't know the runner group. This means that we can't really schedule by group name. We can still attempt to create that allowlist/blocklist of repos that we should react to when a job comes in. If the pool has an explicit runner group set and the entity requesting a runner is not allowed in any of the runner groups set on pools, we can ignore the job.

But we still have to solve the issue of API requests. Will think about how we can do this in a sane way.

Hdom commented 2 months ago

Hey @gabriel-samfira has there been any recent breakthroughs on this issue? We have a really inconvenient custom solution in place and would like to move away from it.

Just want to add that ontop of selected-repos, runner groups also have a selected-workflow functionality that is not being taken into account by the auto-scaler.

gabriel-samfira commented 2 months ago

sadly no. They still don't send the runner group as part of a queued webhook payload. We'd need to periodically brute force scan to update mappings. I don't have cycles to address this for now. Hopefully soon.

cloudbase / garm

Auto scaling not taking into account whether a repo has access to a runner group. #236