cloudbase / garm

GitHub Actions Runner Manager
Apache License 2.0

Take a maximum number of runners per provider into account #204

Open SystemKeeper opened 9 months ago

SystemKeeper commented 9 months ago

Just an idea that came to mind and that I wanted to share. Consider the following:

The problem is that there's currently no way to specify that we can run at most 20 runners in total. When 10 runners are assigned to each org, all is fine. But when org A is rarely used, org B is still limited to 10 runners, even though more runners could be used most of the time because the resources are available. If I use autoscaling and set org A to a maximum of 10 runners and org B to a maximum of 20 runners, this works fine as long as org A has no active runners. But when org A ramps up, I end up with 30 runners, which is a problem for the system.
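To make the capacity mismatch concrete, here is a minimal Go sketch. The `Pool` type, field names, and the provider capacity value are hypothetical illustrations, not GARM's actual types or configuration; the point is only that the sum of per-org maximums can exceed what the provider can actually sustain.

```go
package main

import "fmt"

// Pool is a simplified stand-in for a per-org runner pool.
type Pool struct {
	Org        string
	MaxRunners int
}

func main() {
	providerCapacity := 20 // what the underlying system can actually sustain

	pools := []Pool{
		{Org: "org-a", MaxRunners: 10},
		{Org: "org-b", MaxRunners: 20},
	}

	// Worst case: every pool scales to its own maximum at the same time.
	worstCase := 0
	for _, p := range pools {
		worstCase += p.MaxRunners
	}

	fmt.Printf("worst-case runners: %d, provider capacity: %d\n", worstCase, providerCapacity)
	if worstCase > providerCapacity {
		fmt.Println("pools can collectively exceed what the provider can handle")
	}
}
```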

What I was thinking as a rough idea:

So given my example above, this would:

gabriel-samfira commented 8 months ago

Hi @SystemKeeper !

Sorry for the late reply.

For the most part, this sounds like a good idea, especially for providers that deal with limited resources, like LXD, Incus, K8s, and potentially future providers for various other systems.

The current architecture of GARM is really simple: it's a single-process, single-server app. It doesn't currently scale horizontally at all. It could with some refactoring, but up to this point that hasn't really been needed, at least not for performance reasons. However, there are plans to split it up into multiple components in the future:

At that point, I think we could start thinking about something along the lines of what you described. It wouldn't be impossible to add something like this to the current code, but it would be difficult to do so without making the code harder to decouple later. Once we have a proper scheduler component, we can develop this further and potentially implement "filters", similar in concept to what OpenStack has. A request would be passed through the scheduler, which would weigh it using whichever filters are enabled and decide whether a worker is returned to take care of the task, or the request is throttled and re-queued later.
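As a rough illustration of how such a filter chain might look, here is a minimal Go sketch. The `Request`, `Filter`, and `providerCapacityFilter` names, and the cap value, are all hypothetical and not part of GARM's codebase; this is only a sketch of the "weigh the request, then dispatch or re-queue" idea.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a simplified stand-in for a "create runner" request that a
// future scheduler component might weigh before dispatching to a provider.
type Request struct {
	Provider    string
	ActiveTotal int // runners currently active on the target provider
}

// Filter decides whether a request may proceed. Filters are meant to be
// chained, similar in spirit to OpenStack's scheduler filters.
type Filter func(Request) error

// providerCapacityFilter rejects requests once a provider-wide cap is reached.
func providerCapacityFilter(limit int) Filter {
	return func(r Request) error {
		if r.ActiveTotal >= limit {
			return errors.New("provider at capacity, re-queue for later")
		}
		return nil
	}
}

// schedule runs the request through every filter; the first rejection wins.
func schedule(r Request, filters []Filter) error {
	for _, f := range filters {
		if err := f(r); err != nil {
			return err
		}
	}
	return nil // a worker would pick up the task here
}

func main() {
	filters := []Filter{providerCapacityFilter(20)}

	if err := schedule(Request{Provider: "lxd", ActiveTotal: 20}, filters); err != nil {
		fmt.Println("throttled:", err)
	} else {
		fmt.Println("dispatched to a worker")
	}
}
```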

All of that, however, is dramatically more complex than what currently exists in GARM, and it is a large effort that will (probably) see us moving away from SQLite (even though there are some interesting projects out there that could help us stay with SQLite, that would probably be like jamming a triangle into a square).

I will keep this open (potentially for a long time), but this is something that I acknowledge is useful in some cases.