Open pathcl opened 5 months ago
Hi @pathcl
The scope of GARM is limited to successfully spinning up runners and making them available to the workflow jobs that are triggered on GitHub. Everything we add to GARM is geared towards that scope. But let me address each point:
If there's a stuck workflow because of a failed runner/provider. I know we have a timeout for bootstrap
Indeed, this is something that can be addressed at the same time as adding metrics to providers (see below). If the stuck workflow is a symptom of a stuck runner/provider, then that should be surfaced by metrics added to providers. Otherwise, watching workflows themselves is outside the scope of GARM. We only care about the workflow jobs that we record. The distinction is important. We may not record all jobs, for various reasons:
Sadly, there is no efficient way to fix either of the last 2 scenarios. We have orgs which may have many repos, and we have enterprises which may have many orgs, each of which may have many repos. Workflows only exist at the repo level, so attempting to ingest any workflows we missed means hammering the GH API for workflows across potentially thousands of repos.
The only potential workaround is operator-driven: if the operator of GARM knows that GARM/GitHub was down for a while and some jobs were missed, they can raise min-idle-runners to match max-runners until the queue on their GitHub repos is consumed, and then set min-idle-runners back to its original value.
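As a rough sketch, that temporary bump could look like this with garm-cli. Treat this as illustrative only: the pool ID is a placeholder and flag spellings may differ between garm-cli versions, so check `garm-cli pool update --help` on your installation first.

```shell
# Find the pool that serves the affected repo/org/enterprise.
garm-cli pool list

# Temporarily raise min-idle-runners to match max-runners so that queued
# jobs whose webhooks GARM never received still get runners.
garm-cli pool update <POOL_ID> --min-idle-runners=20

# Once the backlog on GitHub has drained, restore the original value.
garm-cli pool update <POOL_ID> --min-idle-runners=2
```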
What's the P99/P90 for jobs & runners startup time?
This is one area where we need to improve. We have metrics for the GH API calls, but no metrics for provider calls. We currently can't see whether a runner failed to reach the idle state and is being recreated over and over due to the bootstrap timeout. So yes, this needs to be addressed. Would you be willing to open an issue about this?
Get better insights about jobs. It should be possible to log/report about webhook events.
We have that. If you look at the function you highlighted above, you should see the log calls in its body:
https://github.com/cloudbase/garm/blob/8f0d44742e3fcae1746b75899c132881b7b4ada1/apiserver/controllers/controllers.go#L112-L115
https://github.com/cloudbase/garm/blob/8f0d44742e3fcae1746b75899c132881b7b4ada1/apiserver/controllers/controllers.go#L120-L123
https://github.com/cloudbase/garm/blob/8f0d44742e3fcae1746b75899c132881b7b4ada1/apiserver/controllers/controllers.go#L125-L128
https://github.com/cloudbase/garm/blob/8f0d44742e3fcae1746b75899c132881b7b4ada1/apiserver/controllers/controllers.go#L134-L137
We can improve on this. If you have any suggestions regarding what extra info you believe would make sense, we can find a way to add it.
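For instance, one low-effort addition would be a structured one-line summary of each workflow_job event as it comes in. A hedged stdlib-only sketch (the struct covers only an illustrative subset of the payload; the JSON field names follow GitHub's workflow_job webhook documentation, but this is not GARM's internal type):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workflowJobEvent mirrors the subset of GitHub's workflow_job webhook
// payload that is useful for a log line.
type workflowJobEvent struct {
	Action      string `json:"action"`
	WorkflowJob struct {
		ID     int64    `json:"id"`
		Name   string   `json:"name"`
		Status string   `json:"status"`
		Labels []string `json:"labels"`
	} `json:"workflow_job"`
	Repository struct {
		FullName string `json:"full_name"`
	} `json:"repository"`
}

// summarize parses a webhook body and returns a compact log line.
func summarize(payload []byte) (string, error) {
	var ev workflowJobEvent
	if err := json.Unmarshal(payload, &ev); err != nil {
		return "", err
	}
	return fmt.Sprintf("job=%d repo=%s action=%s labels=%v",
		ev.WorkflowJob.ID, ev.Repository.FullName, ev.Action,
		ev.WorkflowJob.Labels), nil
}
```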
GitHub Actions doesn't provide a retry mechanism. How do we cope with that?
That is something we can't fix. The scope of GARM is to make a runner available to a workflow job. As long as we receive a webhook for a queued job that we can handle, we go ahead and try to create a runner. Retrying jobs, however, is something that should be done by the repo maintainer. Any retry attempt will generate new jobs, which will reach GARM, and GARM will do the right thing (hopefully).
[...] We have metrics for the GH API calls, but no metrics for provider calls [...]
We are able to "calculate" some kind of error rate for provider interaction (this metric is part of the runner_ scope) - please see my comment here.
Slightly off topic, but somewhat related to this discussion. I can't share the exact code here (will discuss whether that's possible - but it's not that complicated, as the next few words show :sweat_smile:):
We are operating GARM at the enterprise level. To make this work, we receive every action event from GitHub (as described in the GARM documentation). That means we get a lot of events, including ones we are not responsible for. To get more insight about our users/customers from the information already present in the event payload, we store it in a database.
To do custom operations with the event requests, we installed Traefik in front of GARM. With that, we are able to use the mirroring feature of Traefik (xref) to send the traffic both to GARM and to our custom piece of code, which parses the event payload and stores information like org/repo, job started, job ended, job queued, job status, ...
With that, we are able to see, for example, how delayed the events GitHub sends to our system are.
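For anyone wanting to replicate this, Traefik's dynamic configuration for such a mirror could look roughly like the following sketch. The router rule, service names, ports, and backend URLs are placeholders for our setup, not values from the GARM docs:

```yaml
http:
  routers:
    webhooks:
      rule: "PathPrefix(`/webhooks`)"
      service: garm-mirrored
  services:
    garm-mirrored:
      mirroring:
        service: garm        # primary backend; its responses are returned
        mirrors:
          - name: event-recorder
            percent: 100     # copy every webhook to the recorder as well
    garm:
      loadBalancer:
        servers:
          - url: "http://garm:9997"
    event-recorder:
      loadBalancer:
        servers:
          - url: "http://recorder:8080"
```

Note that Traefik ignores the mirror's responses, so a slow or broken recorder does not affect webhook delivery to GARM.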
Hello folks,
One of the challenges with runners and GitHub Actions, even after all these years, is still observability.
I'd like to know if there are plans to work on o11y for GARM's webhook handler: https://github.com/cloudbase/garm/blob/8f0d44742e3fcae1746b75899c132881b7b4ada1/apiserver/controllers/controllers.go#L98
Use case(s)