garm webhook && metrics/o11y

pathcl commented 5 months ago

Hello folks,

One of the challenges with runners and GitHub Actions, even after years, is still observability.

I'd like to know if we have plans to work on o11y for garm's webhook. https://github.com/cloudbase/garm/blob/8f0d44742e3fcae1746b75899c132881b7b4ada1/apiserver/controllers/controllers.go#L98

Use case(s)

If there's a stuck workflow because of a failed runner/provider. I know we have a timeout for bootstrap
What's the P99/P90 for jobs&runners, startup time
Get better insights about jobs. It should be possible to log/report about webhook events.
Github actions doesn't provide a retry-mechanism. How do we cope with it?

gabriel-samfira commented 5 months ago

Hi @pathcl

The scope of GARM is limited to successfully spinning up runners and making them available to the workflow jobs that are triggered on GitHub. Everything we add to GARM is geared towards that scope. But I'll explain on each point:

If there's a stuck workflow because of a failed runner/provider. I know we have a timeout for bootstrap

Indeed, this is something that can be addressed simultaneously with adding metrics to providers (see below). If the stuck workflow is a symptom of a stuck runner/provider, then that should be addressed by metrics added to providers. Otherwise, it's outside the scope of GARM to watch workflows themselves. We only care about the workflow jobs that we record. The distinction is important. We may not record all jobs for various reasons:

There is no pool to handle it, so we don't care
GARM was down when that event was generated.
GitHub was down and webhooks were never sent out (happens more often than one might think)

Sadly, there is no efficient way to fix any of the last 2 scenarios. We have orgs which may have many repos, and we have enterprises which may have many orgs which may have many repos. Workflows only exist at the repo level, so if we attempt to ingest any workflows we missed, it means hammering the GH API for workflows on potentially thousands of repos.

The only potential workaround: if the operator of GARM knows that GARM/GitHub was down for a while and missed some jobs, they can increase min-idle-runners to match max-runners until the queue on their GitHub repos is consumed, and then set min-idle-runners back to its original value.

What's the P99/P90 for jobs&runners, startup time.

This is one area where we need to improve. We have metrics for the GH API calls, but no metrics for provider calls. We don't currently see if a runner just failed to reach idle state and is just recreated over and over due to the bootstrap timeout. So yes, this needs to be addressed. Would you be willing to open an issue about this?
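To make the P90/P99 question concrete, here is a minimal sketch of how such percentiles could be computed once startup durations are collected. This is not GARM code: the `percentile` helper, the nearest-rank method, and the sample values are all assumptions for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. `p` is a number in (0, 100]."""
    if not samples:
        raise ValueError("no samples to compute a percentile from")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical runner startup times in seconds (creation -> idle).
startup_seconds = [42.0, 45.5, 47.1, 50.3, 51.0, 55.2, 61.8, 70.4, 95.0, 240.0]

p90 = percentile(startup_seconds, 90)
p99 = percentile(startup_seconds, 99)
print(f"P90={p90}s P99={p99}s")
```

With only a handful of samples the high percentiles are dominated by the slowest runner, so a real metric would likely be exported as a histogram and aggregated over a longer window.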

Get better insights about jobs. It should be possible to log/report about webhook events.

We have that. If you look at the function you highlighted above, you should see them in the body of the function:

https://github.com/cloudbase/garm/blob/8f0d44742e3fcae1746b75899c132881b7b4ada1/apiserver/controllers/controllers.go#L112-L115
https://github.com/cloudbase/garm/blob/8f0d44742e3fcae1746b75899c132881b7b4ada1/apiserver/controllers/controllers.go#L120-L123
https://github.com/cloudbase/garm/blob/8f0d44742e3fcae1746b75899c132881b7b4ada1/apiserver/controllers/controllers.go#L125-L128
https://github.com/cloudbase/garm/blob/8f0d44742e3fcae1746b75899c132881b7b4ada1/apiserver/controllers/controllers.go#L134-L137

We can improve on this. If you have any suggestions regarding what extra info would make sense, we can find a way to add it.

Github actions doesn't provide a retry-mechanism. How do we cope with it?

That is something that we can't fix. The scope of GARM is to make a runner available to a workflow job. As long as we receive a webhook for a queued job that we can handle, we go ahead and try to create a runner. But retrying jobs is something that should be done by the repo maintainer. Any retry attempt will generate new jobs, which will get to GARM and GARM will do the right thing (hopefully).

bavarianbidi commented 4 months ago

[...] We have metrics for the GH API calls, but no metrics for provider calls [...]

we are able to "calculate" some kind of error rate when it comes to provider interaction (this metric is part of the runner_ scope) - please see my comment here

bavarianbidi commented 4 months ago

slightly off topic, but somehow related to this discussion here. I can't share the exact code here (will discuss if possible - but it's not that complicated if you read my next few words :sweat_smile: ):

We are operating garm on an enterprise level. To make this work, we are receiving every action event from github (according to the garm documentation).

With that, we get a lot of events, even those we are not responsible for. To get more insights about our users/customers from the information we already have in the event payload, we store this information in a database. To do custom operations with the event requests, we installed traefik in front of garm. With that we are able to use the mirroring feature of traefik (xref) to send the traffic to garm and also to our custom piece of code, which parses the event payload and stores information like org/repo, job started, job ended, job queued, job status, ...
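This is not the code mentioned above, just a minimal sketch of what the mirror-side parser could look like. It assumes the mirrored request body is a GitHub `workflow_job` webhook payload; the table name, the column set, and the sample payload values are made up for illustration.

```python
import json
import sqlite3
from typing import Optional

FIELDS = ("job_id", "repo", "action", "status", "conclusion",
          "created_at", "started_at", "completed_at")


def extract_job_record(payload: dict) -> Optional[dict]:
    """Pick the fields of interest out of a workflow_job event.
    Returns None for events that carry no workflow_job object."""
    job = payload.get("workflow_job")
    if job is None:
        return None
    repo = payload.get("repository") or {}
    return {
        "job_id": job.get("id"),
        "repo": repo.get("full_name"),      # "org/repo"
        "action": payload.get("action"),    # queued / in_progress / completed
        "status": job.get("status"),
        "conclusion": job.get("conclusion"),
        "created_at": job.get("created_at"),
        "started_at": job.get("started_at"),
        "completed_at": job.get("completed_at"),
    }


def store(conn: sqlite3.Connection, record: dict) -> None:
    """Append one event row; the schema here is hypothetical."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_events (%s)" % ", ".join(FIELDS))
    conn.execute(
        "INSERT INTO job_events VALUES (%s)"
        % ", ".join(":" + f for f in FIELDS), record)
    conn.commit()


# Hypothetical, heavily trimmed payload as it would arrive from the mirror.
body = json.dumps({
    "action": "queued",
    "workflow_job": {"id": 1, "status": "queued", "conclusion": None,
                     "created_at": "2023-01-01T12:00:00Z",
                     "started_at": None, "completed_at": None},
    "repository": {"full_name": "example-org/example-repo"},
})

conn = sqlite3.connect(":memory:")
record = extract_job_record(json.loads(body))
if record is not None:
    store(conn, record)
```

Because traefik only mirrors the traffic, a failure in a parser like this should not affect what garm itself receives.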

With that we are, e.g., able to see how delayed GitHub sends events to our system.
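One way such a delay could be estimated, as a rough sketch: compare the time the event was received with the timestamp in the `workflow_job` payload that corresponds to the event's action. The mapping from action to timestamp field is an assumption, not something confirmed in this thread, and the result also includes any clock skew between GitHub and the receiver.

```python
from datetime import datetime, timezone

# Assumed mapping: which payload timestamp marks each kind of event.
TIMESTAMP_FIELD = {
    "queued": "created_at",
    "in_progress": "started_at",
    "completed": "completed_at",
}


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2023-01-01T12:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def delivery_delay_seconds(payload: dict, received_at: datetime):
    """Seconds between the job state change and our receipt of the event,
    or None when the payload has no usable timestamp."""
    field = TIMESTAMP_FIELD.get(payload.get("action"))
    job = payload.get("workflow_job") or {}
    stamp = job.get(field) if field else None
    if not stamp:
        return None
    return (received_at - parse_ts(stamp)).total_seconds()


# Hypothetical event, received seven seconds after the job was created.
event = {"action": "queued",
         "workflow_job": {"created_at": "2023-01-01T12:00:00Z"}}
received = datetime(2023, 1, 1, 12, 0, 7, tzinfo=timezone.utc)
delay = delivery_delay_seconds(event, received)
```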

garm webhook && metrics/o11y #272