cloudbase / garm

GitHub Actions Runner Manager
Apache License 2.0
138 stars 26 forks source link

extend metrics for github and provider executions #217

Closed bavarianbidi closed 9 months ago

bavarianbidi commented 9 months ago

This PR add couple of new metrics to count interactions with the github API and external provider binaries.

i've followed the recommendation of the prometheus-folks when it came to metrics for this kind of problem: https://promlabs.com/blog/2023/09/19/errors-successes-totals-which-metrics-should-i-expose-to-prometheus/#recommended-for-binary-outcomes-exposing-errors-and-totals

provider binary example metrics:

# HELP garm_runner_operations_total Total number of instance operation attempts
# TYPE garm_runner_operations_total counter
garm_runner_operation_total{operation="CreateInstance",provider="kubernetes_external"} 2
garm_runner_operation_total{operation="DeleteInstance",provider="kubernetes_external"} 2

github api call example metrics:

# HELP garm_github_errors_total Total number of failed github operation attempts
# TYPE garm_github_errors_total counter
garm_github_errors_total{operation="GenerateOrgJITConfig",scope="Organization"} 2
# HELP garm_github_operations_total Total number of github operation attempts
# TYPE garm_github_operations_total counter
garm_github_operations_total{operation="GenerateOrgJITConfig",scope="Organization"} 2
garm_github_operations_total{operation="ListOrganizationRunnerApplicationDownloads",scope="Organization"} 2
garm_github_operations_total{operation="ListOrganizationRunners",scope="Organization"} 1
garm_github_operations_total{operation="RemoveRunner",scope="Organization"} 4

towards #181

TODO:

gabriel-samfira commented 9 months ago

This looks good to me. All this makes me think we might benefit from some telemetry as well. Maybe in a future release we could leverage OpenTelemetry.

I could merge this if you're satisfied with it. Let me know. If you're still working on it, feel free to convert it to draft so I don't accidentally merge it prematurely :grin:

bavarianbidi commented 9 months ago

This looks good to me. All this makes me think we might benefit from some telemetry as well. Maybe in a future release we could leverage OpenTelemetry.

I could merge this if you're satisfied with it. Let me know. If you're still working on it, feel free to convert it to draft so I don't accidentally merge it prematurely 😁

OpenTelemetry is definitely (IMHO) the next thing when it came to observability. TBH i didn't spend that much time into it to understand how/what we need on our side to use it within our company internal managed prometheus service. But as you said, something totally different and unrelated to this PR.

Will convert the PR to draft to finalize it tomorrow morning (especially the second round of looking for external api-calls).

gabriel-samfira commented 9 months ago

OpenTelemetry is definitely (IMHO) the next thing when it came to observability. TBH i didn't spend that much time into it to understand how/what we need on our side to use it within our company internal managed prometheus service. But as you said, something totally different and unrelated to this PR.

definitely something for future releases, not even for the next one. otel allows us to trace calls throughout the codebase and allows us to add context to those traces, like what we're calling functions with, what error they return. With that info, we can then generate detailed graphs of how a call traveled throughout the code base. Further more, this can be correlated even with external providers.

But again, this is for much further into the future. For now, and especially in this PR, metrics are the focus.

bavarianbidi commented 9 months ago

I've found three other API calls i missed in the first round - but for now it looks good - feel free to merge whenever you want