actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Metrics endpoint is not available on listeners that don't have active runners #3784

Closed by velkovb 2 weeks ago

velkovb commented 4 weeks ago

Checks

Controller Version

0.9.3

Deployment Method

Helm

To Reproduce

1. Go to your Prometheus endpoint.
2. Go to Targets and find the targets for your listener pods.
3. All listeners that don't have running jobs appear as down.
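
The same check can be scripted instead of clicked through. Below is a minimal sketch, assuming Prometheus is reachable over HTTP and serves its standard `/api/v1/targets` API. The function names and the address in the usage note are illustrative assumptions, not part of the original report.

```python
import json
from urllib.request import urlopen


def down_targets(targets_payload: dict) -> list[tuple[str, str]]:
    """Return (scrapeUrl, lastError) for every active target whose health is not 'up'."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]


def fetch_targets(prometheus_url: str) -> dict:
    """Fetch the target list from the Prometheus HTTP API (hypothetical helper)."""
    with urlopen(f"{prometheus_url}/api/v1/targets") as resp:
        return json.load(resp)
```

For example, `down_targets(fetch_targets("http://localhost:9090"))` would list the listener scrape URLs that Prometheus currently marks as down, along with the last scrape error. The address is an assumption; substitute your own Prometheus server.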

Describe the bug

We have 24 runner types and deploy 24 different scale sets. Only a few of them have warm runners (minRunners) enabled, and only those show up in Prometheus as live targets. The ones that have no runners appear as down. Not sure if this could be related to https://github.com/actions/actions-runner-controller/pull/3445/files

This causes a problem with the gha_registered_runners metric, which never drops to 0. For example, if we have 20 warm runners and scale them down to 0 for the night, gha_registered_runners stays at 20 because the metrics endpoint never reports the 0 value.
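
To see what a listener itself reports, independent of Prometheus, the endpoint can be read directly. Below is a minimal sketch, assuming the listener pod's metrics port (`:8081` with path `/metrics`, as in the values under Additional Context) has been made reachable locally, for example via a port-forward. The function names are hypothetical, and the parser only handles plain sample lines of the Prometheus text exposition format.

```python
from urllib.request import urlopen


def gauge_values(exposition_text: str, metric: str) -> list[float]:
    """Collect the sample values of `metric` from Prometheus text-format output."""
    values = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like "<name>{labels} <value>" or "<name> <value>".
        name = line.split("{", 1)[0].split()[0]
        if name != metric:
            continue
        rest = line.rsplit("}", 1)[1] if "{" in line else line[len(name):]
        tokens = rest.split()
        if tokens:
            values.append(float(tokens[0]))  # tokens[1], if present, is a timestamp
    return values


def fetch_metrics(listener_url: str) -> str:
    """Read the raw metrics page from a listener (hypothetical helper)."""
    with urlopen(listener_url) as resp:
        return resp.read().decode("utf-8")
```

With a port-forward in place, `gauge_values(fetch_metrics("http://localhost:8081/metrics"), "gha_registered_runners")` would return the values the listener currently exposes. If the request fails outright, the endpoint itself is down, rather than the gauge being stale.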

P.S. We observed another issue caused by this. When a scale set configured with 0 min runners receives a job request, it spins up new runner pods but does not activate the metrics endpoint, which leads to missing metrics.

Describe the expected behavior

All listeners should report metrics even if there are no active jobs on them.

Additional Context

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: role
          operator: In
          values:
          - spot
fullnameOverride: gha-runner-scale-set-controller
metrics:
  controllerManagerAddr: :8080
  listenerAddr: :8081
  listenerEndpoint: /metrics
priorityClassName: system-cluster-critical
replicaCount: 1
resources:
  limits:
    memory: 512Mi
  requests:
    cpu: 200m
    memory: 512Mi
tolerations:
- effect: NoSchedule
  key: pot
  operator: Equal
  value: "true"

Controller Logs

https://gist.github.com/velkovb/eccd211ad8776ce4f63d55c61b954879

Runner Pod Logs

No runners involved here.

github-actions[bot] commented 4 weeks ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

velkovb commented 2 weeks ago

This seems to have strangely disappeared. Closing for now.