actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

[gha-runner-scale-set-controller] metrics not exposed for the listener #3510

Closed isatfg closed 6 months ago

isatfg commented 6 months ago

Checks

Controller Version

0.9.1

Deployment Method

Helm

To Reproduce

1. In values.yaml, enable metrics:
metrics:
  controllerManagerAddr: ":8080"
  listenerAddr: ":8080"
  listenerEndpoint: "/metrics"

2. Now port-forward to the listener pod on the port configured (8080)

3. In a browser, go to localhost:8080/metrics

You will get an EOF Error.
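
A browser hides whether the failure comes from the port-forward or from the listener itself, so it can help to probe the endpoint from a script. The following is a minimal Python sketch, not part of ARC: the helper name `fetch_metrics` is hypothetical, and it assumes the port-forward from step 2 is listening on localhost:8080.

```python
import urllib.error
import urllib.request


def fetch_metrics(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Return (ok, detail): the response body on success, else an error description."""
    # Bypass any configured HTTP proxy, since the target is a local port-forward.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return True, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # Something answered, but with an error status (for example a wrong path).
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Nothing usable answered: refused connection, reset, timeout,
        # or a port-forward that dropped mid-request.
        return False, f"connection failed: {exc}"
```

Calling `fetch_metrics("http://localhost:8080/metrics")` and printing the result shows whether an HTTP response arrived at all. A "connection failed" result points at the forwarding path rather than at the metrics configuration.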

Describe the bug

I have enabled metrics in the gha-runner-scale-set-controller:

metrics:
  controllerManagerAddr: ":8080"
  listenerAddr: ":8080"
  listenerEndpoint: "/metrics"

I can see that the controller pod is exposing metrics on port 8080 at /metrics:

gha_controller_failed_ephemeral_runners
gha_controller_pending_ephemeral_runners
gha_controller_running_ephemeral_runners
gha_controller_running_listeners

According to the documentation, the listener is the owner of some metrics, e.g.

gha_assigned_jobs
gha_running_jobs

However, these metrics are not exposed on the controller or the listener. When I port-forward to the listener and go to the metrics endpoint, e.g. localhost:8080/metrics, I get an error:

an error occurred forwarding 8080

Describe the expected behavior

When I port-forward to the listener I should get metrics in the same way I get metrics from the controller.
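
One way to make this expectation checkable is to scan the scraped payload for the metric names in question. The sketch below is an illustration only: `metric_names` and `missing_metrics` are hypothetical helpers, the parsing is a simplified reading of the Prometheus text format (it does not strip suffixes such as `_bucket`), and any labels on the real listener metrics are not assumed here.

```python
def metric_names(exposition_text: str) -> set[str]:
    """Collect sample names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like `name{labels} value` or `name value`.
        head = line.split("{", 1)[0].split()
        if head:
            names.add(head[0])
    return names


def missing_metrics(exposition_text: str, expected: list[str]) -> list[str]:
    """Return the expected metric names that do not appear in the payload."""
    found = metric_names(exposition_text)
    return [name for name in expected if name not in found]
```

Passing the listener's response body together with `["gha_assigned_jobs", "gha_running_jobs"]` returns an empty list when both metrics are exposed.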

Additional Context

# Default values for gha-runner-scale-set-controller.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
labels: {}

# leaderElection will be enabled when replicaCount>1,
# So, only one replica will be in charge of reconciliation at a given time
# leaderElectionId will be set to {{ define gha-runner-scale-set-controller.fullname }}.
replicaCount: 1

image:
  repository: "ghcr.io/actions/gha-runner-scale-set-controller"
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: ""

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

env:
## Define environment variables for the controller pod
#  - name: "ENV_VAR_NAME_1"
#    value: "ENV_VAR_VALUE_1"
#  - name: "ENV_VAR_NAME_2"
#    valueFrom:
#      secretKeyRef:
#        key: ENV_VAR_NAME_2
#        name: secret-name
#        optional: true

serviceAccount:
  # Specifies whether a service account should be created for running the controller pod
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  # You can not use the default service account for this.
  name: ""

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"

podLabels: {}

podSecurityContext: {}
# fsGroup: 2000

securityContext: {}
# capabilities:
#   drop:
#   - ALL
# readOnlyRootFilesystem: true
# runAsNonRoot: true
# runAsUser: 1000

resources: {}
## We usually recommend not to specify default resources and to leave this as a conscious
## choice for the user. This also increases chances charts run on environments with little
## resources, such as Minikube. If you do want to specify resources, uncomment the following
## lines, adjust them as necessary, and remove the curly braces after 'resources:'.
# limits:
#   cpu: 100m
#   memory: 128Mi
# requests:
#   cpu: 100m
#   memory: 128Mi

nodeSelector: {}

tolerations: []

affinity: {}

# Mount volumes in the container.
volumes: []
volumeMounts: []

# Leverage a PriorityClass to ensure your pods survive resource shortages
# ref: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/
# PriorityClass: system-cluster-critical
priorityClassName: ""

## If the `metrics:` object is not provided, or is commented out, the following flags
## will be applied to the controller-manager and listener pods with empty values:
## `--metrics-addr`, `--listener-metrics-addr`, `--listener-metrics-endpoint`.
## This will disable metrics.
##
## To enable metrics, uncomment the following lines.
metrics:
  controllerManagerAddr: ":8080"
  listenerAddr: ":8080"
  listenerEndpoint: "/metrics"

flags:
  ## Log level can be set here with one of the following values: "debug", "info", "warn", "error".
  ## Defaults to "debug".
  logLevel: "debug"
  ## Log format can be set with one of the following values: "text", "json"
  ## Defaults to "text"
  logFormat: "text"

  ## Restricts the controller to only watch resources in the desired namespace.
  ## Defaults to watch all namespaces when unset.
  # watchSingleNamespace: ""

  ## Defines how the controller should handle upgrades while having running jobs.
  ##
  ## The strategies available are:
  ## - "immediate": (default) The controller will immediately apply the change causing the
  ##   recreation of the listener and ephemeral runner set. This can lead to an
  ##   overprovisioning of runners, if there are pending / running jobs. This should not
  ##   be a problem at a small scale, but it could lead to a significant increase of
  ##   resources if you have a lot of jobs running concurrently.
  ##
  ## - "eventual": The controller will remove the listener and ephemeral runner set
  ##   immediately, but will not recreate them (to apply changes) until all
  ##   pending / running jobs have completed.
  ##   This can lead to a longer time to apply the change but it will ensure
  ##   that you don't have any overprovisioning of runners.
  updateStrategy: "immediate"

Controller Logs

https://gist.github.com/isatfg/ad4246cdd8f93a2059569885e11f8729

Runner Pod Logs

https://gist.github.com/isatfg/91b656dd58d1ebe2b2316608daf87a33

nikola-jokic commented 6 months ago

Hey @isatfg,

Please correct me if I'm wrong, but the error says that port forwarding is the problem. Is it possible that you tried to forward both the controller and the listener on the same port? I successfully forwarded both the controller and the listener metrics.
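
A quick way to rule out a local port clash before starting a second port-forward is to test whether the port can still be bound. This is a minimal Python sketch under that assumption: `local_port_is_free` is a hypothetical helper, and it only checks the local side of the forward, not the pod.

```python
import socket


def local_port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use", e.g. another port-forward holds it.
            return False
        return True
```

If `local_port_is_free(8080)` returns False while one forward is already running, the second forward needs a different local port. `kubectl port-forward` accepts a LOCAL:REMOTE pair, so something like `8081:8080` keeps the pod-side port unchanged.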

isatfg commented 6 months ago

Hey @nikola-jokic

Hmm, so I was trying to reproduce the issue again to explain the steps, and now I see metrics on the listener as expected. I honestly have no idea what happened.

So now I have metrics enabled, and I can port-forward to the controller and get controller metrics, and port-forward to the listener and get listener metrics. Apologies for wasting your time.

Thank you