actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Scaling runners based on webhook is sometimes stuck #2073

Closed damyan90 closed 1 year ago

damyan90 commented 1 year ago

Checks

Controller Version

0.26.0

Helm Chart Version

0.21.1

CertManager Version

v1.10.1

Deployment Method

Helm

cert-manager installation

Source: https://charts.jetstack.io

Values:

installCRDs: true
podDnsPolicy: 'None'
podDnsConfig:
  nameservers:
    - '1.1.1.1'
    - '8.8.8.8'

Standard helm upgrade --install

Checks

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: small-gha-runner
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.azure.com/agentpool: "fourvcpueph"
      image: {{ .Values.image.repository }}/{{ .Values.image.name }}:{{ .Values.image.tag }}
      imagePullPolicy: {{ .Values.image.imagePullPolicy }}
      group: {{ .Values.github.runnersGroup }}
      organization: {{ .Values.github.organization }}
      labels:
        - small-gha-runner
        - ubuntu-latest-small
      resources:
        limits:
          memory: 5Gi
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: small-gha-runner-autoscaler
spec:
  scaleDownDelaySecondsAfterScaleOut: 30
  minReplicas: 1
  maxReplicas: 20
  scaleTargetRef:
    kind: RunnerDeployment
    name: small-gha-runner
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: '10m'

To Reproduce

1. Define several github workflows with trigger set on Pull Requests
2. Ask developers to start working
3. Observe the situation when many workflows are triggered, new commits pushed. We also have:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

set in our workflows, but the stuck scaling does not seem to happen only when such a cancellation occurs for a particular branch.
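
For context, here is a minimal sketch of the kind of workflow involved. The workflow name, job name, and steps are placeholders; only the trigger, the concurrency block, and the runner label correspond to the configuration shown above:

name: pr-checks   # hypothetical workflow name

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build:   # hypothetical job
    runs-on: [self-hosted, small-gha-runner]
    steps:
      - uses: actions/checkout@v3
      - run: echo "placeholder for the real build steps"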

Describe the bug

Scaling sometimes gets stuck for no visible reason. Editing the horizontalrunnerautoscalers.actions.summerwind.dev resource and forcing minReplicas to a value higher than the current one also has no effect during that period.

No errors in any ARC component.

Jobs stay in the Queued state for up to 1 hour during the day.

Describe the expected behavior

Runners are scaled up/down based on the number of queued jobs in GitHub workflows.

Whole Controller Logs

https://gist.github.com/damyan90/7ffacb6f48ae10f13fd5cf168da142ac

Whole Runner Pod Logs

Not really relevant, but here's an example:
https://gist.github.com/damyan90/567979a84cbbd1210ad1ac423e7bac38

Additional Context

Screenshots attached: 2022-12-06_10-20, 2022-12-06_10-40, 2022-12-06_10-40_1, 2022-12-06_10-40_2

Webhook delivery: https://user-images.githubusercontent.com/24733538/205882331-086724a1-48b7-4e3f-ae77-23d35f959d02.png - 100% successful.

Runner's definition:

FROM summerwind/actions-runner:latest

RUN lsb_release -ra

ENV DIR=opt
COPY apt.sh /$DIR/
RUN sudo chmod +x /$DIR/apt.sh && sudo sh /$DIR/apt.sh 

COPY azure.sh /$DIR/
RUN sudo chmod +x /$DIR/azure.sh && sudo sh /$DIR/azure.sh

COPY software.sh software.json /$DIR/
RUN cd $DIR && sudo chmod +x /$DIR/software.sh && sudo sh /$DIR/software.sh

COPY cleanup.sh /$DIR/
RUN sudo chmod +x /$DIR/cleanup.sh && sudo sh /$DIR/cleanup.sh

RUN sudo apt-get update && sudo apt-get dist-upgrade -y

github-actions[bot] commented 1 year ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

mumoshu commented 1 year ago

I would have to cancel it first and then re-run?

Exactly. It should have worked if you had done so.

cristicalin commented 1 year ago

I'm experiencing a very similar issue. In my case I have minReplicas: 0 and am trying to scale up from 0, since the particular runners we need are quite large and we don't want to keep idle ones running.

damyan90 commented 1 year ago

I would have to cancel it first and then re-run?

Exactly. It should have worked if you had done so.

Well, even so. Can you imagine the reaction of developers who are supposed to do that for their CI pipelines whenever the thing gets stuck? I would be frustrated ;)

mumoshu commented 1 year ago

Of course, I can 😄 However, my conclusion was that, as a "developer" who uses ARC, I'd never want another system like Memcached, Redis, MySQL, etc., just to solve that problem. It isn't only about a code enhancement. But that's not the end of the world. Have you already read GitHub's announcement on ARC's Discussions page?

damyan90 commented 1 year ago

Just did! That's awesome! ;)

davidsielert commented 1 year ago

I have the same problem. What I'm seeing, though, is that for matrix jobs it's not dispatching an event for each matrix item, which would be more of a GitHub platform issue.

ns-ggeorgiev commented 1 year ago

I have the same issue, and ARC in webhook mode is impossible to use with matrix builds at all. I am considering backing the webhook with a second, pull-based controller so I can get the best of both worlds. I wonder if anyone has tried that.
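
For reference, a pull-based counterpart could look roughly like the sketch below, using ARC's documented TotalNumberOfQueuedAndInProgressWorkflowRuns metric. The HRA name and repository list are hypothetical, and whether pull-based metrics and webhook scaleUpTriggers should live in the same HRA is something to verify against the ARC docs:

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: small-gha-runner-pull-autoscaler   # hypothetical name
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: small-gha-runner
  minReplicas: 1
  maxReplicas: 20
  metrics:
    # Polls the GitHub API for queued/in-progress runs instead of
    # relying solely on webhook deliveries.
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - my-repo   # hypothetical repository name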

omer2500 commented 1 year ago

@mumoshu @damyan90 we are experiencing the same issue, did you solve it? It seems like an existing bug: we do see a webhook event (also 100% successful) for every matrix item and there are no errors to be found, but when I expect 20 runners to spawn up, I see only 3-4 being created; the rest stay queued for a long time.

UPDATE: Well, we had 2 replicas for the webhook server; after reducing it to 1, everything is working as it should.

Could it be a race condition? What do you think?

damyan90 commented 1 year ago

Not sure. I have 2 replicas for the webhook as well as for the controller and it works fine for now. I don't see too many issues, but I'm also not downscaling to 0, so it might be that I'm just hiding the issue a bit. Waiting for new versions, though. There's been some development together with GitHub on the autoscaling matter, for now available only to some beta testers. I suppose these issues were addressed there too.

omer2500 commented 1 year ago

Thanks for the info @damyan90. Actually, after more testing, even 1 webhook server pod didn't solve the issue.

Imagine we have a matrix job with 20 tasks, but only 10 of them are scaled up; the other 10 just hang, or some of them might eventually run. It's a little bit random. In my case downscaling to zero is a must because I'm using expensive machines. Hope they will solve it soon.

shirkevich commented 1 year ago

Scaling with the webhook from 0 or 1 replicas is not working for us either.

controller version: 0.27.0

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: runner56-deployment-autoscaler
spec:
  scaleDownDelaySecondsAfterScaleOut: 600
  scaleTargetRef:
    name: runner56-deployment
  minReplicas: 0
  maxReplicas: 3
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "30m"

Deliveries on GitHub are green. How can we debug it further?
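
One thing worth checking (a sketch of where to look, not an official debugging procedure; exact fields may differ between ARC versions) is whether the webhook server actually recorded a capacity reservation on the HRA, and what the controller computed as the desired replica count:

# Inspect the HRA the webhook server is supposed to patch:
#   kubectl get horizontalrunnerautoscaler runner56-deployment-autoscaler -o yaml
#
# Roughly, after a successful webhook delivery you would expect to see
# something along these lines (timestamp is a placeholder):
spec:
  capacityReservations:                       # one entry per workflow_job event
    - expirationTime: "2023-03-20T12:30:00Z"  # roughly now + scaleUpTriggers duration
      replicas: 1
status:
  desiredReplicas: 1                          # what the controller decided to scale to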

adziura-ledger commented 1 year ago

We are experiencing a very similar issue.

A workflow is queued, the webhook is sent from GitHub, the github-webhook-server receives the request and tries to patch the HRA. See the log below:

2023-03-03T14:01:16Z    DEBUG   controllers.webhookbasedautoscaler  Patching hra my-failing-hra for capacityReservations update {"before": 0, "expired": -1, "added": 1, "completed": 0, "after": 1}

It's saying "before": 0, but in fact there is already a runner (pod) up and running a job, and a new runner pod is not being created.

So, in the end, all subsequent attempts to scale the RunnerDeployment fail until the job finishes on the existing "unrecognized" runner.

@mumoshu, any ideas what the root cause could be?

Thanks in advance!

memdealer commented 1 year ago

Hi,

Having the same issue as above. It would be really nice if someone could take a look and resolve it, or at least move it forward.

mumoshu commented 1 year ago

Hey @omer2500 @adziura-ledger @memdealer @shirkevich! I can't say for sure without seeing the full logs of your ARC controller-manager and runner pods, but my theory is that your issue is the same as https://github.com/actions/actions-runner-controller/issues/2306#issuecomment-1445492738.

That is, ARC stops scaling up when it reaches maxReplicas. It happens more often when you have a large divergence between the actual max concurrent jobs queued at a time and maxReplicas.

Imagine you have a 10x10 matrix in one of your workflow jobs... I guess you are affected by the issue when your maxReplicas is less than 1000.

It's already fixed in our main branch via #2258, so I would very much appreciate it if you could test it by building an ARC docker image from the head of our main branch and deploying it in your test environment!
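
For illustration (a hypothetical matrix, not taken from any of the reports above): a single job defined like the one below expands to 10 x 10 = 100 jobs per workflow run, each of which sends its own workflow_job webhook event, so capacity reservations can pile up far beyond a small maxReplicas:

jobs:
  test:   # hypothetical job
    runs-on: [self-hosted, small-gha-runner]
    strategy:
      matrix:
        group: [a, b, c, d, e, f, g, h, i, j]       # 10 placeholder values
        shard: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]      # 10 placeholder values
    steps:
      - run: echo "shard ${{ matrix.shard }} in group ${{ matrix.group }}"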

omer2500 commented 1 year ago

@mumoshu I installed it about a week ago, actually on our production env (we are in the middle of moving to GitHub Actions, so it's fine). I managed to spawn 52 runners for 52 jobs coming from 10 pull requests; minReplicas was 1 and maxReplicas was 80, and it looks good so far!

I will try it with scaling from zero and report back.