actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0
4.41k stars 1.04k forks source link

Include self correction on empty batch and avoid removing pending runners when cluster is busy #3426

Closed nikola-jokic closed 2 months ago

nikola-jokic commented 2 months ago

Currently, listener patches the ephemeral runner set when it receives a job complete message, or the last patch target count is different from the desired count.

When one job finishes, 2 things happen in parallel:

  1. Server notifies the listener. This will trigger a patch request to the ephemeral runner set
  2. Ephemeral runner controller starts setting its phase to finished.

However, if the point 1 happens before point 2, then the ephemeral runner set controller will try scaling down (total number > target number). The scale down favors pending pods. The assumption is that they had the least amount of time to start, so they will likely be the last ones to receive a job. However, with the new scaling model on patch ID, that assumption is wrong. There may be situation where the pending pod (that would eventually receive a job) gets deleted, and will only self-correct on job complete message, or on the next batch.

So this PR aims to:

  1. Change the patching behavior. Patching is done whenever something actionable happens. That means, on every message, increase the patch ID to signal that the state should be checked again.
  2. Empty messages are propagated as well. This is a cheap call. Long poll happens every ~50s, and if message is empty, that would effectively be ~1.2 req/minute.
  3. When ephemeral runner set is busy, it does not remove pending/running runners. It assumes that some of them will update their status eventually, and they will be cleaned up. This is also not expensive, since ephemeral runner controller removes the pod and the secret right away, so only ephemeral runner object is going to be persisted during this time.
  4. Put more significance to patch ID 0. PatchID 0 means that we should forcefully reach this state. This occurs in 3 situations now: 4.1. When listener starts. If something happens to the session, and the listener is restarted, it should force the state that is communicated by the actions service 4.2. When there is an empty batch. Empty batch allows us to self-correct in case something unexpected has happened. This will restart the patchID sequence 4.3. When draining mode is on. When the listener is removed, we should force 0 replicas and 0 patch ID, so we can drive this state to completion as quickly as we can.

Ephemeral runner tests are also split into chunks to avoid handling different scenarios in a single test.

Fixes #3420