actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Unable to receive `JobAssigned` message when the runnerScaleSet is recreated #3363

Closed Fatme closed 3 months ago

Fatme commented 3 months ago

Controller Version

latest

Deployment Method

Other

To Reproduce

Setup the listener locally:

1. Clone `https://github.com/Fatme/actions-runner-controller/tree/master`. It adds logic to create and delete the runnerScaleSet from the listener.
2. Go to `cmd/ghalistener`.
3. Run `go build -o app`.
4. Prepare a `config.json` file:

{
    "ConfigureUrl": "<my-repo-url>",
    "EphemeralRunnerSetName": "test1",
    "EphemeralRunnerSetNamespace": "actions-runner-system",
    "RunnerScaleSetName": "<my-runner-scaleset-name>",
    "AppID": <my-app-id>,
    "AppInstallationID": <my-app-installation-id>,
    "AppPrivateKey": "<my-app-private-key>"
}
5. Start the listener locally by running `LISTENER_CONFIG_PATH=<path-to-config.json> ./app/ghalistener`
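Before starting the listener, it can help to confirm that the file at `LISTENER_CONFIG_PATH` parses and has every field filled in. The following is a minimal sketch, not the listener's own loader: the `listenerConfig` struct, the `parseConfig` function, and the validation rules are hypothetical and only mirror the keys shown in the `config.json` above, assuming GitHub App authentication with numeric `AppID` and `AppInstallationID`.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// listenerConfig mirrors only the keys shown in config.json above.
// Hypothetical type for illustration; not taken from the project source.
type listenerConfig struct {
	ConfigureUrl                string
	EphemeralRunnerSetName      string
	EphemeralRunnerSetNamespace string
	RunnerScaleSetName          string
	AppID                       int64
	AppInstallationID           int64
	AppPrivateKey               string
}

// parseConfig decodes the JSON and applies a few assumed sanity checks
// so that leftover placeholders are caught before the listener starts.
func parseConfig(data []byte) (listenerConfig, error) {
	var cfg listenerConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, fmt.Errorf("decode config: %w", err)
	}
	switch {
	case cfg.ConfigureUrl == "":
		return cfg, errors.New("ConfigureUrl is empty")
	case cfg.RunnerScaleSetName == "":
		return cfg, errors.New("RunnerScaleSetName is empty")
	case cfg.AppID <= 0 || cfg.AppInstallationID <= 0:
		return cfg, errors.New("AppID and AppInstallationID must be positive")
	case cfg.AppPrivateKey == "":
		return cfg, errors.New("AppPrivateKey is empty")
	}
	return cfg, nil
}

func main() {
	path := os.Getenv("LISTENER_CONFIG_PATH")
	if path == "" {
		fmt.Println("LISTENER_CONFIG_PATH is not set")
		return
	}
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Println("read config:", err)
		return
	}
	if _, err := parseConfig(data); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("config looks complete")
}
```

Unquoted placeholders such as `<my-app-id>` are not valid JSON, so a file that still contains them fails at the decode step.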

Steps to repro the issue:

  1. Trigger a CI build
  2. Ensure the listener is up and running locally
  3. Stop the listener while the CI build is running
  4. Start the listener again

Expected behavior: The CI build should be executed when the listener is restarted.

Actual behavior: The CI build is not executed at all. It hangs forever.


### Describe the bug

It seems that the `JobAssigned` message is not received when the runnerScaleSet is recreated and the listener is restarted.

* The `JobAvailable` message is received
* `AcquireJobs` returns a success result
* `JobAssigned` is never received
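The sequence above can be checked mechanically against a listener's observed events. The sketch below is a hypothetical helper, not part of the listener: the `event` type, the event kind constants, and `stuckJobs` are invented for illustration. It reports jobs that were acquired but never followed by a `JobAssigned` event, which is the state described in this report.

```go
package main

import "fmt"

// Hypothetical event kinds modelled on the three steps listed above.
const (
	jobAvailable = "JobAvailable"
	jobsAcquired = "Acquired"
	jobAssigned  = "JobAssigned"
)

// event is an illustrative record of something the listener observed.
type event struct {
	kind  string
	jobID int64
}

// indexOf returns the position of id in ids, or -1 if it is absent.
func indexOf(ids []int64, id int64) int {
	for i, v := range ids {
		if v == id {
			return i
		}
	}
	return -1
}

// stuckJobs returns, in acquisition order, the IDs of jobs that were
// acquired but for which no JobAssigned event was seen afterwards.
func stuckJobs(events []event) []int64 {
	var pending []int64
	for _, e := range events {
		switch e.kind {
		case jobsAcquired:
			if indexOf(pending, e.jobID) < 0 {
				pending = append(pending, e.jobID)
			}
		case jobAssigned:
			if i := indexOf(pending, e.jobID); i >= 0 {
				pending = append(pending[:i], pending[i+1:]...)
			}
		}
	}
	return pending
}

func main() {
	// Mirrors the reported run: job 359 becomes available and is acquired,
	// but no JobAssigned message ever arrives.
	events := []event{
		{jobAvailable, 359},
		{jobsAcquired, 359},
	}
	fmt.Println(stuckJobs(events)) // prints [359]
}
```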

### Describe the expected behavior

The expected behavior is to receive `JobAssigned` in this situation.

### Additional Context

Here is the content of the config.json file:

```json
{
    "ConfigureUrl": "<my-repo-url>",
    "EphemeralRunnerSetName": "test1",
    "EphemeralRunnerSetNamespace": "actions-runner-system",
    "RunnerScaleSetName": "<my-runner-scaleset-name>",
    "AppID": <my-app-id>,
    "AppInstallationID": <my-app-installation-id>,
    "AppPrivateKey": "<my-app-private-key>"
}
```

### Controller Logs

Here are the logs from the locally running listener:

2024-03-18T14:57:31+02:00       INFO    listener-app    app initialized
2024-03-18T14:57:31+02:00       INFO    listener-app    Starting listener
2024-03-18T14:57:31+02:00       INFO    listener-app    refreshing token        {"githubConfigUrl": "https://github.com/Fatme/test-gh-orka-integration"}
2024-03-18T14:57:31+02:00       INFO    listener-app    getting access token for GitHub App auth        {"accessTokenURL": "https://api.github.com/app/installations/45969017/access_tokens"}
2024-03-18T14:57:31+02:00       INFO    listener-app    getting runner registration token       {"registrationTokenURL": "https://api.github.com/repos/Fatme/test-gh-orka-integration/actions/runners/registration-token"}
2024-03-18T14:57:32+02:00       INFO    listener-app    getting Actions tenant URL and JWT      {"registrationURL": "https://api.github.com/actions/runner-registration"}
2024-03-18T14:57:33+02:00       INFO    listener-app.listener   Successfully created runnerScaleSet with ID     {"ID": 433}
2024-03-18T14:57:33+02:00       INFO    listener-app.listener   Current runner scale set statistics.    {"statistics": "{\"totalAvailableJobs\":0,\"totalAcquiredJobs\":0,\"totalAssignedJobs\":0,\"totalRunningJobs\":0,\"totalRegisteredRunners\":0,\"totalBusyRunners\":0,\"totalIdleRunners\":0}"}
2024-03-18T14:57:33+02:00       INFO    listener-app.listener   Getting next message    {"lastMessageID": 0}
2024-03-18T14:57:39+02:00       INFO    listener-app.listener   Processing message      {"messageId": 1, "messageType": "RunnerScaleSetJobMessages"}
2024-03-18T14:57:39+02:00       INFO    listener-app.listener   New runner scale set statistics.        {"statistics": {"totalAvailableJobs":1,"totalAcquiredJobs":0,"totalAssignedJobs":0,"totalRunningJobs":0,"totalRegisteredRunners":0,"totalBusyRunners":0,"totalIdleRunners":0}}
2024-03-18T14:57:39+02:00       INFO    listener-app.listener   Job available message received  {"jobId": 359}
2024-03-18T14:57:39+02:00       INFO    listener-app.listener   Acquiring jobs  {"count": 1, "requestIds": "[359]"}
2024-03-18T14:57:40+02:00       INFO    listener-app.listener   Jobs are acquired       {"count": 1, "requestIds": "[359]"}
2024-03-18T14:57:40+02:00       INFO    listener-app.listener   Deleting last message   {"lastMessageID": 1}
2024-03-18T14:57:41+02:00       INFO    listener-app.listener   Getting next message    {"lastMessageID": 1}

### Runner Pod Logs

N/A

nikola-jokic commented 3 months ago

Hey @Fatme,

I was unable to reproduce this issue. I started the listener and forced an error before the patch. When the listener came back up, it picked up the assigned jobs from the statistics and was then able to scale. We scale based on the statistics, but we work on message types.