CloudSnorkel / cdk-github-runners

CDK constructs for self-hosted GitHub Actions runners
https://constructs.dev/packages/@cloudsnorkel/cdk-github-runners/
Apache License 2.0

Fargate runner fails to update itself #457

Closed kichik closed 3 weeks ago

kichik commented 7 months ago

Seen in #451. I had a runner image with runner version 2.310.2 that tried to update itself to 2.311.0. The update failed for a reason I couldn't identify, which resulted in:

The self-hosted runner: CloudSnorkel-cdk-github-runners-1f43e9e0-740f-11ee-8cdb-8912db59 lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

Runner logs showed:

2023-10-27T12:41:32.708-04:00   Current runner version: '2.310.2'
2023-10-27T12:41:32.709-04:00   2023-10-27 16:41:32Z: Listening for Jobs
2023-10-27T12:41:44.940-04:00   Runner update in progress, do not shutdown runner.
2023-10-27T12:41:45.055-04:00   Downloading 2.311.0 runner
2023-10-27T12:42:32.261-04:00   Waiting for current job finish running.
2023-10-27T12:42:32.304-04:00   Generate and execute update script.
2023-10-27T12:42:32.368-04:00   Runner will exit shortly for update, should be back online within 10 seconds.
2023-10-27T12:42:32.379-04:00   Runner update process finished.
2023-10-27T12:42:33.136-04:00   Runner listener exit because of updating, re-launch runner after successful update
2023-10-27T12:43:04.027-04:00   Restarting runner...
2023-10-27T12:43:04.053-04:00   /home/runner/run-helper.sh: line 36: /home/runner/bin/Runner.Listener: No such file or directory

  1. This happened consistently on Fargate x64, Fargate arm64, and Fargate arm64 spot, but not on Fargate x64 spot. Fargate x64 and Fargate x64 spot use the same runner image.
  2. While updating, the runner program moves bin to bin.OLD_VERSION and then symlinks bin to bin.NEW_VERSION. It does this for both the bin and externals folders. In this case, the bin symlink was missing.
  3. The _diag logs showed no error creating the symlink (called a junction because it's .NET):
     2023-10-27T16:38:06.498000+00:00 runner/runner/c130531fe6f641d2a96da9fcf2400571 [2023-10-27 16:38:01-8080] Create junction bin folder
     2023-10-27T16:38:06.498000+00:00 runner/runner/c130531fe6f641d2a96da9fcf2400571 [2023-10-27 16:38:01-8113] Create junction externals folder
  4. Many other providers were using the same 2.310.2 and upgraded successfully to 2.311.0. This includes, as mentioned in (1), Fargate x64 spot.
  5. Using the same image locally with Docker didn't reproduce the issue, and the update logs looked basically the same.
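
For anyone trying to reproduce this, the folder swap described in (2) can be sketched roughly as below. This is not the runner's actual update code (that is .NET plus a generated update script); it is a minimal Python illustration that assumes the folder names and version numbers from this issue. The helpers `swap_to_new_version`, `missing_links` and `demo` are hypothetical names made up for the example. The last step deletes the bin link on purpose to show what the failed Fargate runner ended up looking like.

```python
import os
import tempfile
from pathlib import Path


def swap_to_new_version(root: Path, name: str, old: str, new: str) -> None:
    """Rename <name> to <name>.<old>, then symlink <name> to <name>.<new>."""
    current = root / name
    os.rename(current, root / f"{name}.{old}")
    # Relative target: resolved against the directory that holds the link.
    os.symlink(f"{name}.{new}", current, target_is_directory=True)


def missing_links(root: Path, names=("bin", "externals")) -> list:
    """Return the names that are not symlinks resolving to a directory."""
    return [
        n for n in names
        if not ((root / n).is_symlink() and (root / n).is_dir())
    ]


def demo():
    """Run the swap in a temp dir; return check results before/after breaking bin."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("bin", "externals"):
            (root / name).mkdir()               # folder of the running version
            (root / f"{name}.2.311.0").mkdir()  # freshly downloaded version
        (root / "bin.2.311.0" / "Runner.Listener").touch()

        swap_to_new_version(root, "bin", "2.310.2", "2.311.0")
        swap_to_new_version(root, "externals", "2.310.2", "2.311.0")
        healthy = missing_links(root)

        # Simulate the state seen on Fargate: the bin link is gone.
        os.unlink(root / "bin")
        broken = missing_links(root)
        return healthy, broken


if __name__ == "__main__":
    print(demo())
```

Pointing a check like `missing_links` at /home/runner right before run-helper.sh restarts the listener would at least confirm whether the link was never created or was removed afterwards.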

fargate-failed-update-no-bin.log docker-good-update.log