actions / runner-container-hooks

Runner Container Hooks for GitHub Actions

containerMode: kubernetes: `Command.RunScriptStep` in `packages/k8s/src/index.ts` never returns #124

Closed · bjoto closed this issue 6 months ago

bjoto commented 7 months ago

I'm running ARC in containerMode: kubernetes, and for some workflow steps, the step never completes.

This is deployed on GKE (Google Kubernetes Engine) with Autopilot.

Workflow:

name: linux-riscv-ci-patches

defaults:
  run:
    shell: bash -leo pipefail {0}

on: push

jobs:
  build-patches:
    runs-on: arc-runner-set-2
    timeout-minutes: 50400 # 35 days
    container:
      image: ghcr.io/linux-riscv/pw-runner-multi:latest
    steps:
      - name: Configure git
        run: |
          git config --global --add safe.directory '*'
      - name: Checkout git
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}
          fetch-depth: 0
      - run: pwd
      - run: ls -l
      - name: Run checks
        run: |
          bash .github/scripts/patches.sh |& tee /build/run.log

The last Run checks step never completes. I've done a build of the Runner that has additional logging, and I can verify that `Command.RunScriptStep` in `packages/k8s/src/index.ts` never completes -- i.e. the hook script `/home/runner/k8s/index.js` never returns.

When I exec into the workflow container, I can verify that the script has completed, and is not running any more.

The Run checks step takes ~1h to complete.

Expected: the Run checks step completes once the script in the workflow container finishes (the hook returns and the runner moves on).

Helm setup:

githubConfigUrl: "https://github.com/bjoto/linux"
githubConfigSecret:
  github_token: "REDACTED"

maxRunners: 10
minRunners: 0

containerMode:
  type: "kubernetes"

template:
  spec:
    securityContext:
      fsGroup: 123
    containers:
      - name: runner
        image: ghcr.io/linux-riscv/runner-multi:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: "/home/runner/config/worker-podspec.yaml"
          - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER
            value: "true"
          - name: ACTIONS_STEP_DEBUG
            value: "true"
          - name: ACTIONS_RUNNER_DEBUG
            value: "true"
          - name: RUNNER_DEBUG
            value: "1"
        volumeMounts:
          - name: worker-podspec-volume
            mountPath: /home/runner/config
            readOnly: true
    volumes:
      - name: worker-podspec-volume
        configMap:
          name: worker-cm
          items:
            - key: worker-podspec.yaml
              path: worker-podspec.yaml
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              storageClassName: "premium-rwx"
              accessModes: [ "ReadWriteMany" ]
              resources:
                requests:
                  storage: 100Gi
bjoto commented 7 months ago

It's 100% reproducible. Please let me know how I can help debug the issue; this is the last blocker (famous last words ;-)) for moving our dedicated runners to ARC/GKE.

nielstenboom commented 6 months ago

I'm thinking it might be related to https://github.com/actions/runner-container-hooks/issues/107

bjoto commented 6 months ago

I thought so as well; I added the try/catch with logging, but it never got hit, unfortunately. :-(

Patch:

diff --git a/packages/k8s/src/k8s/index.ts b/packages/k8s/src/k8s/index.ts
index aac844097c59..f2b3b58457e6 100644
--- a/packages/k8s/src/k8s/index.ts
+++ b/packages/k8s/src/k8s/index.ts
@@ -227,30 +227,34 @@ export async function execPodStep(
 ): Promise<void> {
   const exec = new k8s.Exec(kc)
   await new Promise(async function (resolve, reject) {
-    await exec.exec(
-      namespace(),
-      podName,
-      containerName,
-      command,
-      process.stdout,
-      process.stderr,
-      stdin ?? null,
-      false /* tty */,
-      resp => {
-        // kube.exec returns an error if exit code is not 0, but we can't actually get the exit code
-        if (resp.status === 'Success') {
-          resolve(resp.code)
-        } else {
-          core.debug(
-            JSON.stringify({
-              message: resp?.message,
-              details: resp?.details
-            })
-          )
-          reject(resp?.message)
-        }
-      }
-    )
+    try {
+       await exec.exec(
+           namespace(),
+           podName,
+           containerName,
+           command,
+           process.stdout,
+           process.stderr,
+           stdin ?? null,
+           false /* tty */,
+           resp => {
+               // kube.exec returns an error if exit code is not 0, but we can't actually get the exit code
+               if (resp.status === 'Success') {
+                   resolve(resp.code)
+               } else {
+                   core.debug(
+                       JSON.stringify({
+                           message: resp?.message,
+                           details: resp?.details
+                       })
+                   )
+                   reject(resp?.message)
+               }
+           }
+       )
+    }  catch (error) {
+       core.error(`BT Failed to exec pod step: ${error}`)
+    }
   })
 }

I'm far from a websocket/typescript person, so I'm very much out on a limb here...

nielstenboom commented 6 months ago

Maybe you could try to extend the catch with this:

catch (error) {
    core.error(`BT Failed to exec pod step: ${error}`)
    reject(error)
}

And see if that does it?

nikola-jokic commented 6 months ago

It shouldn't be a problem with the try-catch there, since that one can only fail if the websocket communication fails (a naive opinion for now; I haven't found time to actually work on it yet, but I'm hoping to soon).

I'm hoping it is something much simpler, like changing tty to true. But if you find the problem before me, I would be happy to review it :smile:!

bjoto commented 6 months ago

@nielstenboom I'll spin up a test first thing tomorrow! Note that I did not see the error logging, meaning I didn't land in the catch... :-(

bjoto commented 6 months ago

@nikola-jokic I was trying to figure out if I could somehow log in the node if the websocket went down, but I couldn't wrap my head around it.

bjoto commented 6 months ago

Feel free to hit me up with snippets to try, and I'll take them for a spin!

bjoto commented 6 months ago

@nielstenboom Took your PR #125 for a spin -- same hang, unfortunately.

nielstenboom commented 6 months ago

> @nielstenboom Took your PR #125 for a spin -- same hang, unfortunately.

Hmmm unfortunate :(

Which k8s version are you guys running? One more thing you could try maybe is bumping the @kubernetes/client-node package? No clue if it'll help, just spewing random ideas here haha 😄

bjoto commented 6 months ago

> > @nielstenboom Took your PR #125 for a spin -- same hang, unfortunately.
>
> Hmmm unfortunate :(
>
> Which k8s version are you guys running? One more thing you could try maybe is bumping the @kubernetes/client-node package? No clue if it'll help, just spewing random ideas here haha 😄

It's GKE: v1.27.7-gke.1056000

Random ideas are very much welcome! TY!

nikola-jokic commented 6 months ago

Hey @bjoto,

I am failing to reproduce the issue... Can you please add logs so we can trace where exactly the hook blocks? Also, during the execution of the job, could you inspect the pods and the values applied to them? Maybe the pod crashes and something strange happens, I don't know... Maybe modify execPodStep so we can see if the error happens during communication. What I have in mind is this:

export async function execPodStep(
  command: string[],
  podName: string,
  containerName: string,
  stdin?: stream.Readable
): Promise<void> {
  const exec = new k8s.Exec(kc)
  await new Promise(function (resolve, reject) {
    exec
      .exec(
        namespace(),
        podName,
        containerName,
        command,
        process.stdout,
        process.stderr,
        stdin ?? null,
        false /* tty */,
        resp => {
          // kube.exec returns an error if exit code is not 0, but we can't actually get the exit code
          if (resp.status === 'Success') {
            resolve(resp.code)
          } else {
            core.debug(
              JSON.stringify({
                message: resp?.message,
                details: resp?.details
              })
            )
            reject(resp?.message)
          }
        }
      )
      // eslint-disable-next-line github/no-then
      .catch(err => reject(err))
  })
}
bjoto commented 6 months ago

@nikola-jokic TY for spending time on this! I'll try the patch. Wdym by inspecting the pod? What data are you looking for? LMK, and I'll dig it out.

Some clarifications: the workflow job is running on a separate node from the runner. I'm exec'ing into the workflow pod to verify that the job has completed correctly. The pod is running, but idle.

bjoto commented 6 months ago

@nikola-jokic Re-ran with your patch, and same behavior -- the step hangs.

Here is the context of the 'BT' logs:

$ git g -A 2 -B 2 BT\ 
packages/k8s/src/hooks/prepare-job.ts
42-  await copyExternalsToRoot()
43-
44:  core.debug('BT strudel')
45-  let container: k8s.V1Container | undefined = undefined
46-  if (args.container?.image) {

packages/k8s/src/hooks/run-script-step.ts
23-  args.entryPoint = 'sh'
24-  args.entryPointArgs = ['-e', containerPath]
25:  core.debug('BT args')
26-  let arr = [args.entryPoint, ...args.entryPointArgs]
27-  arr.forEach(function(element) {
--
29-  });    
30-  
31:  core.debug('BT args end')
32-  try {
33-    await execPodStep(

packages/k8s/src/index.ts
25-    }
26-    let exitCode = 0
27:    core.debug(`BT woop run ${args} ${command}`)
28-    switch (command) {
29-      case Command.PrepareJob:
--
34-        return process.exit(0)
35-      case Command.RunScriptStep:
36:        core.debug(`BT Command.RunScriptStep pre ${args}`)
37-        await runScriptStep(args, state, null)
38:        core.debug(`BT Command.RunScriptStep post ${args}`)
39-        return process.exit(0)
40-      case Command.RunContainerStep:

Successful step:

2023-12-27T08:18:41.1933537Z ##[debug]BT woop run [object Object] run_script_step
2023-12-27T08:18:41.1934897Z ##[debug]BT Command.RunScriptStep pre [object Object]
2023-12-27T08:18:41.2775731Z ##[debug]BT args
2023-12-27T08:18:41.2786074Z sh
2023-12-27T08:18:41.2790621Z -e
2023-12-27T08:18:41.2791820Z /__w/_temp/8b338780-a490-11ee-bfac-6d5c9b08e4f1.sh
2023-12-27T08:18:41.2793781Z ##[debug]BT args end
2023-12-27T08:18:41.3986158Z total 1012
2023-12-27T08:18:41.3988069Z -rw-r--r--   1 root 123    496 Dec 27 08:07 COPYING
2023-12-27T08:18:41.3989335Z -rw-r--r--   1 root 123 102435 Dec 27 08:07 CREDITS
2023-12-27T08:18:41.3990690Z drwxr-sr-x  74 root 123   4096 Dec 27 08:09 Documentation
2023-12-27T08:18:41.3992062Z -rw-r--r--   1 root 123   2573 Dec 27 08:09 Kbuild
2023-12-27T08:18:41.3993371Z -rw-r--r--   1 root 123    555 Dec 27 08:09 Kconfig
2023-12-27T08:18:41.3994773Z drwxr-sr-x   6 root 123   4096 Dec 27 08:09 LICENSES
2023-12-27T08:18:41.3996729Z -rw-r--r--   1 root 123 733992 Dec 27 08:09 MAINTAINERS
2023-12-27T08:18:41.3998135Z -rw-r--r--   1 root 123  67315 Dec 27 08:09 Makefile
2023-12-27T08:18:41.3999489Z -rw-r--r--   1 root 123    727 Dec 27 08:09 README
2023-12-27T08:18:41.4000790Z drwxr-sr-x  23 root 123   4096 Dec 27 08:11 arch
2023-12-27T08:18:41.4002133Z drwxr-sr-x   3 root 123   4096 Dec 27 08:11 block
2023-12-27T08:18:41.4003515Z drwxr-sr-x   2 root 123   4096 Dec 27 08:11 certs
2023-12-27T08:18:41.4004905Z drwxr-sr-x   4 root 123   4096 Dec 27 08:11 crypto
2023-12-27T08:18:41.4006555Z drwxr-sr-x 143 root 123   4096 Dec 27 08:15 drivers
2023-12-27T08:18:41.4007861Z drwxr-sr-x  82 root 123   4096 Dec 27 08:15 fs
2023-12-27T08:18:41.4009009Z drwxr-sr-x  31 root 123   4096 Dec 27 08:16 include
2023-12-27T08:18:41.4010242Z drwxr-sr-x   2 root 123   4096 Dec 27 08:16 init
2023-12-27T08:18:41.4011530Z drwxr-sr-x   2 root 123   4096 Dec 27 08:16 io_uring
2023-12-27T08:18:41.4012919Z drwxr-sr-x   2 root 123   4096 Dec 27 08:16 ipc
2023-12-27T08:18:41.4014246Z drwxr-sr-x  22 root 123   4096 Dec 27 08:16 kernel
2023-12-27T08:18:41.4015938Z drwxr-sr-x  21 root 123  12288 Dec 27 08:16 lib
2023-12-27T08:18:41.4017242Z drwxr-sr-x   6 root 123   4096 Dec 27 08:16 mm
2023-12-27T08:18:41.4018555Z drwxr-sr-x  73 root 123   4096 Dec 27 08:17 net
2023-12-27T08:18:41.4019761Z drwxr-sr-x   7 root 123   4096 Dec 27 08:17 rust
2023-12-27T08:18:41.4021872Z drwxr-sr-x  41 root 123   4096 Dec 27 08:17 samples
2023-12-27T08:18:41.4023201Z drwxr-sr-x  17 root 123   4096 Dec 27 08:17 scripts
2023-12-27T08:18:41.4024771Z drwxr-sr-x  14 root 123   4096 Dec 27 08:17 security
2023-12-27T08:18:41.4026412Z drwxr-sr-x  27 root 123   4096 Dec 27 08:17 sound
2023-12-27T08:18:41.4027768Z drwxr-sr-x  43 root 123   4096 Dec 27 08:18 tools
2023-12-27T08:18:41.4029093Z drwxr-sr-x   4 root 123   4096 Dec 27 08:18 usr
2023-12-27T08:18:41.4030420Z drwxr-sr-x   4 root 123   4096 Dec 27 08:18 virt
2023-12-27T08:18:41.4073086Z ##[debug]BT Command.RunScriptStep post [object Object]
2023-12-27T08:18:41.4932244Z ##[debug]Finishing: Run ls -l

Hanging step:

2023-12-27T08:18:41.5979154Z ##[group]Run bash .github/scripts/patches.sh |& tee /build/run.log
2023-12-27T08:18:41.5980856Z ESC[36;1mbash .github/scripts/patches.sh |& tee /build/run.logESC[0m
2023-12-27T08:18:41.5982659Z shell: bash -leo pipefail {0}
2023-12-27T08:18:41.5983477Z ##[endgroup]
2023-12-27T08:18:41.5987726Z ##[debug]Overwrite 'shell' base on job defaults.
2023-12-27T08:18:41.6050359Z ##[debug]bash -leo pipefail /__w/_temp/ed221d29-fb43-40f8-a3a5-515cb54834ef.sh
2023-12-27T08:18:41.6725938Z ##[group]Run '/home/runner/k8s/index.js'
2023-12-27T08:18:41.6729283Z shell: /home/runner/externals/node16/bin/node {0}
2023-12-27T08:18:41.6733009Z ##[endgroup]
2023-12-27T08:18:41.6783784Z ##[debug]/home/runner/externals/node16/bin/node /home/runner/k8s/index.js
2023-12-27T08:18:43.1840016Z ##[debug]BT woop run [object Object] run_script_step
2023-12-27T08:18:43.1844066Z ##[debug]BT Command.RunScriptStep pre [object Object]
2023-12-27T08:18:43.1898609Z ##[debug]BT args
2023-12-27T08:18:43.1903980Z sh
2023-12-27T08:18:43.1905849Z -e
2023-12-27T08:18:43.1906888Z /__w/_temp/8c576e10-a490-11ee-962b-ad34d67a6ae7.sh
2023-12-27T08:18:43.1908466Z ##[debug]BT args end
2023-12-27T08:18:43.2769477Z + set -euo pipefail
2023-12-27T08:18:43.2773362Z ++ dirname .github/scripts/patches.sh
2023-12-27T08:18:43.2775044Z + d=.github/scripts
2023-12-27T08:18:43.2775925Z + . .github/scripts/helpers.sh
2023-12-27T08:18:43.2814751Z ++ git log -1 --pretty=%H .github/scripts/helpers.sh
2023-12-27T08:18:43.2959880Z + basesha=972ec36094a51b0f856ad707e6513db54e2135ea
2023-12-27T08:18:43.2962454Z + patches=($(git rev-list --reverse ${basesha}..HEAD))
2023-12-27T08:18:43.2963666Z ++ git rev-list --reverse 972ec36094a51b0f856ad707e6513db54e2135ea..HEAD
2023-12-27T08:18:43.3124806Z + patch_tot=2
2023-12-27T08:18:43.3125820Z + rc=0
2023-12-27T08:18:43.3126293Z + cnt=1
2023-12-27T08:18:43.3157369Z + parallel -j 3 --colsep=, bash .github/scripts/patch_tester.sh '{1}' '{2}' '{3}' :::: /dev/fd/63
2023-12-27T08:18:43.3159113Z ++ for i in "${patches[@]}"
2023-12-27T08:18:43.3160225Z ++ echo de24a958dd1ef6a9347d4645084756f55f29455f,1,2
2023-12-27T08:18:43.3161277Z ++ cnt=2
2023-12-27T08:18:43.3161911Z ++ for i in "${patches[@]}"
2023-12-27T08:18:43.3162996Z ++ echo 6ef5e665c4a16eb41cf5a2c512c8a3caf82d8cae,2,2
2023-12-27T08:18:43.3164094Z ++ cnt=3

It never gets to the "BT... post" log, and no exceptions are thrown.

nikola-jokic commented 6 months ago

Did you try executing the script manually with these inputs? It does look from the log that the script hangs, but is it possible that the script does not exit with these inputs? Can you please try executing any other simple script? This does not indicate any problem with the run script step, especially since the websocket does not throw. Please let me know if another script executes successfully.

bjoto commented 6 months ago

@nikola-jokic

> Did you try executing the script manually with these inputs? It does look from the log that the script hangs, but is it possible that the script does not exit with these inputs?

I kubectl-exec into the pod/container and verify that the script has finished successfully (it's not running, and the logs from the script contain the correct output), but for some reason execPodStep never returns. Note that this script/step works when running on a dedicated runner without ARC/K8s.

> Please let me know if another script executes successfully.

A simpler script, e.g. https://github.com/bjoto/linux/actions/runs/7356297121/job/20026212978, completes from the runner's perspective.

nikola-jokic commented 6 months ago

I have tried not using the promise, but inspecting the source code, it does seem like we should only catch the websocket error the way I did it in the previous comment... Did you try changing the tty parameter? If the script works on the regular runner, is it possible that resource requirements are a problem for the job pod? Can you monitor the job pod to see if there are any interruptions?

bjoto commented 6 months ago

Progress! The step that hangs doesn't provide any output to stdout/stderr for ~1.5h. So, I tried adding a keepalive:

# Background keepalive: print a character every couple of seconds so the
# exec stream between the runner hook and the workflow pod never goes idle.
(while true ; do sleep 2; echo .; done) &
progress=$!

# ~1.5h of work that produces no stdout/stderr output

kill $progress

And the step succeeds! So it seems like the exec socket times out without notifying the caller, similar to what SSH connections can experience. It's an ugly workaround, but a workaround nonetheless.
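
Folded into the Run checks step of the workflow above, it looks roughly like this (a sketch; the trap is an addition here so the keepalive is also killed if the script fails):

      - name: Run checks
        run: |
          # Keep the exec stream busy while patches.sh runs without output.
          (while true; do sleep 2; echo .; done) &
          progress=$!
          trap 'kill $progress' EXIT
          bash .github/scripts/patches.sh |& tee /build/run.log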

bjoto commented 6 months ago

To summarize: this seems to be related to how kubectl exec behaves. The workaround above (making sure that the channel stays active) is sufficient for me to move forward with using ARC.

A quick googling shows that kubectl-exec timeouts for long-running jobs are not uncommon. Potential fixes, like changing the kubelet config (streamingConnectionIdleTimeout and similar, see https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1), might work, but AFAIU this is not something that can easily be changed on a managed k8s offering like GKE Autopilot.
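
For reference, on a cluster where the kubelet config can be changed, the knob looks roughly like this (a sketch, untested here; the duration just needs to exceed the longest output-free stretch of a step):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Maximum time an exec/attach/port-forward stream may sit idle before the
# kubelet closes it.
streamingConnectionIdleTimeout: "6h"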

@nikola-jokic I'm open to closing this issue for now. I think that as long as we're depending on kubectl-exec we can experience issues like this. :-(

nikola-jokic commented 6 months ago

I agree with you; this is really related to Kubernetes and the node client that we depend on... If the websocket does not throw and the callback is never called, there is nothing we can do in the hook. :disappointed: