Closed: bjoto closed this issue 6 months ago
It's 100% reproducible. Please let me know how I can help debug the issue. This is the last blocker (famous last words ;-)) to moving our dedicated runners to ARC/GKE.
I'm thinking it might be related to https://github.com/actions/runner-container-hooks/issues/107
I thought so as well; I added the try/catch with logging, but it never got hit, unfortunately. :-(
Patch:
diff --git a/packages/k8s/src/k8s/index.ts b/packages/k8s/src/k8s/index.ts
index aac844097c59..f2b3b58457e6 100644
--- a/packages/k8s/src/k8s/index.ts
+++ b/packages/k8s/src/k8s/index.ts
@@ -227,30 +227,34 @@ export async function execPodStep(
): Promise<void> {
const exec = new k8s.Exec(kc)
await new Promise(async function (resolve, reject) {
- await exec.exec(
- namespace(),
- podName,
- containerName,
- command,
- process.stdout,
- process.stderr,
- stdin ?? null,
- false /* tty */,
- resp => {
- // kube.exec returns an error if exit code is not 0, but we can't actually get the exit code
- if (resp.status === 'Success') {
- resolve(resp.code)
- } else {
- core.debug(
- JSON.stringify({
- message: resp?.message,
- details: resp?.details
- })
- )
- reject(resp?.message)
- }
- }
- )
+ try {
+ await exec.exec(
+ namespace(),
+ podName,
+ containerName,
+ command,
+ process.stdout,
+ process.stderr,
+ stdin ?? null,
+ false /* tty */,
+ resp => {
+ // kube.exec returns an error if exit code is not 0, but we can't actually get the exit code
+ if (resp.status === 'Success') {
+ resolve(resp.code)
+ } else {
+ core.debug(
+ JSON.stringify({
+ message: resp?.message,
+ details: resp?.details
+ })
+ )
+ reject(resp?.message)
+ }
+ }
+ )
+ } catch (error) {
+ core.error(`BT Failed to exec pod step: ${error}`)
+ }
})
}
I'm far from a websocket/typescript person, so I'm very much out on a limb here...
Maybe you could try to extend the catch with this:
catch (error) {
core.error(`BT Failed to exec pod step: ${error}`)
reject(error)
}
And see if that does it?
The try-catch there shouldn't be the problem, since it can only fire if the websocket communication fails (naive opinion for now; I haven't found time to actually work on it, but I'm hoping to soon)
I'm hoping it is something much simpler, like changing tty to true. But if you find the problem before me, I would be happy to review it :smile:!
@nielstenboom I'll spin up a test first thing tomorrow! Note that I did not see the error logging, meaning I didn't land in the catch... :-(
@nikola-jokic I was trying to figure out if I could somehow log in the node if the websocket went down, but I couldn't wrap my head around it.
Feel free to hit me up with snippets to try, and I'll take them for a spin!
@nielstenboom Took your PR #125 for a spin -- same hang, unfortunately.
Hmmm unfortunate :(
Which k8s version are you guys running? One more thing you could try maybe is bumping the @kubernetes/client-node package? No clue if it'll help, just spewing random ideas here haha 😄
It's GKE: v1.27.7-gke.1056000
Random ideas are very much welcome! TY!
Hey @bjoto,
I am failing to reproduce the issue... Can you please add logs so we can trace where exactly in the hook we block? Also, during the execution of the job, could you inspect the pods and the values applied to them? Maybe the pod crashes and something strange happens, I don't know... Maybe modify execPodStep so we can see if the error is during communication. What I have in mind is this:
export async function execPodStep(
command: string[],
podName: string,
containerName: string,
stdin?: stream.Readable
): Promise<void> {
const exec = new k8s.Exec(kc)
await new Promise(function (resolve, reject) {
exec
.exec(
namespace(),
podName,
containerName,
command,
process.stdout,
process.stderr,
stdin ?? null,
false /* tty */,
resp => {
// kube.exec returns an error if exit code is not 0, but we can't actually get the exit code
if (resp.status === 'Success') {
resolve(resp.code)
} else {
core.debug(
JSON.stringify({
message: resp?.message,
details: resp?.details
})
)
reject(resp?.message)
}
}
)
// eslint-disable-next-line github/no-then
.catch(err => reject(err))
})
}
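Since the symptom is a promise that never settles (neither resolve nor reject fires), a temporary watchdog around the exec promise could at least turn the silent hang into a visible error. This is an illustrative sketch only -- `withTimeout` is a hypothetical helper, not part of runner-container-hooks -- and a hard timeout obviously doesn't suit hour-long steps, so it's a diagnostic aid rather than a fix:

```typescript
// Hypothetical diagnostic helper: race a promise against a timer so that a
// silently dropped websocket surfaces as an error instead of an endless hang.
async function withTimeout<T>(p: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const watchdog = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`exec did not settle within ${timeoutMs} ms`)),
      timeoutMs
    )
  })
  try {
    // Whichever settles first wins; a never-settling exec becomes a rejection.
    return await Promise.race([p, watchdog])
  } finally {
    if (timer !== undefined) clearTimeout(timer)
  }
}
```

Wrapping the call site as `await withTimeout(execPodStep(...), someLargeTimeout)` would at least distinguish "callback never fired" from "callback fired but the hook didn't exit".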
@nikola-jokic TY for spending time on this! I'll try the patch. Wdym by inspecting the pod? What data are you looking for? LMK, and I'll dig it out.
Some clarifications: The workflow job is running on a separate node from the runner. I'm exec-ing into the workflow container to verify that the job has completed correctly. The pod is running, but idle.
@nikola-jokic Re-ran with your patch, and same behavior -- the step hangs.
Here is the context of the 'BT' logs:
$ git g -A 2 -B 2 BT
packages/k8s/src/hooks/prepare-job.ts
42- await copyExternalsToRoot()
43-
44: core.debug('BT strudel')
45- let container: k8s.V1Container | undefined = undefined
46- if (args.container?.image) {
packages/k8s/src/hooks/run-script-step.ts
23- args.entryPoint = 'sh'
24- args.entryPointArgs = ['-e', containerPath]
25: core.debug('BT args')
26- let arr = [args.entryPoint, ...args.entryPointArgs]
27- arr.forEach(function(element) {
--
29- });
30-
31: core.debug('BT args end')
32- try {
33- await execPodStep(
packages/k8s/src/index.ts
25- }
26- let exitCode = 0
27: core.debug(`BT woop run ${args} ${command}`)
28- switch (command) {
29- case Command.PrepareJob:
--
34- return process.exit(0)
35- case Command.RunScriptStep:
36: core.debug(`BT Command.RunScriptStep pre ${args}`)
37- await runScriptStep(args, state, null)
38: core.debug(`BT Command.RunScriptStep post ${args}`)
39- return process.exit(0)
40- case Command.RunContainerStep:
Successful step:
2023-12-27T08:18:41.1933537Z ##[debug]BT woop run [object Object] run_script_step
2023-12-27T08:18:41.1934897Z ##[debug]BT Command.RunScriptStep pre [object Object]
2023-12-27T08:18:41.2775731Z ##[debug]BT args
2023-12-27T08:18:41.2786074Z sh
2023-12-27T08:18:41.2790621Z -e
2023-12-27T08:18:41.2791820Z /__w/_temp/8b338780-a490-11ee-bfac-6d5c9b08e4f1.sh
2023-12-27T08:18:41.2793781Z ##[debug]BT args end
2023-12-27T08:18:41.3986158Z total 1012
2023-12-27T08:18:41.3988069Z -rw-r--r-- 1 root 123 496 Dec 27 08:07 COPYING
2023-12-27T08:18:41.3989335Z -rw-r--r-- 1 root 123 102435 Dec 27 08:07 CREDITS
2023-12-27T08:18:41.3990690Z drwxr-sr-x 74 root 123 4096 Dec 27 08:09 Documentation
2023-12-27T08:18:41.3992062Z -rw-r--r-- 1 root 123 2573 Dec 27 08:09 Kbuild
2023-12-27T08:18:41.3993371Z -rw-r--r-- 1 root 123 555 Dec 27 08:09 Kconfig
2023-12-27T08:18:41.3994773Z drwxr-sr-x 6 root 123 4096 Dec 27 08:09 LICENSES
2023-12-27T08:18:41.3996729Z -rw-r--r-- 1 root 123 733992 Dec 27 08:09 MAINTAINERS
2023-12-27T08:18:41.3998135Z -rw-r--r-- 1 root 123 67315 Dec 27 08:09 Makefile
2023-12-27T08:18:41.3999489Z -rw-r--r-- 1 root 123 727 Dec 27 08:09 README
2023-12-27T08:18:41.4000790Z drwxr-sr-x 23 root 123 4096 Dec 27 08:11 arch
2023-12-27T08:18:41.4002133Z drwxr-sr-x 3 root 123 4096 Dec 27 08:11 block
2023-12-27T08:18:41.4003515Z drwxr-sr-x 2 root 123 4096 Dec 27 08:11 certs
2023-12-27T08:18:41.4004905Z drwxr-sr-x 4 root 123 4096 Dec 27 08:11 crypto
2023-12-27T08:18:41.4006555Z drwxr-sr-x 143 root 123 4096 Dec 27 08:15 drivers
2023-12-27T08:18:41.4007861Z drwxr-sr-x 82 root 123 4096 Dec 27 08:15 fs
2023-12-27T08:18:41.4009009Z drwxr-sr-x 31 root 123 4096 Dec 27 08:16 include
2023-12-27T08:18:41.4010242Z drwxr-sr-x 2 root 123 4096 Dec 27 08:16 init
2023-12-27T08:18:41.4011530Z drwxr-sr-x 2 root 123 4096 Dec 27 08:16 io_uring
2023-12-27T08:18:41.4012919Z drwxr-sr-x 2 root 123 4096 Dec 27 08:16 ipc
2023-12-27T08:18:41.4014246Z drwxr-sr-x 22 root 123 4096 Dec 27 08:16 kernel
2023-12-27T08:18:41.4015938Z drwxr-sr-x 21 root 123 12288 Dec 27 08:16 lib
2023-12-27T08:18:41.4017242Z drwxr-sr-x 6 root 123 4096 Dec 27 08:16 mm
2023-12-27T08:18:41.4018555Z drwxr-sr-x 73 root 123 4096 Dec 27 08:17 net
2023-12-27T08:18:41.4019761Z drwxr-sr-x 7 root 123 4096 Dec 27 08:17 rust
2023-12-27T08:18:41.4021872Z drwxr-sr-x 41 root 123 4096 Dec 27 08:17 samples
2023-12-27T08:18:41.4023201Z drwxr-sr-x 17 root 123 4096 Dec 27 08:17 scripts
2023-12-27T08:18:41.4024771Z drwxr-sr-x 14 root 123 4096 Dec 27 08:17 security
2023-12-27T08:18:41.4026412Z drwxr-sr-x 27 root 123 4096 Dec 27 08:17 sound
2023-12-27T08:18:41.4027768Z drwxr-sr-x 43 root 123 4096 Dec 27 08:18 tools
2023-12-27T08:18:41.4029093Z drwxr-sr-x 4 root 123 4096 Dec 27 08:18 usr
2023-12-27T08:18:41.4030420Z drwxr-sr-x 4 root 123 4096 Dec 27 08:18 virt
2023-12-27T08:18:41.4073086Z ##[debug]BT Command.RunScriptStep post [object Object]
2023-12-27T08:18:41.4932244Z ##[debug]Finishing: Run ls -l
Hanging step:
2023-12-27T08:18:41.5979154Z ##[group]Run bash .github/scripts/patches.sh |& tee /build/run.log
2023-12-27T08:18:41.5980856Z ESC[36;1mbash .github/scripts/patches.sh |& tee /build/run.logESC[0m
2023-12-27T08:18:41.5982659Z shell: bash -leo pipefail {0}
2023-12-27T08:18:41.5983477Z ##[endgroup]
2023-12-27T08:18:41.5987726Z ##[debug]Overwrite 'shell' base on job defaults.
2023-12-27T08:18:41.6050359Z ##[debug]bash -leo pipefail /__w/_temp/ed221d29-fb43-40f8-a3a5-515cb54834ef.sh
2023-12-27T08:18:41.6725938Z ##[group]Run '/home/runner/k8s/index.js'
2023-12-27T08:18:41.6729283Z shell: /home/runner/externals/node16/bin/node {0}
2023-12-27T08:18:41.6733009Z ##[endgroup]
2023-12-27T08:18:41.6783784Z ##[debug]/home/runner/externals/node16/bin/node /home/runner/k8s/index.js
2023-12-27T08:18:43.1840016Z ##[debug]BT woop run [object Object] run_script_step
2023-12-27T08:18:43.1844066Z ##[debug]BT Command.RunScriptStep pre [object Object]
2023-12-27T08:18:43.1898609Z ##[debug]BT args
2023-12-27T08:18:43.1903980Z sh
2023-12-27T08:18:43.1905849Z -e
2023-12-27T08:18:43.1906888Z /__w/_temp/8c576e10-a490-11ee-962b-ad34d67a6ae7.sh
2023-12-27T08:18:43.1908466Z ##[debug]BT args end
2023-12-27T08:18:43.2769477Z + set -euo pipefail
2023-12-27T08:18:43.2773362Z ++ dirname .github/scripts/patches.sh
2023-12-27T08:18:43.2775044Z + d=.github/scripts
2023-12-27T08:18:43.2775925Z + . .github/scripts/helpers.sh
2023-12-27T08:18:43.2814751Z ++ git log -1 --pretty=%H .github/scripts/helpers.sh
2023-12-27T08:18:43.2959880Z + basesha=972ec36094a51b0f856ad707e6513db54e2135ea
2023-12-27T08:18:43.2962454Z + patches=($(git rev-list --reverse ${basesha}..HEAD))
2023-12-27T08:18:43.2963666Z ++ git rev-list --reverse 972ec36094a51b0f856ad707e6513db54e2135ea..HEAD
2023-12-27T08:18:43.3124806Z + patch_tot=2
2023-12-27T08:18:43.3125820Z + rc=0
2023-12-27T08:18:43.3126293Z + cnt=1
2023-12-27T08:18:43.3157369Z + parallel -j 3 --colsep=, bash .github/scripts/patch_tester.sh '{1}' '{2}' '{3}' :::: /dev/fd/63
2023-12-27T08:18:43.3159113Z ++ for i in "${patches[@]}"
2023-12-27T08:18:43.3160225Z ++ echo de24a958dd1ef6a9347d4645084756f55f29455f,1,2
2023-12-27T08:18:43.3161277Z ++ cnt=2
2023-12-27T08:18:43.3161911Z ++ for i in "${patches[@]}"
2023-12-27T08:18:43.3162996Z ++ echo 6ef5e665c4a16eb41cf5a2c512c8a3caf82d8cae,2,2
2023-12-27T08:18:43.3164094Z ++ cnt=3
It never gets to the "BT ... post" log, and there are no exceptions.
Did you try executing the script manually with these inputs? It does look from the log like the script hangs, but is it possible that the script does not exit with these inputs? Can you please try executing any other simple script? This does not indicate a problem with the run script step, especially since the websocket does not throw. Please let me know if another script executes successfully.
@nikola-jokic
Did you try executing the script manually with these inputs? It does look from the log that the script hangs, but is it possible that the script does not exit with these inputs?
I kubectl-exec into the pod/container and verify that the script has finished successfully (it's not running, and the logs from the script contain correct output), but for some reason execPodStep never returns. Note that this script/step works when running on a dedicated runner, i.e. without ARC/K8s.
Please let me know if other script executes successfully.
A simpler script, e.g. https://github.com/bjoto/linux/actions/runs/7356297121/job/20026212978, completes from a runner perspective.
I have tried not using the promise, but from inspecting the source code it does seem like we can only catch the websocket error the way I did in the previous comment... Did you try changing the tty parameter? If the script works on the regular runner, is it possible that resource requirements are a problem for the job pod? Can you monitor the job pod to see if there are any interruptions?
Progress! The step that hangs doesn't provide any output to stdout/stderr for ~1.5h. So, I tried adding a keepalive:
(while true ; do sleep 2; echo .; done) &
progress=$!
# 1.5h work
kill $progress
And the step succeeds! So it seems the exec websocket times out without notifying the caller, similar to what idle SSH sessions can experience. Ugly, but it works as a workaround.
To summarize: this seems to be related to how kubectl-exec behaves. The workaround above (keeping the channel active) is sufficient for me to move forward with using ARC.
A quick googling shows that kubectl-exec timeouts for long-running jobs are not uncommon. Potential fixes, like changing the kubelet config (https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1, streamingConnectionIdleTimeout and similar), might work, but AFAIU this is not something that can be easily changed on managed k8s like GKE/Autopilot.
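For reference, the keepalive shell loop above can be expressed generically on the hook side as well. This is an illustrative sketch, not hook code -- `withHeartbeat` is a hypothetical helper -- showing the same idea: tick on stdout while long-running work is pending so the exec stream keeps carrying traffic and idle timeouts along the path don't silently drop it:

```typescript
// Hypothetical helper mirroring the shell keepalive: write a periodic tick
// while a long-running promise is pending, then stop once it settles.
async function withHeartbeat<T>(
  work: Promise<T>,
  intervalMs = 2000,
  write: (s: string) => void = s => process.stdout.write(s)
): Promise<T> {
  const timer = setInterval(() => write('.\n'), intervalMs)
  try {
    return await work
  } finally {
    clearInterval(timer) // stop ticking whether work resolved or rejected
  }
}
```

The `write` parameter only exists so the helper can be exercised without touching the real stdout; in practice the ticks would go to the exec stream just like the shell loop's echoes.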
@nikola-jokic I'm open to closing this issue for now. I think that as long as we're depending on kubectl-exec we can experience issues like this. :-(
I agree with you; this is really related to Kubernetes and the node client that we depend on... If the websocket does not throw and the callback is never called, there is nothing we can do in the hook. :disappointed:
I'm running ARC in containerMode: kubernetes, and for some workflow steps, the step never completes.
This is deployed on GKE (Google K8s Engine) with Autopilot.
Workflow:
The last "Run checks" step never completes. I've done a build of the Runner with additional logging, and I can verify that Command.RunScriptStep in packages/k8s/src/index.ts never completes -- i.e. the hook script /home/runner/k8s/index.js never returns. When I exec into the workflow container, I can verify that the script has completed and is not running any more.
The "Run check" step takes ~1h to complete.
Expected:
Helm setup: