philrz opened 4 months ago
Yesterday I did a bunch of runs to try to establish repro patterns and see if I could catch it in the act. Angles I pursued and findings:
I did 40 separate runs of "Zui CI" at commit d7c1c3a, out of which it failed 2 times with the symptom shown above. I've added these incidents to the bottom of the table in the issue's opening text.
I did a looping repro attempt on my MacBook using the following script, which covers the same commands leading up to the repro in CI. It ran successfully 404 times without hanging before I stopped it for the night.
```bash
#!/bin/bash
set -e                              # stop the loop on any unexpected failure
. "${NVM_DIR:-$HOME/.nvm}/nvm.sh"   # make nvm available in a non-interactive shell

NUM=1
while true; do
  echo "Run #: $NUM"
  echo "=============="
  git clone https://github.com/brimdata/zui.git
  cd zui
  nvm use "$(cat .node-version)"
  yarn --inline-builds
  cd ..
  rm -rf zui
  NUM=$((NUM + 1))
done
```
I looped the same repro script after having used ngrok-ssh to start an interactive login on a GitHub Actions runner running macos-12 (the same image "Zui CI" runs on). It ran successfully 221 times without hanging before I stopped it for the night.
I did 20 separate one-at-a-time ngrok-ssh logins to GitHub Actions macos-12 runners, each executing the same commands below leading up to the repro in CI. It did not hang once.
```bash
git clone https://github.com/brimdata/zui.git
cd zui
nvm install $(cat .node-version)
nvm use $(cat .node-version)
yarn --inline-builds
```
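Given the low repro rate, one option worth considering (my suggestion, not something the workflow does today) is wrapping the suspect step in a watchdog so a silent hang turns into a prompt, loggable failure instead of burning the full 6-hour Actions job limit. A minimal sketch using GNU `timeout` (on a macOS runner it may be available as `gtimeout` after `brew install coreutils`); `sleep 5` stands in for `yarn --inline-builds` so the example is self-contained:

```shell
#!/bin/sh
# Watchdog sketch (hypothetical; not part of the current "Zui CI" workflow):
# run the suspect command under a hard deadline. GNU `timeout` exits 124
# when the deadline is hit, so the two outcomes are distinguishable in logs.
run_with_deadline() {
  deadline="$1"; shift
  if timeout "$deadline" "$@"; then
    echo "completed: $*"
  else
    echo "timed out or failed (exit $?): $*"
  fi
}

# Demo: `sleep 5` stands in for the real command; the 1-second deadline
# forces the timeout path so the wrapper's behavior is visible.
run_with_deadline 1 sleep 5
```

In a real workflow step this would be something like `run_with_deadline 15m yarn --inline-builds`, or equivalently a `timeout-minutes` setting on the step itself.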
This all unfortunately doesn't do a whole lot to narrow it down. It does look pretty certain that the repro is unique to macOS. The fact that it didn't repro in a loop on my local MacBook nor on a single Actions runner leads me to speculate that the essential ingredients for repro could be:
* Somehow related to the job landing on a "bad" runner. That said, the experiences I've had with "bad" runners in the past were usually more random and one-off (e.g., failures to load cache, network drops, etc.), not a symptom like this that quietly hangs in the same spot.
* Something about the workflow setup other than the essential commands in my looping script. The hang comes so early in the job that there's not a whole lot it could be, but to be meticulous, there are things I could correct for, like how it does the Go/Node installations through separate workflow steps, or that it runs jongwooo/next-cache@v1.
The fact I wasn't able to catch it with my 20 manual ngrok-ssh interactive logins is disappointing, but given the low repro rate that's not altogether surprising, and if there's something specific about it being run as a "Zui CI" job, I might be wasting my time with that approach. For my next attempt I'll look at grafting ngrok-ssh onto "Zui CI" in hopes I can catch it that way.
I've got a branch zui-3059-debug rigged up to start ngrok-ssh before doing the rest of the "Zui CI" setup steps. I ran it 41 times without it hanging once. It seems this problem doesn't want to be caught in the act, or it magically fixed itself. I'm going to pause chasing it for the moment and will resume if it starts flaring up again.
In the time since #2972 merged, I've started to notice a repeating intermittent failure in CI during the "Setup Zui" Actions workflow in the `yarn --inline-builds` step. For example, here's how it looked in the most recent failure, which was during a "Create Insiders Release" run on macOS. The job ultimately gets killed by Actions after hanging for 6 hours.
By comparison, here's how the same step proceeded to completion on the prior successful Create Insiders Release run on macOS:
I have no idea what it's hanging on in the failure case, but at minimum I wanted to open this bug to have a place to start logging incidents and look for patterns. I expect I'll start looking into debug approaches, such as increasing log verbosity and/or running the steps manually/interactively on the runner so I can watch `top` and see if I can catch a hanging process. Looking back over recent Actions runs, here are the other incidents I see of the same symptom.
In conclusion, thus far it's only been observed on macOS, though the sample size is still small.
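Since interactive logins haven't caught it in the act, another angle (my assumption about approach; the process-name patterns are illustrative) is a background monitor baked into the workflow that periodically snapshots the relevant processes, so the job log itself records what was alive if a hang recurs:

```shell
#!/bin/sh
# Hypothetical background monitor: snapshot yarn/node processes so that a
# hung run's log shows what was running at the time. On macOS, `sample <pid>`
# could additionally dump a call-stack profile of a stuck process.
snapshot_processes() {
  echo "--- process snapshot at $(date) ---"
  # Bracketing the first letter keeps grep from matching its own process.
  ps aux | grep -E '[y]arn|[n]ode' || echo "(no yarn/node processes found)"
}

snapshot_processes
```

In a workflow this could run in a loop in a backgrounded step (e.g., every 60 seconds), costing nothing when the job succeeds but leaving a trail when it hangs.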