CI hang during "yarn --inline-builds"

philrz commented 4 months ago

In the time after #2972 merged, I've started to notice a repeating intermittent failure in CI during the "Setup Zui" Actions Workflow in the yarn --inline-builds step. For example, here's how it looked in the most recent failure which was during a "Create Insiders Release" run on macOS:

Link step
  ➤ YN0007: │ @swc/core@npm:1.3.41 must be built because it never has been before or the last one failed
  ➤ YN0007: │ esbuild@npm:0.17.12 must be built because it never has been before or the last one failed
  ➤ YN0007: │ esbuild@npm:0.18.14 must be built because it never has been before or the last one failed
  ➤ YN0007: │ @parcel/watcher@npm:2.0.4 must be built because it never has been before or the last one failed
  ➤ YN0007: │ brimcap@https://github.com/brimdata/brimcap.git#commit=bf7fb4996738767bb4f27eee939ec67dd21aab52 must be built because it never has been before or the last one failed
  ➤ YN0007: │ electron@npm:28.0.0 must be built because it never has been before or the last one failed
  ➤ YN0007: │ keytar@npm:7.7.0 must be built because it never has been before or the last one failed
  ➤ YN0007: │ msw@npm:0.36.8 must be built because it never has been before or the last one failed
  ➤ YN0007: │ styled-components@npm:5.3.5 [cb7c7] must be built because it never has been before or the last one failed
  ➤ YN0007: │ playwright-chromium@npm:1.41.1 must be built because it never has been before or the last one failed
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT Downloading Chromium 121.0.6[167](https://github.com/brimdata/zui/actions/runs/8842643153/job/24281616674#step:3:177).57 (playwright build v1097) from https://playwright.azureedge.net/builds/chromium/1097/chromium-mac.zip
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |                                                                                |   0% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■                                                                        |  10% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■                                                                |  20% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■                                                        |  30% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                                |  40% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                        |  50% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                |  60% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                        |  70% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                |  80% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■        |  90% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■| 100% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT Chromium 121.0.6167.57 (playwright build v1097) downloaded to /Users/runner/Library/Caches/ms-playwright/chromium-1097
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT Downloading FFMPEG playwright build v1009 from https://playwright.azureedge.net/builds/ffmpeg/1009/ffmpeg-mac.zip
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |                                                                                |   1% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■                                                                        |  11% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■                                                                |  20% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■                                                        |  30% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                                |  40% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                        |  51% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                |  60% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                        |  70% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                |  80% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■        |  91% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■| 100% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT FFMPEG playwright build v1009 downloaded to /Users/runner/Library/Caches/ms-playwright/ffmpeg-1009
  Error: The operation was canceled.

The job ultimately gets killed by Actions after hanging for 6 hours.

By comparison, here's how the same step proceeded to completion on the prior successful Create Insiders Release run on macOS:

Link step
  ➤ YN0007: │ @swc/core@npm:1.3.41 must be built because it never has been before or the last one failed
  ➤ YN0007: │ esbuild@npm:0.17.12 must be built because it never has been before or the last one failed
  ➤ YN0007: │ esbuild@npm:0.18.14 must be built because it never has been before or the last one failed
  ➤ YN0007: │ @parcel/watcher@npm:2.0.4 must be built because it never has been before or the last one failed
  ➤ YN0007: │ brimcap@https://github.com/brimdata/brimcap.git#commit=bf7fb4996738767bb4f27eee939ec67dd21aab52 must be built because it never has been before or the last one failed
  ➤ YN0007: │ electron@npm:28.0.0 must be built because it never has been before or the last one failed
  ➤ YN0007: │ keytar@npm:7.7.0 must be built because it never has been before or the last one failed
  ➤ YN0007: │ msw@npm:0.36.8 must be built because it never has been before or the last one failed
  ➤ YN0007: │ styled-components@npm:5.3.5 [cb7c7] must be built because it never has been before or the last one failed
  ➤ YN0007: │ playwright-chromium@npm:1.41.1 must be built because it never has been before or the last one failed
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT Downloading Chromium 121.0.6167.57 (playwright build v1097) from https://playwright.azureedge.net/builds/chromium/1097/chromium-mac.zip
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |                                                                                |   0% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■                                                                        |  10% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■                                                                |  20% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■                                                        |  30% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                                |  40% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                        |  50% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                |  60% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                        |  70% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                |  80% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■        |  90% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■| 100% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT Chromium 121.0.6[167](https://github.com/brimdata/zui/actions/runs/8810538623/job/24183065715#step:3:177).57 (playwright build v1097) downloaded to /Users/runner/Library/Caches/ms-playwright/chromium-1097
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT Downloading FFMPEG playwright build v1009 from https://playwright.azureedge.net/builds/ffmpeg/1009/ffmpeg-mac.zip
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |                                                                                |   1% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■                                                                        |  11% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■                                                                |  20% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■                                                        |  30% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                                |  40% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                        |  51% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                |  60% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                        |  70% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                |  80% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■        |  91% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■| 100% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT FFMPEG playwright build v1009 downloaded to /Users/runner/Library/Caches/ms-playwright/ffmpeg-1009
  ➤ YN0007: │ zui@workspace:apps/zui must be built because it never has been before or the last one failed
  ➤ YN0007: │ nx@npm:16.10.0 [a0cba] must be built because it never has been before or the last one failed
  ➤ YN0000: │ zui@workspace:apps/zui STDOUT Downloading zdeps (skip with ZDEPS=false yarn)
  ➤ YN0000: │ zui@workspace:apps/zui STDOUT copied brimcap artifacts Version: v1.7.0
  ➤ YN0000: │ zui@workspace:apps/zui STDOUT copied Zed artifacts Version: v1.15.0-12-gec3f004c

I have no idea what it's hanging on in the failure case, but at minimum I wanted to open this bug to have a place to start logging incidents to look for patterns. I expect I'll start looking into debug approaches, such as maybe increasing log verbosity and/or running the steps manually/interactively on the runner so I can watch top and see if I can catch a hanging process.

Looking back over recent Actions runs, here's other incidents I see of the same symptom.

Run	Workflow	Platform
https://github.com/brimdata/zui/actions/runs/8804933973	Zui CI	macOS
https://github.com/brimdata/zui/actions/runs/8443742929	Zui CI	macOS
https://github.com/brimdata/zui/actions/runs/8851891223	Zui CI	macOS
https://github.com/brimdata/zui/actions/runs/8854325998	Zui CI	macOS

In conclusion, thus far it's only been observed on macOS, though we're still looking at small numbers.

philrz commented 4 months ago

Yesterday I did a bunch of runs to try to establish repro patterns and see if I could catch it in the act. Angles I pursued and findings:

I did 40 separate runs of "Zui CI" at commit d7c1c3a, out of which it failed 2 times with the symptom shown above. I've added these incidents to the bottom of the table in the issue's opening text.
I did a looping repro attempt on my Macbook using the following script that covers the same commands leading up to the repro in CI. It ran successfully 404 times without hanging before I stopped it for the night.

#!/bin/bash
NUM=1
while true; do
  echo "Run #: $NUM"
  echo "=============="
  git clone https://github.com/brimdata/zui.git
  cd zui
  nvm use $(cat .node-version)
  yarn --inline-builds
  cd ..
  rm -rf zui
  NUM=$(expr $NUM + 1)
done

I looped the same repro script after having used ngrok-ssh to start an interactive login with a GitHub Actions runner running macos-12 (just as "Zui CI" runs on). It ran successfully 221 times without hanging before I stopped it for the night.
I did 20 separate one-at-a-time ngrok-ssh logins to GitHub Actions macos-12 runners, each executing the same commands below leading up to the repro in CI. It did not hang once.

git clone https://github.com/brimdata/zui.git
cd zui
nvm install $(cat .node-version)
nvm use $(cat .node-version)
yarn --inline-builds

This all unfortunately doesn't do a whole lot to narrow it down. It does look pretty certain that the repro is unique to macOS. The fact it didn't repro through a loop locally on my Macbook nor on a single Actions Runner leads me to speculate that the essential ingredients for repro could be:

Somehow related to the job landing on a "bad" runner. That said, the experiences I've had with "bad" runners in the past were usually more random and on-off (e.g., failures to load cache, network drops, etc.) and not a symptom like this with quiet hangs in the same spot.
Something about the workflow setup other than the essential commands in my looping script. This hang is so early in the job that there's not a whole lot it could be, but if I want to be meticulous, there's things I could correct for like how it does the Go/Node installations through separate workflow steps or that it runs jongwooo/next-cache@v1.

The fact I wasn't able to catch it with my 20 manual ngrok-ssh interactive logins is disappointing, but the low repro rate makes that not altogether unsurprising, and if there's something specific about it being run as the a "Zui CI" job I might be wasting my time with that approach. For my next attempts I'll look at grafting nrgok-ssh onto "Zui CI" in hopes I can catch it that way.

philrz commented 4 months ago

I've got a branch zui-3059-debug rigged up to start ngrok-ssh before doing the rest of the "Zui CI" setup steps. I ran it 41 times without it hanging once. It seems this problem doesn't want to be caught in the act, or it magically fixed itself. I'm going to pause chasing it for the moment and will resume if it starts flaring up again.

brimdata / zui

CI hang during "yarn --inline-builds" #3059