Open philrz opened 6 months ago
As an idea to further debug this, I've been testing on the zui-3021-debug branch where I've created a wrapper script around the calls to zq
. I figured with this approach I could separate out the stdout and stderr returned from zq
and then pass them upstream to the calling app. A couple things I thought this might enable:
The mere presence of the wrapper script seems to make the test pass consistently.
https://github.com/brimdata/zui/actions/runs/8090373897 was a baseline run where I'd just created the branch based off a recent tip of main
and had made just the changes in 8a2f27e to run only the bad data displays an error message
e2e test in a loop with zq
being invoked directly as normal. This happened to produce 24 failures out of 100 runs just like the previous baseline I'd described in the opening comments of this issue.
https://github.com/brimdata/zui/actions/runs/8102744996 was the run that started from that baseline and also included all changes shown here to introduce the wrapper script and have it called in lieu of directly calling zq
. The test ran successfully all 100 times.
I'm not sure what conclusions to draw from that result. 🤔
tl;dr
When Zui calls on
zq
in a way that's expected to fail (such as in Preview & Load when trying to read a file that is not of a format that Zed can read via auto-detect) the expected error message sometimes is not surfaced.Details
Repro is with Zui commit 57afe04.
The repro in the video below is taken on a Linux VM run in VirtualBox. The file that's being opened in Zui is the same soccer-ball.png that's opened by this e2e test which has been described as "flaky":
https://github.com/brimdata/zui/blob/57afe0458ff84182d4e392aada10f1d4608911da/packages/zui-player/tests/pool-loads.spec.ts#L42-L46
In the video I begin the process to load the file in Preview & Load several times. While the auto-detect failure is shown in the app most times, you can see on both the second and final attempts that no error message is shown.
https://github.com/brimdata/zui/assets/5934157/767269db-3530-4031-a6a9-2edbd12338f0
What led me to repro this manually is that I'd been studying that "flaky test" shown above via a branch where I ran just that test repeatedly in CI and found it failed 24 times out of 100 in a manner similar to what I showed in my manual repro video.
Based on what's shown here I suspect this may share a root cause with a separate Jest test that's been known to fail occasionally in CI that's been on my list to study:
https://github.com/brimdata/zui/blob/57afe0458ff84182d4e392aada10f1d4608911da/packages/zed-node/src/zq.test.ts#L106-L115
On the same Linux VM where I did the repro/video above I triggered a failure via this loop:
To get to a quicker repro, I deleted all the other tests so just this one "zq with a bad zed" test would run, and I also first started a
cat /dev/random > /dev/null &
for every core on my VM since keeping the system under load seems to force the repro quicker (presumably because heavy load aggravates timing issues). With this approach I got 3 repros out of 218 runs, with this being an example of a failure:The rejection of the promise appears to be a prerequisite for seeing this error, which leads me to guess that in my manual repro in Preview & Load the lack of visible error message is similarly due to the promise resolving like we see here.
Conclusion
When I reviewed the latter test failure with @jameskerr in the past he acknowledged that the code in zq.ts that deals with the callouts to
zq
is a little complex and probably tricky to debug. But now that we've confirmed a user-facing bug is reproducible and it's not just a "flaky test", it seems worthwhile to invest some cycles to look closer at this.