Error message from zq sometimes not surfaced

tl;dr

When Zui calls on zq in a way that's expected to fail (such as in Preview & Load when trying to read a file that is not of a format that Zed can read via auto-detect) the expected error message sometimes is not surfaced.

Details

Repro is with Zui commit 57afe04.

The repro in the video below is taken on a Linux VM run in VirtualBox. The file that's being opened in Zui is the same soccer-ball.png that's opened by this e2e test which has been described as "flaky":

https://github.com/brimdata/zui/blob/57afe0458ff84182d4e392aada10f1d4608911da/packages/zui-player/tests/pool-loads.spec.ts#L42-L46

In the video I begin the process to load the file in Preview & Load several times. While the auto-detect failure is shown in the app most times, you can see on both the second and final attempts that no error message is shown.

https://github.com/brimdata/zui/assets/5934157/767269db-3530-4031-a6a9-2edbd12338f0

What led me to repro this manually is that I'd been studying that "flaky test" shown above via a branch where I ran just that test repeatedly in CI and found it failed 24 times out of 100 in a manner similar to what I showed in my manual repro video.

Based on what's shown here I suspect this may share a root cause with a separate Jest test that's been known to fail occasionally in CI that's been on my list to study:

https://github.com/brimdata/zui/blob/57afe0458ff84182d4e392aada10f1d4608911da/packages/zed-node/src/zq.test.ts#L106-L115

On the same Linux VM where I did the repro/video above I triggered a failure via this loop:

$ num=1 ; success=0; fail=0; while true; do yarn nx run zed-node:test --skip-nx-cache; if [ $? -eq 0 ]; then success=$(expr $success + 1); else fail=$(expr $fail + 1); fi ; num=$(expr $num + 1) ; echo "success=$success fail=$fail" ; done

To get to a quicker repro, I deleted all the other tests so just this one "zq with a bad zed" test would run, and I also first started a cat /dev/random > /dev/null & for every core on my VM since keeping the system under load seems to force the repro quicker (presumably because heavy load aggravates timing issues). With this approach I got 3 repros out of 218 runs, with this being an example of a failure:

 FAIL   zed-node  zed-node/src/zq.test.ts
  ✕ zq with a bad zed  (112 ms)

  ● zq with a bad zed 

    expect(received).rejects.toThrowError()

    Received promise resolved instead of rejected
    Resolved to value: []

      16 |   });
      17 |
    > 18 |   await expect(promise).rejects.toThrowError('error parsing Zed');
         |         ^
      19 | });
      20 |

      at expect (../../node_modules/expect/build/index.js:105:15)
      at Object.<anonymous> (src/zq.test.ts:18:9)

The rejection of the promise appears to be a prerequisite for seeing this error, which leads me to guess that in my manual repro in Preview & Load the lack of visible error message is similarly due to the promise resolving like we see here.

Conclusion

When I reviewed the latter test failure with @jameskerr in the past he acknowledged that the code in zq.ts that deals with the callouts to zq is a little complex and probably tricky to debug. But now that we've confirmed a user-facing bug is reproducible and it's not just a "flaky test", it seems worthwhile to invest some cycles to look closer at this.

As an idea to further debug this, I've been testing on the zui-3021-debug branch where I've created a wrapper script around the calls to zq. I figured with this approach I could separate out the stdout and stderr returned from zq and then pass them upstream to the calling app. A couple things I thought this might enable:

I could log the stdout, stderr, and received command line arguments to tmp files that I could examine after successful/failed runs just to check for any unexpected differences.
I could play around with the order/timing in which stdout and stderr were passed upstream in an attempt to aggravate timing problems, in the event that's the root cause of the problem observed here.

tl;dr

The mere presence of the wrapper script seems to make the test pass consistently.

Details

https://github.com/brimdata/zui/actions/runs/8090373897 was a baseline run where I'd just created the branch based off a recent tip of main and had made just the changes in 8a2f27e to run only the bad data displays an error message e2e test in a loop with zq being invoked directly as normal. This happened to produce 24 failures out of 100 runs just like the previous baseline I'd described in the opening comments of this issue.

https://github.com/brimdata/zui/actions/runs/8102744996 was the run that started from that baseline and also included all changes shown here to introduce the wrapper script and have it called in lieu of directly calling zq. The test ran successfully all 100 times.

I'm not sure what conclusions to draw from that result. 🤔

brimdata / zui