Open philrz opened 8 months ago
In a group discussion @nwt mentioned that the problem of the successful exit code is a known problem that shows up in a couple operators (join
and sort
were the two mentioned, I think) where a goroutine is used and the plumbing is not present to gracefully handle a panic. He explained that it's not a hard thing to fix, but it's kind of messy, which is why it's not been taken up yet. If the primary symptom of this issue is addressed separately, when we close this issue I can open an issue as a reminder to address the more general problem with handling the panics.
Verified in Zed commit 1ffe14f.
I happened to be working with a community user that mentioned a join
crash they'd seen recently but were having difficulty reproducing. That led me to check for open bugs and, upon spotting this one, I did a binary search and confirmed it actually got fixed by the changes in #5239. Repeating the repro steps, we see the query now completes successfully without panic.
$ zed -version
Version: v1.17.0-48-g1ffe14f6
$ zed create -use all
pool created: all 2ltfe74xwOimYPbbuZqPWCe1ifr
Switched to branch "main" on pool "all"
$ zed load all.zng.gz
(1/1) 51.69MB/51.69MB 16.35MB/s 100.00%
2ltfej0g9ljMypTk5e8BNtiEowD committed
$ zed query -I program.zed
{resp:2.22.42.85,foo:"CvNzR22iDIjXuSwJyl"}
{resp:2.22.42.85,foo:"CvNzR22iDIjXuSwJyl"}
...
{resp:null(ip),foo:null(string)}
{resp:null(ip),foo:null(string)}
I've opened separate issue #5270 to track the secondary issue described above in https://github.com/brimdata/zed/issues/5097#issuecomment-2037867789.
Thanks @mattnibs!
Now that Arena has been backed out (#5278) I've been re-verifying issues that were fixed in the Arena era to make sure that prior issues haven't resurfaced, and it looks like this one has. Using the same repro steps shown above, here's this one happening once again now at current tip of main
Zed commit 4a1ebf5. (For some reason, I'm now finding I often need to run the query twice to get a reliable crash, though in the one captured below I hit it on the first try.)
$ zed -version
Version: v1.17.0-64-g4a1ebf5b
$ zed create -use all
pool created: all 2mAZymARK4LOszZN5XIDCByaXIg
Switched to branch "main" on pool "all"
$ zed load all.zng.gz
(1/1) 51.69MB/51.69MB 17.27MB/s 100.00%
2mAZzUIE9GKtPhYzKY0VyPepUth committed
$ zed query -I program.zed
$ echo $?
0
The dump that appears when zed serve
exits:
panic: runtime error: index out of range [3] with length 3
goroutine 202 [running]:
github.com/brimdata/zed/runtime/sam/expr.(*DotExpr).Eval(0xc00c692f00, {0x2b30ae0, 0x3a85260}, {{0x2b38e30?, 0xc019f7d710?}, 0xc000c9c8dd?, 0x4b8503e8?})
/Users/phil/work/zed/runtime/sam/expr/dot.go:50 +0x24c
github.com/brimdata/zed/runtime/sam/expr.(*DotExpr).Eval(0xc00c692f40, {0x2b30ae0, 0x3a85260}, {{0x2b38e30?, 0xc019f7d710?}, 0xc000c9c8dd?, 0xc00c582258?})
/Users/phil/work/zed/runtime/sam/expr/dot.go:42 +0x58
github.com/brimdata/zed/runtime/sam/expr.(*missingAsNull).Eval(0x2b38d90?, {0x2b30ae0?, 0x3a85260?}, {{0x2b38e30, 0xc019f7d710}, 0xc000c9c8dd, 0x1a})
/Users/phil/work/zed/runtime/sam/expr/sort.go:143 +0x3d
github.com/brimdata/zed/runtime/sam/expr.(*Comparator).Compare(0xc00e132000, {{0x2b38e30?, 0xc00a4c00f0?}, 0xc00495f94a?, 0x0?}, {{0x2b38e30?, 0xc019f7d710?}, 0xc000c9c8dd?, 0x1660242?})
/Users/phil/work/zed/runtime/sam/expr/sort.go:155 +0x1b6
github.com/brimdata/zed/runtime/sam/op/spill.(*MergeSort).Less(0xc01bee8050, 0x1, 0x0)
/Users/phil/work/zed/runtime/sam/op/spill/merge.go:142 +0x7d
container/heap.down({0x2b4cde0, 0xc01bee8050}, 0x0, 0x2)
/usr/local/opt/go@1.21/libexec/src/container/heap/heap.go:111 +0xef
container/heap.Fix({0x2b4cde0, 0xc01bee8050}, 0x2b38e30?)
/usr/local/opt/go@1.21/libexec/src/container/heap/heap.go:84 +0x3d
github.com/brimdata/zed/runtime/sam/op/spill.(*MergeSort).Read(0xc01bee8050)
/Users/phil/work/zed/runtime/sam/op/spill/merge.go:127 +0x59
github.com/brimdata/zed/zbuf.(*puller).Pull(0xc00de81ef8, 0x38?)
/Users/phil/work/zed/zbuf/batch.go:88 +0x75
github.com/brimdata/zed/runtime/sam/op/sort.(*Op).sendSpills(0xc00c690d80, 0x2b48308?)
/Users/phil/work/zed/runtime/sam/op/sort/sort.go:164 +0x5a
github.com/brimdata/zed/runtime/sam/op/sort.(*Op).run(0xc00c690d80)
/Users/phil/work/zed/runtime/sam/op/sort/sort.go:109 +0x2ee
created by github.com/brimdata/zed/runtime/sam/op/sort.(*Op).Pull.func1 in goroutine 201
/Users/phil/work/zed/runtime/sam/op/sort/sort.go:50 +0x66
Repro is with GA Zed tagged
v1.15.0
and the test data ats3://zed-issues/5097/all.zng.gz
, as it's too large to attach to a GitHub Issue.Save this
program.zed
:Now start a local
zed serve
, then load the test data and run the query as shown below. (For some reason I've not been able to repro the problem withzq
or querying the lake directly at a filesystem location).At this point we find the
zed serve
has crashed with this panic dump:In addition to the panic, I'm somewhat concerned that the
zed query
simply produced no output and had an exit code of0
implying success. I respect that there may be catastrophic failures likekill -9
in which thezed serve
process would not be given a chance to say anything meaningful as it dies, but in this case since it was able to produce the panic dump at the console, is there some way it could have also returned enough info back to the client to imply something went wrong?FYI, this bug does not seem new. I went through several releases all the way back to Zed
v1.10.0
and these repro steps produced a panic in each one.