brimdata / zed

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.34k stars 67 forks source link

"panic: index out of range" on attempted join #5097

Open philrz opened 3 months ago

philrz commented 3 months ago

Repro is with GA Zed tagged v1.15.0 and the test data at s3://zed-issues/5097/all.zng.gz, as it's too large to attach to a GitHub Issue.

Save this program.zed:

from all
| _path=="notice"
| cut resp:=id.resp_h
| sort
| uniq
| inner join (
  from all
) on resp=id.resp_h foo:=uid

Now start a local zed serve, then load the test data and run the query as shown below. (For some reason I've not been able to repro the problem with zq or querying the lake directly at a filesystem location).

$ zed -version
Version: v1.15.0

$ zed create -use all
pool created: all 2ebe3H1zVKSKpKklFtzaWnh5xUn
Switched to branch "main" on pool "all"

$ zed load all.zng.gz 
(1/1) 51.69MB/51.69MB 15.85MB/s 100.00%
2ebe3r5ROZyKNgKiPzLLeSgUlg4 committed

$ zed query -I program.zed 

$ echo $?
0

At this point we find the zed serve has crashed with this panic dump:

panic: runtime error: index out of range [3] with length 3

goroutine 206 [running]:
github.com/brimdata/zed/runtime/sam/expr.(*DotExpr).Eval(0xc00f492880, {0x2a742e0, 0x397bba0}, {{0x2a7c9b8?, 0xc00f141d70?}, 0xc001b388dd?, 0x4bbe8f18?})
    /Users/phil/work/zed/runtime/sam/expr/dot.go:50 +0x24c
github.com/brimdata/zed/runtime/sam/expr.(*DotExpr).Eval(0xc00f4928c0, {0x2a742e0, 0x397bba0}, {{0x2a7c9b8?, 0xc00f141d70?}, 0xc001b388dd?, 0x3978728?})
    /Users/phil/work/zed/runtime/sam/expr/dot.go:42 +0x58
github.com/brimdata/zed/runtime/sam/expr.(*missingAsNull).Eval(0x2a7c918?, {0x2a742e0?, 0x397bba0?}, {{0x2a7c9b8, 0xc00f141d70}, 0xc001b388dd, 0x1a})
    /Users/phil/work/zed/runtime/sam/expr/sort.go:134 +0x3d
github.com/brimdata/zed/runtime/sam/expr.(*Comparator).Compare(0xc00f7c62a0, {{0x2a7c9b8?, 0xc00f77c9f0?}, 0xc004aa794a?, 0x0?}, {{0x2a7c9b8?, 0xc00f141d70?}, 0xc001b388dd?, 0x165b622?})
    /Users/phil/work/zed/runtime/sam/expr/sort.go:149 +0x1ee
github.com/brimdata/zed/runtime/sam/op/spill.(*MergeSort).Less(0xc00f3501e0, 0x1, 0x0)
    /Users/phil/work/zed/runtime/sam/op/spill/merge.go:142 +0x7d
container/heap.down({0x2a8f4e0, 0xc00f3501e0}, 0x0, 0x2)
    /usr/local/opt/go@1.21/libexec/src/container/heap/heap.go:111 +0xef
container/heap.Fix({0x2a8f4e0, 0xc00f3501e0}, 0x2a7c9b8?)
    /usr/local/opt/go@1.21/libexec/src/container/heap/heap.go:84 +0x3d
github.com/brimdata/zed/runtime/sam/op/spill.(*MergeSort).Read(0xc00f3501e0)
    /Users/phil/work/zed/runtime/sam/op/spill/merge.go:127 +0x59
github.com/brimdata/zed/zbuf.(*puller).Pull(0xc01535bef8, 0x50)
    /Users/phil/work/zed/zbuf/batch.go:85 +0x4e
github.com/brimdata/zed/runtime/sam/op/sort.(*Op).sendSpills(0xc0001753b0, 0x2a8aae8?)
    /Users/phil/work/zed/runtime/sam/op/sort/sort.go:162 +0x5a
github.com/brimdata/zed/runtime/sam/op/sort.(*Op).run(0xc0001753b0)
    /Users/phil/work/zed/runtime/sam/op/sort/sort.go:107 +0x2ee
created by github.com/brimdata/zed/runtime/sam/op/sort.(*Op).Pull.func1 in goroutine 205
    /Users/phil/work/zed/runtime/sam/op/sort/sort.go:48 +0x66

In addition to the panic, I'm somewhat concerned that the zed query simply produced no output and had an exit code of 0 implying success. I respect that there may be catastrophic failures like kill -9 in which the zed serve process would not be given a chance to say anything meaningful as it dies, but in this case since it was able to produce the panic dump at the console, is there some way it could have also returned enough info back to the client to imply something went wrong?

FYI, this bug does not seem new. I went through several releases all the way back to Zed v1.10.0 and these repro steps produced a panic in each one.

philrz commented 3 months ago

In a group discussion @nwt mentioned that the problem of the successful exit code is a known problem that shows up in a couple operators (join and sort were the two mentioned, I think) where a goroutine is used and the plumbing is not present to gracefully handle a panic. He explained that it's not a hard thing to fix, but it's kind of messy, which is why it's not been taken up yet. If the primary symptom of this issue is addressed separately, when we close this issue I can open an issue as a reminder to address the more general problem with handling the panics.