brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks source link

"panic: index out of range" on attempted join #5097

Open philrz opened 6 months ago

philrz commented 6 months ago

Repro is with GA Zed tagged v1.15.0 and the test data at s3://zed-issues/5097/all.zng.gz, as it's too large to attach to a GitHub Issue.

Save this program.zed:

from all
| _path=="notice"
| cut resp:=id.resp_h
| sort
| uniq
| inner join (
  from all
) on resp=id.resp_h foo:=uid

Now start a local zed serve, then load the test data and run the query as shown below. (For some reason I've not been able to repro the problem with zq or querying the lake directly at a filesystem location).

$ zed -version
Version: v1.15.0

$ zed create -use all
pool created: all 2ebe3H1zVKSKpKklFtzaWnh5xUn
Switched to branch "main" on pool "all"

$ zed load all.zng.gz 
(1/1) 51.69MB/51.69MB 15.85MB/s 100.00%
2ebe3r5ROZyKNgKiPzLLeSgUlg4 committed

$ zed query -I program.zed 

$ echo $?
0

At this point we find the zed serve has crashed with this panic dump:

panic: runtime error: index out of range [3] with length 3

goroutine 206 [running]:
github.com/brimdata/zed/runtime/sam/expr.(*DotExpr).Eval(0xc00f492880, {0x2a742e0, 0x397bba0}, {{0x2a7c9b8?, 0xc00f141d70?}, 0xc001b388dd?, 0x4bbe8f18?})
    /Users/phil/work/zed/runtime/sam/expr/dot.go:50 +0x24c
github.com/brimdata/zed/runtime/sam/expr.(*DotExpr).Eval(0xc00f4928c0, {0x2a742e0, 0x397bba0}, {{0x2a7c9b8?, 0xc00f141d70?}, 0xc001b388dd?, 0x3978728?})
    /Users/phil/work/zed/runtime/sam/expr/dot.go:42 +0x58
github.com/brimdata/zed/runtime/sam/expr.(*missingAsNull).Eval(0x2a7c918?, {0x2a742e0?, 0x397bba0?}, {{0x2a7c9b8, 0xc00f141d70}, 0xc001b388dd, 0x1a})
    /Users/phil/work/zed/runtime/sam/expr/sort.go:134 +0x3d
github.com/brimdata/zed/runtime/sam/expr.(*Comparator).Compare(0xc00f7c62a0, {{0x2a7c9b8?, 0xc00f77c9f0?}, 0xc004aa794a?, 0x0?}, {{0x2a7c9b8?, 0xc00f141d70?}, 0xc001b388dd?, 0x165b622?})
    /Users/phil/work/zed/runtime/sam/expr/sort.go:149 +0x1ee
github.com/brimdata/zed/runtime/sam/op/spill.(*MergeSort).Less(0xc00f3501e0, 0x1, 0x0)
    /Users/phil/work/zed/runtime/sam/op/spill/merge.go:142 +0x7d
container/heap.down({0x2a8f4e0, 0xc00f3501e0}, 0x0, 0x2)
    /usr/local/opt/go@1.21/libexec/src/container/heap/heap.go:111 +0xef
container/heap.Fix({0x2a8f4e0, 0xc00f3501e0}, 0x2a7c9b8?)
    /usr/local/opt/go@1.21/libexec/src/container/heap/heap.go:84 +0x3d
github.com/brimdata/zed/runtime/sam/op/spill.(*MergeSort).Read(0xc00f3501e0)
    /Users/phil/work/zed/runtime/sam/op/spill/merge.go:127 +0x59
github.com/brimdata/zed/zbuf.(*puller).Pull(0xc01535bef8, 0x50)
    /Users/phil/work/zed/zbuf/batch.go:85 +0x4e
github.com/brimdata/zed/runtime/sam/op/sort.(*Op).sendSpills(0xc0001753b0, 0x2a8aae8?)
    /Users/phil/work/zed/runtime/sam/op/sort/sort.go:162 +0x5a
github.com/brimdata/zed/runtime/sam/op/sort.(*Op).run(0xc0001753b0)
    /Users/phil/work/zed/runtime/sam/op/sort/sort.go:107 +0x2ee
created by github.com/brimdata/zed/runtime/sam/op/sort.(*Op).Pull.func1 in goroutine 205
    /Users/phil/work/zed/runtime/sam/op/sort/sort.go:48 +0x66

In addition to the panic, I'm somewhat concerned that the zed query simply produced no output and had an exit code of 0 implying success. I respect that there may be catastrophic failures like kill -9 in which the zed serve process would not be given a chance to say anything meaningful as it dies, but in this case since it was able to produce the panic dump at the console, is there some way it could have also returned enough info back to the client to imply something went wrong?

FYI, this bug does not seem new. I went through several releases all the way back to Zed v1.10.0 and these repro steps produced a panic in each one.

philrz commented 6 months ago

In a group discussion @nwt mentioned that the problem of the successful exit code is a known problem that shows up in a couple operators (join and sort were the two mentioned, I think) where a goroutine is used and the plumbing is not present to gracefully handle a panic. He explained that it's not a hard thing to fix, but it's kind of messy, which is why it's not been taken up yet. If the primary symptom of this issue is addressed separately, when we close this issue I can open an issue as a reminder to address the more general problem with handling the panics.

philrz commented 1 month ago

Verified in Zed commit 1ffe14f.

I happened to be working with a community user that mentioned a join crash they'd seen recently but were having difficulty reproducing. That led me to check for open bugs and, upon spotting this one, I did a binary search and confirmed it actually got fixed by the changes in #5239. Repeating the repro steps, we see the query now completes successfully without panic.

$ zed -version
Version: v1.17.0-48-g1ffe14f6

$ zed create -use all
pool created: all 2ltfe74xwOimYPbbuZqPWCe1ifr
Switched to branch "main" on pool "all"

$ zed load all.zng.gz
(1/1) 51.69MB/51.69MB 16.35MB/s 100.00%
2ltfej0g9ljMypTk5e8BNtiEowD committed

$ zed query -I program.zed
{resp:2.22.42.85,foo:"CvNzR22iDIjXuSwJyl"}
{resp:2.22.42.85,foo:"CvNzR22iDIjXuSwJyl"}
...
{resp:null(ip),foo:null(string)}
{resp:null(ip),foo:null(string)}

I've opened separate issue #5270 to track the secondary issue described above in https://github.com/brimdata/zed/issues/5097#issuecomment-2037867789.

Thanks @mattnibs!

philrz commented 1 month ago

Now that Arena has been backed out (#5278) I've been re-verifying issues that were fixed in the Arena era to make sure that prior issues haven't resurfaced, and it looks like this one has. Using the same repro steps shown above, here's this one happening once again now at current tip of main Zed commit 4a1ebf5. (For some reason, I'm now finding I often need to run the query twice to get a reliable crash, though in the one captured below I hit it on the first try.)

$ zed -version
Version: v1.17.0-64-g4a1ebf5b

$ zed create -use all
pool created: all 2mAZymARK4LOszZN5XIDCByaXIg
Switched to branch "main" on pool "all"

$ zed load all.zng.gz
(1/1) 51.69MB/51.69MB 17.27MB/s 100.00%
2mAZzUIE9GKtPhYzKY0VyPepUth committed

$ zed query -I program.zed

$ echo $?
0

The dump that appears when zed serve exits:

panic: runtime error: index out of range [3] with length 3

goroutine 202 [running]:
github.com/brimdata/zed/runtime/sam/expr.(*DotExpr).Eval(0xc00c692f00, {0x2b30ae0, 0x3a85260}, {{0x2b38e30?, 0xc019f7d710?}, 0xc000c9c8dd?, 0x4b8503e8?})
    /Users/phil/work/zed/runtime/sam/expr/dot.go:50 +0x24c
github.com/brimdata/zed/runtime/sam/expr.(*DotExpr).Eval(0xc00c692f40, {0x2b30ae0, 0x3a85260}, {{0x2b38e30?, 0xc019f7d710?}, 0xc000c9c8dd?, 0xc00c582258?})
    /Users/phil/work/zed/runtime/sam/expr/dot.go:42 +0x58
github.com/brimdata/zed/runtime/sam/expr.(*missingAsNull).Eval(0x2b38d90?, {0x2b30ae0?, 0x3a85260?}, {{0x2b38e30, 0xc019f7d710}, 0xc000c9c8dd, 0x1a})
    /Users/phil/work/zed/runtime/sam/expr/sort.go:143 +0x3d
github.com/brimdata/zed/runtime/sam/expr.(*Comparator).Compare(0xc00e132000, {{0x2b38e30?, 0xc00a4c00f0?}, 0xc00495f94a?, 0x0?}, {{0x2b38e30?, 0xc019f7d710?}, 0xc000c9c8dd?, 0x1660242?})
    /Users/phil/work/zed/runtime/sam/expr/sort.go:155 +0x1b6
github.com/brimdata/zed/runtime/sam/op/spill.(*MergeSort).Less(0xc01bee8050, 0x1, 0x0)
    /Users/phil/work/zed/runtime/sam/op/spill/merge.go:142 +0x7d
container/heap.down({0x2b4cde0, 0xc01bee8050}, 0x0, 0x2)
    /usr/local/opt/go@1.21/libexec/src/container/heap/heap.go:111 +0xef
container/heap.Fix({0x2b4cde0, 0xc01bee8050}, 0x2b38e30?)
    /usr/local/opt/go@1.21/libexec/src/container/heap/heap.go:84 +0x3d
github.com/brimdata/zed/runtime/sam/op/spill.(*MergeSort).Read(0xc01bee8050)
    /Users/phil/work/zed/runtime/sam/op/spill/merge.go:127 +0x59
github.com/brimdata/zed/zbuf.(*puller).Pull(0xc00de81ef8, 0x38?)
    /Users/phil/work/zed/zbuf/batch.go:88 +0x75
github.com/brimdata/zed/runtime/sam/op/sort.(*Op).sendSpills(0xc00c690d80, 0x2b48308?)
    /Users/phil/work/zed/runtime/sam/op/sort/sort.go:164 +0x5a
github.com/brimdata/zed/runtime/sam/op/sort.(*Op).run(0xc00c690d80)
    /Users/phil/work/zed/runtime/sam/op/sort/sort.go:109 +0x2ee
created by github.com/brimdata/zed/runtime/sam/op/sort.(*Op).Pull.func1 in goroutine 201
    /Users/phil/work/zed/runtime/sam/op/sort/sort.go:50 +0x66