brimdata / zed

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.37k stars 67 forks source link

Query containing "uniq" causes lake service crash #5259

Closed philrz closed 1 day ago

philrz commented 1 week ago

tl;dr

After loading the zed-sample-data, this Zed program triggers a crash:

from "Zeek" | _path=="conn" | typeof(id) | uniq

Details

Repro is with Zed commit f8fb140. This problem was found by @jameskerr when using Zui.

To repro, first start a Zed lake service.

$ zed -version
Version: v1.17.0-47-gf8fb1406

$ zed -lake lake serve
{"level":"info","ts":1725472428.4225721,"logger":"core","msg":"Started","auth_enabled":false,"root":"file:///Users/phil/lake","version":"v1.17.0-47-gf8fb1406"}

Load the zed-sample-data, execute the query, and observe the stack trace. Sometimes it doesn't trigger on the first try, so I find looping the query saves time.

$ zed create -use Zeek
pool created: Zeek 2lcHiyLz4XwxJHkQ4Yshjq2M288
Switched to branch "main" on pool "Zeek"

$ zed load *.zng.gz
(35/35) 51.49MB/51.49MB 0B/s 100.00%
2lcHm1G7gkaoj8kUtLQuO6cacp3 committed

$ while true; do zed query 'from "Zeek" | _path=="conn" | typeof(id) | uniq'; done
<{orig_h:ip,orig_p:port=uint16,resp_h:ip,resp_p:port}>
<{orig_h:ip,orig_p:port=uint16,resp_h:ip,resp_p:port}>
<{orig_h:ip,orig_p:port=uint16,resp_h:ip,resp_p:port}>
<<ERR bad type ID in type value: type ID too large for primitive: 99>>
panic: runtime error: index out of range [403] with length 230
goroutine 449 [running]:
runtime/debug.Stack()
    /usr/local/opt/go@1.21/libexec/src/runtime/debug/stack.go:24 +0x5e
github.com/brimdata/zed/runtime/sam/op.(*Catcher).Pull.func1()
    /Users/phil/work/zed/runtime/sam/op/catcher.go:25 +0x3d
panic({0x27840a0?, 0xc013a6a000?})
    /usr/local/opt/go@1.21/libexec/src/runtime/panic.go:914 +0x21f
github.com/brimdata/zed.(*Arena).offsetAndLength(...)
    /Users/phil/work/zed/arena.go:216
github.com/brimdata/zed.(*Arena).bytes_(0x0?, 0xc0018deed0?)
    /Users/phil/work/zed/arena.go:195 +0x40d
github.com/brimdata/zed.Value.Bytes({0x100000c00604c240, 0x2000001c00000193})
    /Users/phil/work/zed/value.go:232 +0x20a
github.com/brimdata/zed/runtime/sam/op/uniq.(*Op).appendUniq(0xc002852d00, {0x0?, 0x0, 0x0}, {0xc00008ae70?, 0x1010f25?})
    /Users/phil/work/zed/runtime/sam/op/uniq/uniq.go:53 +0x85
github.com/brimdata/zed/runtime/sam/op/uniq.(*Op).Pull(0xc002852d00, 0x40?)
    /Users/phil/work/zed/runtime/sam/op/uniq/uniq.go:82 +0x265
github.com/brimdata/zed/runtime/sam/op.(*Single).Pull(0xc0028bc5d0, 0xa0)
    /Users/phil/work/zed/runtime/sam/op/mux.go:120 +0x33
github.com/brimdata/zed/runtime/sam/op.(*Catcher).Pull(0x1?, 0x60?)
    /Users/phil/work/zed/runtime/sam/op/catcher.go:28 +0x62
github.com/brimdata/zed/runtime/exec.(*Query).Pull(0xc0001d9e40?, 0x0?)
    /Users/phil/work/zed/runtime/exec/query.go:59 +0x3e
github.com/brimdata/zed/service.handleQuery.func2()
    /Users/phil/work/zed/service/handlers.go:92 +0x42
created by github.com/brimdata/zed/service.handleQuery in goroutine 525
    /Users/phil/work/zed/service/handlers.go:90 +0x753

For whatever reason, it seems to crash more easily when the equivalent load/query is done within Zui similar to how @jameskerr originally found it. The crash also can cause the Zed lake service to exit in that case, making this a pretty bad bug.

philrz commented 6 days ago

A user reported in a community Slack thread what may have been another sighting of this issue when they were working in Zui Insiders. In their own words:

Lol I accidentally did two cut | uniq | fuse in a row and that seems to stop the service

I attempted to repro using the small sample.zng test data by entering the query cut | uniq | fuse | cut | uniq | fuse but was unsuccessful. However, recalling this bug, I just confirmed in 1.17.1-insiders.22 that if I loaded the zed-sample-data, I could repro a crash with just uniq. So the fact the user's pipeline used uniq leads me to guess he might be hitting the same problem.

https://github.com/user-attachments/assets/d1e7e1a0-b480-4101-ab87-dcf140024f2b

FWIW, when I ran this variation against a local lake service at commit 7d8c575, I did get a different stack dump:

panic: runtime error: slice bounds out of range [:508169] with capacity 0

goroutine 613 [running]:
github.com/brimdata/zed.(*Arena).bytes_(0x1033229?, 0xc0021b6d20?)
    /Users/phil/work/zed/arena.go:191 +0x3dc
github.com/brimdata/zed.Value.Bytes({0x100000c007d50120, 0x400000270000118b})
    /Users/phil/work/zed/value.go:232 +0x20a
github.com/brimdata/zed/runtime/sam/expr.(*DotExpr).Eval(0xc00078bd40, {0x2b50a58?, 0xc00913ece0?}, {0x4e142e0?, 0xc0021b6d58?})
    /Users/phil/work/zed/runtime/sam/expr/dot.go:50 +0x21c
github.com/brimdata/zed/runtime/sam/expr.(*filter).Eval(0xc017a0fe18, {0x2b50a58?, 0xc00913ece0?}, {0xc0021b6d70?, 0x101e796?})
    /Users/phil/work/zed/runtime/sam/expr/filter.go:240 +0x31
github.com/brimdata/zed/runtime/sam/expr.EvalBool(0x16000?, {0x2b50a58, 0xc00913ece0}, {0x0?, 0xc0021b6e00?}, {0x2b4b1e0?, 0xc017a0fe18?})
    /Users/phil/work/zed/runtime/sam/expr/eval.go:71 +0x42
github.com/brimdata/zed/runtime/sam/expr.(*filterApplier).Eval(0xc017a0fe30, {0x2b50a58, 0xc00913ece0}, {0x0?, 0x1?})
    /Users/phil/work/zed/runtime/sam/expr/filter.go:260 +0x3e
github.com/brimdata/zed/runtime/sam/op.(*applier).Pull(0xc00078bd80, 0x88?)
    /Users/phil/work/zed/runtime/sam/op/apply.go:38 +0x28f
github.com/brimdata/zed/runtime/sam/op/groupby.(*Op).run(0xc00072be00)
    /Users/phil/work/zed/runtime/sam/op/groupby/groupby.go:211 +0xa6
created by github.com/brimdata/zed/runtime/sam/op/groupby.(*Op).Pull.func1 in goroutine 612
    /Users/phil/work/zed/runtime/sam/op/groupby/groupby.go:179 +0x66
nwt commented 1 day ago

Fixed in #5278.

philrz commented 1 day ago

Verified in Zed commit 4a1ebf5.

Repeating the repro steps, the looping query that previously triggered the crash now runs as long as I leave it without triggering a crash.

$ zed -version
Version: v1.17.0-64-g4a1ebf5b

$ zed create -use Zeek
pool created: Zeek 2mAcf5r7RFPERXW6ZTvGY0eeZcQ
Switched to branch "main" on pool "Zeek"

$ zed load *.zng.gz
(35/35) 51.49MB/51.49MB 0B/s 100.00%
2mAcgEjjcGGt5fDmLUpzLGHAb9E committed

$ while true; do zed query 'from "Zeek" | _path=="conn" | typeof(id) | uniq'; done
<{orig_h:ip,orig_p:port=uint16,resp_h:ip,resp_p:port}>
<{orig_h:ip,orig_p:port=uint16,resp_h:ip,resp_p:port}>
[...ran 100+ times before I stopped it...]

I also confirmed that Zui Insiders 1.17.1-insiders.26 which has the benefit of the same Zed fixes can also run the plain uniq repeatedly on the sample data instead of crashing as I showed in the previous video.

https://github.com/user-attachments/assets/61604cd7-d147-4ecb-850e-475d8d8fe693

Thanks @nwt!