brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/

Duplicate type IDs with aliases and stream resets #866

Closed henridf closed 4 years ago

henridf commented 4 years ago

A groupby query on a zqd-imported space results in duplicate groups due to spurious type IDs. (The same query run on an all.zng obtained by manually running zq ~/Downloads/sampledata/corelight19/zeek-logs/*.log > ./all.zng does not show this bug.)

#zenum=string
#0:record[proto:zenum,count:uint64]
0:[tcp;20968;]
0:[udp;35488;]
0:[icmp;1048;]
#zenum=string
#1:record[proto:zenum,count:uint64]
1:[tcp;22328;]
1:[udp;19736;]
1:[icmp;520;]
#zenum=string
#2:record[proto:zenum,count:uint64]
2:[tcp;256860;]
2:[udp;483354;]
2:[icmp;12120;]
#zenum=string
...
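For intuition only, and not zq's actual aggregator code: the shape of the output above is what you would expect if group keys are compared together with their type IDs, so the same proto value under two spuriously distinct type IDs lands in two separate groups. A hedged Go sketch of that effect:

package main

import "fmt"

// Hedged illustration, not zq's aggregator: if a group key carries its
// type ID, identical proto values under two distinct type IDs split
// into separate groups, matching the duplicated output above.
type groupKey struct {
	typeID int
	proto  string
}

func main() {
	// Two streams that should share one record type, but were assigned
	// different type IDs (#0 and #1) after a stream reset.
	inputs := []groupKey{
		{0, "tcp"}, {0, "udp"}, {0, "icmp"},
		{1, "tcp"}, {1, "udp"}, {1, "icmp"},
	}
	counts := map[groupKey]int{}
	for _, in := range inputs {
		counts[in]++
	}
	fmt.Println("groups:", len(counts)) // 6 groups instead of 3
}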
henridf commented 4 years ago

This is an issue with aliases and stream type resets. Here's a much simpler repro of the bug, followed by a minor variation where the bug is absent. The only difference between the two cases is that the order of the first two input records is inverted. The bug occurs in the first case because the alias types get different IDs in the different streams, and our translation scheme does not recognize that they are identical. It does not occur in the second case because the aliases happen to get the same IDs. (See the illustrative sketch after the two repros below.)

incorrect:

~/work/brim/zq(master)
$ cat bug.tzng
#0:record[ts:time]
0:[1425565512.963801;]
#zenum=string
#1:record[ts:time,proto:zenum]
1:[1425565514.419939;udp;]
1:[1425565514.419939;udp;]
~/work/brim/zq(master)
$ zq -b 2 bug.tzng | zq -t "count() by proto" -
#zenum=string
#0:record[proto:zenum,count:uint64]
0:[udp;1;]
#zenum=string
#1:record[proto:zenum,count:uint64]
1:[udp;1;]

correct:

~/work/brim/zq(master)
$ cat nobug.tzng
#zenum=string
#0:record[ts:time,proto:zenum]
0:[1425565514.419939;udp;]
#1:record[ts:time]
1:[1425565512.963801;]
0:[1425565514.419939;udp;]
~/work/brim/zq(master)
$ zq -b 2 nobug.tzng | zq -t "count() by proto" -
#zenum=string
#0:record[proto:zenum,count:uint64]
0:[udp;2;]
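A minimal sketch of the translation idea, assuming a hypothetical shared table keyed on a canonical type signature (this is not the actual zq code): if translation keys on the full signature, which includes the alias name, rather than on the per-stream numeric ID, both orderings above resolve to a single output type.

package main

import "fmt"

// Sketch only: a shared context that deduplicates types arriving from
// independent input streams by canonical signature instead of trusting
// the per-stream numeric ID.
type sharedContext struct {
	nextID int
	bySig  map[string]int // canonical type signature -> shared ID
}

func newSharedContext() *sharedContext {
	return &sharedContext{bySig: map[string]int{}}
}

func (c *sharedContext) translate(sig string) int {
	if id, ok := c.bySig[sig]; ok {
		return id
	}
	id := c.nextID
	c.nextID++
	c.bySig[sig] = id
	return id
}

func main() {
	ctx := newSharedContext()
	// One stream assigns the aliased record local ID #1; another assigns
	// it #2, but the canonical signature is identical in both.
	a := ctx.translate("record[ts:time,proto:zenum=string]")
	b := ctx.translate("record[ts:time,proto:zenum=string]")
	fmt.Println(a == b) // true: one shared type, no duplicate definition
}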
henridf commented 4 years ago

An even simpler repro without groupby:

$ echo "#0:record[ts:time]
0:[1425565512.963801;]
#zenum=string
#1:record[ts:time,proto:zenum]
1:[1425565514.419939;udp;]
1:[1425565514.419939;udp;]
" | zq -b 2 - | zq -t -
#0:record[ts:time]
0:[1425565512.963801;]
#zenum=string
#1:record[ts:time,proto:zenum]
1:[1425565514.419939;udp;]
#zenum=string
#2:record[ts:time,proto:zenum]
2:[1425565514.419939;udp;]

This has duplicate types 1 and 2 in the output, whereas the same pipeline without -b 2 does not:

$ echo "#0:record[ts:time]
0:[1425565512.963801;]
#zenum=string
#1:record[ts:time,proto:zenum]
1:[1425565514.419939;udp;]
1:[1425565514.419939;udp;]
" | zq  - | zq -t -
#0:record[ts:time]
0:[1425565512.963801;]
#zenum=string
#1:record[ts:time,proto:zenum]
1:[1425565514.419939;udp;]
1:[1425565514.419939;udp;]
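As a standalone way to spot the symptom, here is a hypothetical helper (not part of zq) that scans TZNG on stdin and reports any type body defined under more than one numeric ID. Piping the -b 2 output above through it would flag record[ts:time,proto:zenum] as being defined under both #1 and #2.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Hypothetical check, not part of zq: report TZNG type bodies that are
// redefined under a second numeric ID.
func main() {
	firstID := map[string]int{} // type body -> first ID that defined it
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "#") {
			continue // value line
		}
		parts := strings.SplitN(line[1:], ":", 2)
		if len(parts) != 2 {
			continue // alias definition such as #zenum=string
		}
		id, err := strconv.Atoi(parts[0])
		if err != nil {
			continue // not a numeric type definition
		}
		body := parts[1]
		if prev, ok := firstID[body]; ok && prev != id {
			fmt.Printf("type %q defined as #%d and #%d\n", body, prev, id)
		} else if !ok {
			firstID[body] = id
		}
	}
}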
philrz commented 4 years ago

Verified in zq commit 276030d.

I first took a step back and did something like the original repro after importing the Zeek-format zq-sample-data into GA Brim tagged v0.12.0, pointing at zqd tagged v0.16.0. I then queried the resulting all.zng using zq commit 4ac12dd, which still showed the issue.

$ zq -t '_path=conn | every 1min count() by proto | sort ts' all.zng
#zenum=string
#0:record[ts:time,proto:zenum,count:uint64]
0:[1521911700;icmp;2840;]
#zenum=string
#1:record[ts:time,proto:zenum,count:uint64]
1:[1521911700;icmp;45;]
#zenum=string
#2:record[ts:time,proto:zenum,count:uint64]
2:[1521911700;tcp;3051;]
1:[1521911700;udp;33;]
0:[1521911700;udp;441;]
2:[1521911700;icmp;110;]
...

Then querying the same all.zng with zq commit 276030d:

$ zq -t '_path=conn | every 1min count() by proto | sort ts' all.zng | head -20
#zenum=string
#0:record[ts:time,proto:zenum,count:uint64]
0:[1521911700;udp;533;]
0:[1521911700;tcp;43771;]
0:[1521911700;icmp;3324;]
0:[1521911760;tcp;79586;]
0:[1521911760;udp;856;]
0:[1521911760;icmp;504;]
0:[1521911820;icmp;58;]
0:[1521911820;tcp;46524;]
0:[1521911820;udp;1062;]
...

Thanks @henridf!