brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

collect() shows bogus IP values #1595

Closed by philrz 3 years ago

philrz commented 3 years ago

Repro is with zq commit 01fe98b.

Test data is below:

Archive.zip

(Those are TSV logs generated by GA Zeek v3.2.2 based on this test pcap, ZIP password: infected.)

If I run zq directly on the TSV logs, I get what appears to be correct output:

$ zq -t '_path=conn id.resp_p=8080 | nodes=collect(id.resp_h) by id.orig_h, id.resp_p' *
#port=uint16
#0:record[id:record[orig_h:ip,resp_p:port],nodes:array[ip]]
0:[[10.9.1.101;8080;][118.110.236.121;118.110.236.121;118.110.236.121;]]
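
For readers unfamiliar with the query, here is a rough Python sketch of what the filter plus `collect(...) by ...` is expected to compute. It is not zq's implementation: the `collect_by` helper, the flattened dotted keys, and the sample records are all hypothetical (only the matching values mirror the output above).

```python
from collections import defaultdict


def collect_by(records, value_field, key_fields):
    """Group records by key_fields and gather value_field into a list per group.

    Hypothetical helper that mimics the semantics of
    `nodes=collect(id.resp_h) by id.orig_h, id.resp_p`.
    """
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[k] for k in key_fields)
        groups[key].append(rec[value_field])
    return dict(groups)


# Illustrative records only; this is not the data in Archive.zip.
records = [
    {"_path": "conn", "id.orig_h": "10.9.1.101", "id.resp_h": "118.110.236.121", "id.resp_p": 8080},
    {"_path": "conn", "id.orig_h": "10.9.1.101", "id.resp_h": "118.110.236.121", "id.resp_p": 8080},
    {"_path": "conn", "id.orig_h": "10.9.1.101", "id.resp_h": "118.110.236.121", "id.resp_p": 8080},
    {"_path": "conn", "id.orig_h": "10.9.1.101", "id.resp_h": "192.0.2.1", "id.resp_p": 53},
    {"_path": "dns", "id.orig_h": "10.9.1.101", "id.resp_h": "192.0.2.1", "id.resp_p": 8080},
]

# Equivalent of the filter `_path=conn id.resp_p=8080`.
matching = [r for r in records if r["_path"] == "conn" and r["id.resp_p"] == 8080]

# Equivalent of `nodes=collect(id.resp_h) by id.orig_h, id.resp_p`.
nodes = collect_by(matching, "id.resp_h", ["id.orig_h", "id.resp_p"])
print(nodes)
```

The point of the sketch is that the collected array can only contain values that were present in the matching input records, whatever format those records were read from.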

If I turn the data into ZNG first, the output changes:

$ zq * | zq -t '_path=conn id.resp_p=8080 | nodes=collect(id.resp_h) by id.orig_h, id.resp_p' -
#port=uint16
#0:record[id:record[orig_h:ip,resp_p:port],nodes:array[ip]]
0:[[10.9.1.101;8080;][84.4.32.6;118.110.236.121;118.110.236.121;]]

We know the latter output is incorrect because the IP address 84.4.32.6 does not appear anywhere in the data: a search for it returns nothing in either format.

$ zq -t '84.4.32.6' *
[no output]

$ zq * | zq -t '84.4.32.6' -
[no output]
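
The reasoning above amounts to an invariant: every value that `collect()` returns must also occur in the input. A minimal Python sketch of that check follows, assuming the input values and the collected array have already been extracted into plain lists; the `bogus_values` helper is hypothetical and not part of zq.

```python
from collections import Counter


def bogus_values(collected, inputs):
    """Return collected values that occur more often than they do in inputs."""
    # Counter subtraction keeps only positive counts, i.e. the surplus values.
    extra = Counter(collected) - Counter(inputs)
    return sorted(extra.elements())


# Values taken from the two outputs shown in this issue.
inputs = ["118.110.236.121"] * 3
from_tsv = ["118.110.236.121", "118.110.236.121", "118.110.236.121"]
from_zng = ["84.4.32.6", "118.110.236.121", "118.110.236.121"]

print(bogus_values(from_tsv, inputs))
print(bogus_values(from_zng, inputs))
```

An empty result means the aggregate is consistent with its input; any value listed is one that the aggregation produced without it being present in the data.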

philrz commented 3 years ago

Verified in zq commit f1bf7cc7.

Now the output is the same even after the data has been turned into ZNG.

$ zq -version
Version: v0.23.0-12-gf1bf7cc7

$ zq -t '_path=conn id.resp_p=8080 | nodes=collect(id.resp_h) by id.orig_h, id.resp_p' *
#port=uint16
#0:record[id:record[orig_h:ip,resp_p:port],nodes:array[ip]]
0:[[10.9.1.101;8080;][118.110.236.121;118.110.236.121;118.110.236.121;]]

$ zq * | zq -t '_path=conn id.resp_p=8080 | nodes=collect(id.resp_h) by id.orig_h, id.resp_p' -
#port=uint16
#0:record[id:record[orig_h:ip,resp_p:port],nodes:array[ip]]
0:[[10.9.1.101;8080;][118.110.236.121;118.110.236.121;118.110.236.121;]]

Thanks @mccanne!