brimdata / super

An analytics database that puts JSON and relational tables on equal footing
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Vector query with "where" clause returns incorrect count #5468

Open philrz opened 1 week ago

Repro is with super commit 411514f.

This is a simplification of the bench2/q4 query.

Test data is the contents of repro.jsup.gz:

{log_time:2012-01-01T00:00:44Z,client_ip:249.92.17.134,request:"/courses/cs106/2004/Assignments/rudimentary-interp.html",status_code:304(uint16),object_size:0(uint64)}(=bench2)
{log_time:2012-10-01T00:24:30Z,client_ip:249.92.17.134,request:"/people/sr/",status_code:200(uint16),object_size:2242(uint64)}(=bench2)
{log_time:2012-05-12T10:23:22Z,client_ip:251.58.48.137,request:"/robots.txt",status_code:404(uint16),object_size:506(uint64)}(=bench2)

This query against the original Super JSON returns the expected result.

$ super -version
Version: v1.18.0-142-g411514fd

$ super -c '
summarize
    num_requests := count()
    where log_time >= 2012-10-01T00:00:00Z
    by client_ip
' repro.jsup.gz 

{client_ip:249.92.17.134,num_requests:1(uint64)}
{client_ip:251.58.48.137,num_requests:0(uint64)}

However, if I convert the Super JSON to Super Columnar and repeat the same query, the counts are too high.

$ super -f csup -o repro.csup repro.jsup.gz 

$ super dev vector query '
summarize
    num_requests := count()
    where log_time >= 2012-10-01T00:00:00Z
    by client_ip
' repro.csup

{client_ip:251.58.48.137,num_requests:1(uint64)}
{client_ip:249.92.17.134,num_requests:2(uint64)}

If I drop the where clause, the results from both queries match.
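
To make the expected numbers easy to check by hand, here is a minimal Python sketch that models the aggregation over the three test records. It is not part of super; the helper names parse and count_by_client are made up for illustration, and it assumes the where clause filters only the input to count() while every client_ip group still appears in the output, as the Super JSON result above suggests.

```python
from datetime import datetime, timezone

# The three records from repro.jsup.gz, reduced to the fields the query uses.
records = [
    ("249.92.17.134", "2012-01-01T00:00:44Z"),
    ("249.92.17.134", "2012-10-01T00:24:30Z"),
    ("251.58.48.137", "2012-05-12T10:23:22Z"),
]


def parse(ts):
    """Parse a UTC timestamp written like the log_time values (hypothetical helper)."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


CUTOFF = parse("2012-10-01T00:00:00Z")


def count_by_client(rows, predicate=None):
    """Count rows per client_ip (hypothetical helper).

    Assumption: a row that fails the predicate still creates its group,
    it just does not increment the count.
    """
    counts = {}
    for client_ip, log_time in rows:
        counts.setdefault(client_ip, 0)
        if predicate is None or predicate(parse(log_time)):
            counts[client_ip] += 1
    return counts


# Counts with the where clause applied, then without it.
filtered = count_by_client(records, lambda t: t >= CUTOFF)
unfiltered = count_by_client(records)

print(filtered)
print(unfiltered)
```

Under these assumptions the filtered counts are 1 and 0, matching the Super JSON result, while the unfiltered counts are 2 and 1, which coincide with what the vector query returned. That is consistent with the where clause not taking effect in the vector path, though this sketch does not confirm the cause.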