brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks source link

zar demo regression at zq commit 617047f #946

Closed philrz closed 4 years ago

philrz commented 4 years ago

I was running a zar demo yesterday and it went off the rails when running through the simple graph queries step. I've done a binary search and determined that the regression is correlated with commit 617047f that came with @henridf's #914 ("Groupby: remove deterministic output order"). Was zar somehow dependent on the previous "deterministic but undefined" order?

Assuming the setup described in the zar README, the following repro.sh script has the commands from the demo that make it easy to repro:

#!/bin/sh
cd $HOME
rm -rf logs
mkdir logs
zq zng/*.gz | zar import -s 25MB -
zar zq -o forward.zng "id.orig_h != null | put from=id.orig_h,to=id.resp_h | count() by from,to" _
zar zq -o reverse.zng "id.orig_h != null | put from=id.resp_h,to=id.orig_h | count() by from,to" _
zar zq -o directed-pairs.zng "sum(count) as count by from,to" forward.zng reverse.zng
zar index -i directed-pairs.zng -o graph  -k from,to -z "*"
zar find -z -x graph 216.58.193.195 | zq -f table "sum(count) as count by from,to" -

As of commit 8dab96b that came right before #914 was merged, the output lines up with what's in the README:

$ ./repro.sh 
...
FROM           TO           COUNT
216.58.193.195 10.47.1.100  6
216.58.193.195 10.47.1.150  8
216.58.193.195 10.47.1.151  8
216.58.193.195 10.47.1.152  18
...

Then at commit 617047f, the output is short, different, and tellingly... not deterministic:

$ ./repro.sh 
...
FROM           TO          COUNT
216.58.193.195 10.47.5.152 11

$ ./repro.sh 
...
FROM           TO          COUNT
216.58.193.195 10.47.6.159 2
216.58.193.195 10.47.2.155 14
alfred-landrum commented 4 years ago

The failure occurs because directed-pairs.zng is no longer sorted before its used to create an index; that can be addressed immediately by altering how it is created: zar zq -o directed-pairs.zng "sum(count) as count by from,to | sort from,to" forward.zng reverse.zng

When we've given this demo, we've called out that we'd like to simplify these steps. One possibility is that we view the directed-pairs.zng as temporary data, that doesn't need to persist. So we'd modify custom index creation to take one or more files, operate on them, ensuring their desired key fields, etc.

henridf commented 4 years ago

Also, some context here is that pre-#914, groupby outputs were sorted in "undefined but deterministic" order, where "undefined" meant "compare the byte-level representations of the values". In the case of IP addrs, the byte-level comparisons actually result in the numeric order that one would expect for IPs (unlike for some other zng types). So long story short, the fact that this worked previously was a happy accident.

philrz commented 4 years ago

image

alfred-landrum commented 4 years ago

I put https://github.com/brimsec/zq/pull/954 so that anyone following the readme would get the correct behavior.

philrz commented 4 years ago

I've verified that as of zq commit 8cdcf4b, a run-through the entire zar README demo script is working again.

Thanks @alfred-landrum!