brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks source link

groupby "every" assumes forward sorted data #847

Closed henridf closed 4 years ago

henridf commented 4 years ago

If records presented to a every X groupby are not in forward order, then that groupby will emit multiple groups for a given time bin. For example:

$ zq -t "every 1h count()" ~/Downloads/sampledata/wrccdc-year1/zng-logs/conn.zng | head -n 10
#0:record[ts:time,count:uint64]
0:[1490382000;88795;]
0:[1490382000;100;]
0:[1490382000;100;]
0:[1490382000;100;]
0:[1490382000;92;]
0:[1490382000;98;]
0:[1490382000;100;]
0:[1490382000;92;]
0:[1490382000;100;]
henridf commented 4 years ago

(For anyone trying to repro this: it requires a suitable large input. A small handcrafted input that fits into one zbuf.Batch will be processed correctly.).

henridf commented 4 years ago

I now notice that this is related to the broader issue filed at https://github.com/brimsec/zq/issues/501. In any case worth having as a standalone issue as it should be fixed by a forthcoming PR.

philrz commented 4 years ago

I've confirmed with @henridf that this issue is no longer necessary, so I'm closing it. What's happened is that the change in https://github.com/brimsec/zq/pull/893 makes sure this problem is avoided by not emitting any bins until the end.

Here's verification. The original issue still existed as of zq commit 905504e that came right before the https://github.com/brimsec/zq/pull/893 change:

$ zq -version
Version: v0.14.0-21-g905504e

$ zq -t "every 1h count()" ~/Downloads/sampledata/wrccdc-year1/zng-logs/conn.zng | head -n 10
#0:record[ts:time,count:uint64]
0:[1490382000;88795;]
0:[1490382000;100;]
0:[1490382000;100;]
0:[1490382000;100;]
0:[1490382000;92;]
0:[1490382000;98;]
0:[1490382000;100;]
0:[1490382000;92;]
0:[1490382000;100;]

Then with the commit 9af46a0 (that coincides with https://github.com/brimsec/zq/pull/893), now we have unique bins as desired:

$ zq -version
Version: v0.14.0-22-g9af46a0

$ zq -t "every 1h count()" ~/Downloads/sampledata/wrccdc-year1/zng-logs/conn.zng | head -n 10
#0:record[ts:time,count:uint64]
0:[1490428800;104544;]
0:[1490486400;10427;]
0:[1490414400;116826;]
0:[1490472000;13587879;]
0:[1490400000;3702075;]
0:[1490457600;892137;]
0:[1490385600;8155901;]
0:[1490443200;103451;]
0:[1490461200;1287525;]

While unique, note that they are not sorted by default, so a downstream sort may be justified depending on what the user intends to do with the data next.

henridf commented 4 years ago

What's happened is that the change in #893 makes sure this problem is avoided by not emitting any bins until the end.

This is all correct, but adding one additional piece of context just in case anyone came across this issue without other context on what we're doing with groupby: in general, #893 does indeed wait till EOF to emit all bins. But that PR also included a feature for early-emitting bins when the incoming stream is sorted in the primary grouping key. In the case of every, that primary grouping key is ts, and groupby will early-emit bins if it knows that the input is time-sorted (currently, this happens only in zqd).