brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Improve sorting used during ingest #525

Closed philrz closed 4 years ago

philrz commented 4 years ago

Brim v0 ensures records are sorted by timestamp, starting from the TSV logs generated by Zeek and then invoking the sort processor with a limit:

sort -r -limit 10000000 ts

We should use a different technique that's efficient and doesn't impose a hardcoded cap on the number of records.

philrz commented 4 years ago

Verified in Brim commit e9b0840 talking to zqd commit 48ed30a.

It took a bit of waiting, but I was able to import the wrccdc "year1" data set: 12 GB of uncompressed Zeek logs, which became 8.8 GB of uncompressed all.zng. I was no longer blocked by the 10-million-record sort limit that used to be in place. Once I was presented with the splash of newest events, I ran a count() and confirmed it matched the count from running zq over the unsorted data set.


~/Downloads/sampledata/wrccdc-year1/zeek-logs$ zq -t "count()" *
#0:record[count:uint64]
0:[96217189;]

Thanks @nwt!