brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

zar import missing records #1579

Closed. alfred-landrum closed 3 years ago

alfred-landrum commented 3 years ago

Using the ~4GB wrccdc-year1 logs, I see a large discrepancy in record counts after importing them into zar:

$ zq -t "count()" ~/work/data/wrccdc-year1/zeek-logs-sort-r-ts.zng
#0:record[count:uint64]
0:[96217189;]
...
$ zar import -s 128MiB ~/work/data/wrccdc-year1/zeek-logs-sort-r-ts.zng
...
$ zar zq -t "count()"
#0:record[count:uint64]
0:[94839372;]
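To put a number on the discrepancy, here is a minimal Python sketch that parses the two count outputs shown above and subtracts them. The helper name `parse_count` and the regular expression are illustrative assumptions based only on the output lines in this report; they are not part of zq or zar.

```python
import re

# Matches a value line such as "0:[96217189;]", as printed above.
# The pattern is inferred from this report's output, not from a format spec.
COUNT_LINE = re.compile(r"^\d+:\[(\d+);\]$")


def parse_count(output: str) -> int:
    """Return the count from the text printed by `zq -t "count()"`."""
    for line in output.splitlines():
        match = COUNT_LINE.match(line.strip())
        if match:
            return int(match.group(1))
    raise ValueError("no count record found in output")


# Outputs copied from the two runs above.
source = parse_count("#0:record[count:uint64]\n0:[96217189;]\n")
archive = parse_count("#0:record[count:uint64]\n0:[94839372;]\n")

missing = source - archive
print(f"{missing} records missing after import")  # 1377817 records missing after import
```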

Additionally, if I try to manually verify against the data files in the resulting archive, I see this zq error:

$ find zarroot -name "d*.zng" | xargs zq -t "count()"
zarroot/zd/20170324/d-1k4BMtV8RQt4HzJxdVMFhkJv3DA.zng: _path (string): expected primitive type, got container

I wonder whether this is related to some records sharing the same timestamp, since I saw a similar problem while working on the overlap support (though note that the runs above don't perform any compaction).

philrz commented 3 years ago

Verified in zq commit 21b50d02.

The symptom was present as of commit 9ab4d535, which came before the change in the first of the two linked PRs. My own ZNG version of the wrccdc-year1 logs yielded a different erroneous event count, but showed the same symptoms: a count that differs from the source and the same error message.

$ zq -version
Version: v0.23.0-9-g9ab4d535

$ zq -t "count()" *
#0:record[count:uint64]
0:[96217189;]

$ zar import -s 128MiB *
$ zar zq -t "count()"
#0:record[count:uint64]
0:[96141291;]

$ find $ZAR_ROOT -name "d*.zng" | xargs zq -t "count()"
/Users/phil/logs/zd/20170324/d-1kNcQnzxTQLZvoN3owklLVK10qr.zng: _path (string): expected primitive type, got container

Now, at zq commit 21b50d02, the counts match and the final error message no longer appears.

$ zq -version
Version: v0.23.0-23-g21b50d02

$ zq -t "count()" *
#0:record[count:uint64]
0:[96217189;]

$ zar import -s 128MiB *
$ zar zq -t "count()"
#0:record[count:uint64]
0:[96217189;]

$ find $ZAR_ROOT -name "d*.zng" | xargs zq -t "count()"
#0:record[count:uint64]
0:[96217189;]
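One caveat with the `find | xargs` check: if the file list is long enough, `xargs` may split it across several `zq` invocations, and each would print its own count record. The run above printed a single record, so that did not happen here. For the general case, a hedged Python sketch that sums every count line in the captured output could look like this. The function name `total_count` and the sample numbers are hypothetical and not taken from any real run.

```python
import re

# Matches a value line such as "0:[96217189;]"; the pattern is inferred from
# the output in this thread, not from a format specification.
COUNT_LINE = re.compile(r"^\d+:\[(\d+);\]$")


def total_count(output: str) -> int:
    """Sum all count records found in concatenated `zq -t "count()"` output."""
    total = 0
    for line in output.splitlines():
        match = COUNT_LINE.match(line.strip())
        if match:
            total += int(match.group(1))
    return total


# Hypothetical output from two separate zq invocations (made-up numbers).
sample = (
    "#0:record[count:uint64]\n0:[10;]\n"
    "#0:record[count:uint64]\n0:[32;]\n"
)
print(total_count(sample))  # 42
```

With a single invocation, as in the run above, the sum is simply the one count printed.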

Thanks @mattnibs!