brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

implement zdx bundle as a single file #612

Closed: alfred-landrum closed this 4 years ago

alfred-landrum commented 4 years ago

Filing from a live discussion on https://github.com/brimsec/zq/pull/600: we could implement the zdx bundle as a single file, with potentially very little work, by concatenating the btree files with the base file. In this case, the stored offsets would be interpreted as offsets into the next stream of the overall zdx file.

alfred-landrum commented 4 years ago

The zdx file would become a concatenation of streams:

Creating the zdx would work like this:

To find keys with this layout, a zdx reader would open the zdx file, read the super-block, follow its offsets to the btree toc, and then execute a btree search.
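The single-file idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the zq implementation: the names `build_bundle` and `lookup`, the JSON-lines record encoding, the fixed fanout, and a super-block that sits at the start of the file and lists each stream's length. The only property taken from the discussion is that each index entry stores an offset that is interpreted relative to the next stream in the file.

```python
import io
import json
import struct


def _encode(records):
    """Serialize records as JSON lines; return the bytes and each record's offset."""
    buf, offsets = io.BytesIO(), []
    for rec in records:
        offsets.append(buf.tell())
        buf.write(json.dumps(rec).encode() + b"\n")
    return buf.getvalue(), offsets


def build_bundle(pairs, fanout=4):
    """Build a one-file bundle from (key, value) pairs sorted by unique key.

    Hypothetical layout: super-block, then index streams from the top level
    down, then the base stream.
    """
    data, offsets = _encode([{"key": k, "value": v} for k, v in pairs])
    streams = [data]  # the base stream is built first
    keys = [k for k, _ in pairs]
    # Add index levels bottom-up until the top level has at most `fanout` entries.
    while len(keys) > fanout:
        entries = [{"key": keys[i], "offset": offsets[i]}
                   for i in range(0, len(keys), fanout)]
        data, offsets = _encode(entries)
        streams.append(data)
        keys = [e["key"] for e in entries]
    streams.reverse()  # top index level first, base stream last
    sizes = [len(s) for s in streams]
    super_block = struct.pack("<I", len(sizes)) + struct.pack(f"<{len(sizes)}Q", *sizes)
    return super_block + b"".join(streams)


def lookup(bundle, key):
    """Return the value stored for key, or None if the key is absent."""
    f = io.BytesIO(bundle)
    (n,) = struct.unpack("<I", f.read(4))
    sizes = struct.unpack(f"<{n}Q", f.read(8 * n))
    starts, pos = [], 4 + 8 * n
    for size in sizes:
        starts.append(pos)
        pos += size
    offset = 0  # offset into the current stream, taken from the level above
    for level in range(n):
        end = starts[level] + sizes[level]
        f.seek(starts[level] + offset)
        best = None
        # Scan forward to the last record whose key is <= the target key.
        while f.tell() < end:
            rec = json.loads(f.readline())
            if rec["key"] > key:
                break
            best = rec
        if best is None:
            return None
        if level == n - 1:  # base stream: require an exact match
            return best["value"] if best["key"] == key else None
        offset = best["offset"]  # interpreted as an offset into the next stream
    return None
```

For example, `lookup(build_bundle([("a", 1), ("b", 2)]), "b")` returns `2`. Because offsets are relative to the start of the following stream, the per-level files can be concatenated without rewriting any index entry, which is why the change could need very little work.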

philrz commented 4 years ago

Verified in zq commit 04f2c11.

For comparison, here is how things looked at zq commit 90dddfc, right before this change. After creating the micro-indexes as shown in the zar README, note the presence of the .1 files.

$ zq zng/*.gz | zar import -s 25MB -
$ zar index :ip
$ zar index uri
$ zar index -q -o custom -k id.orig_h -z "count() by _path, id.orig_h | sort id.orig_h"
$ tree -s logs/
logs/
├── [        320]  20180324
│   ├── [    4425007]  1521911772.980384.zng
│   ├── [        192]  1521911772.980384.zng.zar
│   │   ├── [       6303]  custom.zng
│   │   ├── [        119]  zdx-field-uri.1.zng
│   │   ├── [      96953]  zdx-field-uri.zng
│   │   └── [      29732]  zdx-type-ip.zng
│   ├── [   25001925]  1521912075.114273.zng
│   ├── [        192]  1521912075.114273.zng.zar
│   │   ├── [       8424]  custom.zng
│   │   ├── [         59]  zdx-field-uri.1.zng
│   │   ├── [     124555]  zdx-field-uri.zng
│   │   └── [      15699]  zdx-type-ip.zng
│   ├── [   25007413]  1521912507.399929.zng
│   ├── [        224]  1521912507.399929.zng.zar
│   │   ├── [      12323]  custom.zng
│   │   ├── [        143]  zdx-field-uri.1.zng
│   │   ├── [     172595]  zdx-field-uri.zng
│   │   ├── [         41]  zdx-type-ip.1.zng
│   │   └── [      80764]  zdx-type-ip.zng
│   ├── [   25005195]  1521912990.158766.zng
│   └── [        160]  1521912990.158766.zng.zar
│       ├── [       7538]  custom.zng
│       ├── [      62757]  zdx-field-uri.zng
│       └── [      28392]  zdx-type-ip.zng
└── [        778]  zar.json

5 directories, 21 files

After repeating the same steps at zq commit 04f2c11, which includes this enhancement, the .1 files are gone and the total drops from 21 files to 17.

$ tree -s logs/
logs/
├── [        320]  20180324
│   ├── [    4425007]  1521911772.980384.zng
│   ├── [        160]  1521911772.980384.zng.zar
│   │   ├── [       6347]  custom
│   │   ├── [      97012]  zdx-field-uri.zng
│   │   └── [      29766]  zdx-type-ip.zng
│   ├── [   25001925]  1521912075.114273.zng
│   ├── [        160]  1521912075.114273.zng.zar
│   │   ├── [       8468]  custom
│   │   ├── [     124614]  zdx-field-uri.zng
│   │   └── [      15733]  zdx-type-ip.zng
│   ├── [   25007413]  1521912507.399929.zng
│   ├── [        160]  1521912507.399929.zng.zar
│   │   ├── [      12367]  custom
│   │   ├── [     172703]  zdx-field-uri.zng
│   │   └── [      80825]  zdx-type-ip.zng
│   ├── [   25005195]  1521912990.158766.zng
│   └── [        160]  1521912990.158766.zng.zar
│       ├── [       7582]  custom
│       ├── [      62792]  zdx-field-uri.zng
│       └── [      28426]  zdx-type-ip.zng
└── [        794]  zar.json

5 directories, 17 files
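
The same check can be scripted rather than read off the `tree` output. This is a hedged sketch: `find_side_files` is a hypothetical helper, and the assumption that btree side files are named `<index>.<number>.zng` inside `*.zar` directories comes only from the listings above.

```python
import re
from pathlib import Path

# Matches numbered side files such as zdx-field-uri.1.zng.
SIDE_FILE = re.compile(r"\.\d+\.zng$")


def find_side_files(root):
    """Return sorted paths of numbered side files inside *.zar directories under root."""
    return sorted(
        p for p in Path(root).rglob("*.zng")
        # Restrict to *.zar directories so that chunk files such as
        # 1521911772.980384.zng are not mistaken for side files.
        if p.parent.name.endswith(".zar") and SIDE_FILE.search(p.name)
    )
```

Calling `find_side_files("logs")` should return an empty list once each index is a single file.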

Thanks @mccanne!