That3Percent / tree-buf

An experimental serialization system written in Rust
MIT License

Iterator based write API #17

Open That3Percent opened 4 years ago

elibenporat commented 4 years ago

Some thoughts on the Write (Encode?) API, specific to my use case; this is somewhat relevant to the Read (Decode) API as well. Now that I've built it into my codebase, I have a clearer understanding of it.

BOSS consumes data from the MLB API and then persists it to disk. To avoid duplicating work, it currently caches the games it has already pulled in a separate file (only the GamePK is cached). I think my needs on the read/decode side are covered, as all I need is the GamePK column, which, based on your Readme, should be very efficient to query from the master data file. This would save me from worrying that I cached a game as complete but then had an issue when persisting the data to disk.

For writes, it currently does a simple append to a CSV file. Ideally, there would be some way for me to get the encoding format from the current file, tell tree-buf to use that encoding for the new data, and get back bytes that I can just append to the current file. This would also be a huge performance win, since tree-buf wouldn't need to analyze the source data again; all of that work would be done with the first file it creates.

I assume there would be some complexity in reading/updating the dictionary, so I don't know how technically feasible this is. It may require splitting the "decode" bytes from the "data" bytes, so that the decode bytes can be overwritten, or appended to, separately from the "data" bytes.

Building on that, it may be useful to have the ability to do a deeper sample of the initial file, for those who will want to append rather than rebuild.

I still owe you the diagnostics on the main data set, will hopefully get to that next weekend.

That3Percent commented 4 years ago

I'm not entirely sure that I understand every aspect of your use-case, but having a first-class API for mutable/append data within a Tree-Buf file is somewhere between unlikely and far in the future. I'd like to give a basic idea of why the file format might make that difficult, but then get into how you can take advantage of what Tree-Buf has today to solve your problem.

Complications

Tree-Buf stores all data separated by its "path" from the root to that data. For a CSV this is relatively simple, but it is not the only case Tree-Buf supports; still, the CSV case is useful for analysis. Let's consider a 2-column CSV with headers "A" and "B".

A schema for such a CSV might look like this:

struct Row {
  a: String,
  b: String,
}
type Schema = Vec<Row>;

This is what that would look like on disk in Tree-Buf, assuming 10 rows:

[type: Vec, len: 10, elements: [
  type: Object, field count: 2,
  field "A": [type: UTF-8, buffer: [byte_len: 30, data: ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10"]]],
  field "B": [type: UTF-8, buffer: [byte_len: 30, data: ["b1", "b2", "b3", "b4", "b5", "b6", "b7", "b8", "b9", "b10"]]]
]]

Now, we want to add 10 more rows. I count 5 places in the file that need to be modified:

  1. root.len
  2. root.elements.field_a.buffer.byte_len
  3. root.elements.field_a.buffer.data
  4. root.elements.field_b.buffer.byte_len
  5. root.elements.field_b.buffer.data

Editing any of these locations may cause all the other data in the file to shift down, at which point you are basically re-writing the file. This is true even for, e.g., the first item, root.len, since it is stored as a PrefixVarInt, which uses only 1 byte for values up to 127 but at least 2 bytes for anything larger.
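
To make that size jump concrete, here is a minimal sketch of a LEB128-style variable-length integer encoder. The name encode_varint is hypothetical and this is not necessarily the exact PrefixVarInt layout Tree-Buf uses; it just illustrates the same 7-bits-per-byte idea:

// Illustrative LEB128-style varint: 7 payload bits per byte, with the high
// bit set on every byte except the last. Not necessarily Tree-Buf's exact format.
fn encode_varint(mut value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    while value >= 0x80 {
        out.push((value as u8 & 0x7F) | 0x80); // more bytes follow
        value >>= 7;
    }
    out.push(value as u8); // final byte, high bit clear
    out
}

fn main() {
    assert_eq!(encode_varint(10).len(), 1);  // a length of 10 fits in 1 byte
    assert_eq!(encode_varint(127).len(), 1); // still 1 byte
    assert_eq!(encode_varint(128).len(), 2); // crossing 127 grows the field
}

The point is that the encoded width of root.len depends on its value, so an in-place overwrite only works when the new value happens to need the same number of bytes; otherwise everything after it has to move.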

Consider also that the "appending rows" use-case applies to only a subset of the kinds of data that Tree-Buf supports. For a lot of data (say, a GraphQL response, which may have fields at the root that aren't described as rows) it's hard to describe how this API would interact with those other schemas.

This is a fundamental trade-off that Tree-Buf makes that enables the compression in the first place.

What you should do

If you are retrieving data in reasonably sized chunks that compress well on their own with Tree-Buf, then consider just appending independent logical Tree-Buf files to the physical file. In other words, use a basic framing protocol around Tree-Buf.

E.g. (pseudocode): let chunk = encode(data). Then append it to your file like so: let len = encodePrefixVarint(chunk.len()); write_to_file(len); write_to_file(chunk). On the read side you would just do let (len, offset) = decodePrefixVarint(stream); let tb_file = read_bytes(len, offset, stream) in a loop until the file is exhausted.
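
If it helps, here is a minimal self-contained sketch of that framing layer in Rust. The names append_chunk and read_chunks are hypothetical, the chunk bytes are whatever Tree-Buf's encode step produced for one batch (the encode(data) in the pseudocode above), and for simplicity it uses a fixed 4-byte little-endian length prefix instead of a prefix varint:

use std::io::{self, Read, Write};

// Framing sketch: each logical Tree-Buf file is stored as
// [4-byte little-endian length][chunk bytes].
fn append_chunk<W: Write>(file: &mut W, chunk: &[u8]) -> io::Result<()> {
    file.write_all(&(chunk.len() as u32).to_le_bytes())?;
    file.write_all(chunk)
}

// Read back every chunk until the physical file is exhausted; each returned
// Vec<u8> is one logical Tree-Buf file and can be decoded independently.
fn read_chunks<R: Read>(file: &mut R) -> io::Result<Vec<Vec<u8>>> {
    let mut chunks = Vec::new();
    loop {
        let mut len_bytes = [0u8; 4];
        match file.read_exact(&mut len_bytes) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break, // clean end of file
            Err(e) => return Err(e),
        }
        let len = u32::from_le_bytes(len_bytes) as usize;
        let mut chunk = vec![0u8; len];
        file.read_exact(&mut chunk)?;
        chunks.push(chunk);
    }
    Ok(chunks)
}

Since every chunk decodes on its own, reading the whole physical file is just a loop over the chunks, which matches the read-side pseudocode above.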

elibenporat commented 4 years ago

Makes a lot of sense. I certainly don't want you to go down a path of adding complexity to your code base for my specific use case (especially when it's a non-goal of the project).

Appending rows is a best-case scenario for me, but not super-critical. I'll probably just put all new data into a temporary tree-buf file and then merge it into the main tree-buf file at the end. This should yield some nice performance diagnostics.

Thanks for the detailed explanation.