apache / arrow-julia

Official Julia implementation of Apache Arrow
https://arrow.apache.org/julia/

explanation of Arrow.Stream vs. Arrow.Table seems ambiguous #472

Open bdklahn opened 1 year ago

bdklahn commented 1 year ago

https://github.com/apache/arrow-julia/blob/f8d2203b07380e1423723b5bfe32356aa1239284/docs/src/manual.md?plain=1#L189

Italicizing iterate for the second instance vs. the first doesn't seem to illustrate any unambiguous difference.

Does "Table" iterate batch units, while "Stream" iterates record units?

As I understand it, a stream might be more like "on-demand" iteration, like a generator, which only advances when necessary or able (e.g. when space opens in a fixed buffer), while "Table" might "eagerly" load an entire table (batch) at a time.

It's not clear to me what the units of each iteration are, or what the difference is if both are iterating over a table.

Maybe it would be simpler to see, if I just look at the code. :-)

Moelf commented 1 year ago

> While Arrow.Table will iterate all record batches in an arrow file/stream, concatenating columns

this is saying it iterates AND concatenates, so it holds the entire table in RAM (which is especially a problem if you're dealing with a compressed file).
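
For concreteness, here's a minimal sketch of the two access patterns (file name hypothetical; this assumes the Arrow.jl API where iterating an Arrow.Stream yields one table per record batch):

```julia
using Arrow

# Arrow.Table reads through *all* record batches up front, presenting
# their columns as if concatenated into a single table.
tbl = Arrow.Table("data.arrow")

# Arrow.Stream reads one record batch per iteration, so only one
# batch's worth of (decompressed) data needs to be buffered at a time.
for batch in Arrow.Stream("data.arrow")
    # process `batch` (itself table-like), then let it be collected
end
```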

bdklahn commented 1 year ago

> While Arrow.Table will iterate all record batches in an arrow file/stream, concatenating columns
>
> this is saying it iterates AND concatenates, so it holds the entire table in RAM (which is especially a problem if you're dealing with a compressed file).

That's part of the problem: The documentation is NOT saying that. You are.

After re-reading a dozen times, I think I might understand what this means to say.

I think it is partly confusing to use the word "iterate". That is a verb, which indicates something is happening now (e.g. when a Table or Stream object is instantiated). It might make more sense to say "iterator" (noun): an object, like a generator or a file handle, which points to a position on disk (or in memory) and has some state tracking how far to jump ahead on each iteration.

I believe when a Table is instantiated, it presents a view where all the batches appear as if there were a single "batch" (one table). In this case, an iterator might be constructed with a step size of only one record (e.g. row). When a Stream object is constructed, it looks like each step will produce all the records in a batch and wrap them as a Table. Each iteration will produce a new table (as the docs indicate).
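
If that reading is right, a small sketch of the per-batch interface might look like this (assuming each stream element supports the Tables.jl interface, as the docs suggest; file and column names are hypothetical):

```julia
using Arrow, Tables

for batch in Arrow.Stream("data.arrow")
    cols = Tables.columns(batch)    # each batch is itself a Tables.jl table
    x = Tables.getcolumn(cols, :x)  # hypothetical column named `x`
    println(sum(x))                 # operate on this batch's records only
end
```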

I understand (from deduction and experience) that compressed Arrow requires an additional "buffer" for the actual binary format, because compressed bits aren't in a memmap-friendly form. So, yes, if you use compressed Arrow, loading it in batch by batch can mitigate this. But if you are using Arrow at all, it almost doesn't make sense to compress (even with, e.g., the lz4-compressed feather format); if you are doing any compression, maybe just use Parquet. So, given normal (uncompressed) Arrow, an "entire table in RAM" should not be an issue. That's one of the main purposes of using Arrow in the first place: "out of memory" processing. I would think the memory-loaded schema size for a Table would not be significantly bigger than that of each tabular batch (if at all).
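
To sketch that compressed case (the `compress` keyword is the one Arrow.jl documents for lz4/zstd; `Tables.partitioner` is just one way to force multiple record batches, and the names here are illustrative):

```julia
using Arrow, Tables

# Each partition becomes its own record batch on write; compressing means
# readers can no longer mmap the buffers directly and must decompress.
parts = Tables.partitioner([(x = rand(10^5),) for _ in 1:10])
Arrow.write("data.lz4.arrow", parts; compress = :lz4)

# Reading batch by batch keeps the decompression buffer to one batch at a time.
for batch in Arrow.Stream("data.lz4.arrow")
    # ...
end
```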

Anyway, in terms of the Table interface, it might even be bad practice to mention iteration at all. Typically you want to encourage thinking in terms of vectorization (e.g. broadcasting over as many rows at once as possible) rather than implying anything is processed row by row, at least for DataFrame-style interfacing. For Stream, it makes sense to mention iteration (for the comparison context), because the "processing" here is the actual process of loading (and unloading) batches of records in and out of memory.
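
For instance, the column-oriented style the Table interface encourages might look like this (column name hypothetical):

```julia
using Arrow

tbl = Arrow.Table("data.arrow")

# Columns come back as ordinary AbstractVectors, so broadcast over whole
# columns rather than looping record by record.
doubled = tbl.x .* 2   # assumes a column named `x`
total = sum(tbl.x)
```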

I'm not sure if something a little more like this would make sense (and is accurate):

". . . While Arrow.Table provides an interface where all record batches appear vertically concatenated into a single table, Arrow.Stream creates an iterator, where each iteration produces a separate table from each batch. A Stream can be helpful for large compressed files (e.g. lz4-compressed feather files), where decompressed arrow data will need to be buffered in memory. The buffer would only need to accommodate one batch at a time, vs. all the batches at once, as would be the case with Arrow.Table."

That could probably be simplified. I left out the "concatenating columns" because I'm not sure how relevant or distracting that might be in this context. I mean . . . that's already implicit in the definition of an Arrow batch.

Moelf commented 1 year ago

> That's part of the problem: The documentation is NOT saying that. You are.

Beats me. Ever since this project merged into the Apache monorepo, it's been impossible to get anything through in a responsive manner; as a matter of fact, the "docs are not rendering up to date" issue took what feels like almost a year to address. So sorry, I can only add information in GitHub issues :)