apache / arrow-julia

Official Julia implementation of Apache Arrow
https://arrow.apache.org/julia/

Writing and Reading Random Access Files #434

Open okartal opened 1 year ago

okartal commented 1 year ago

Maybe related to #353

It is already possible to use Tables.partitioner to write multiple record batches to a single Arrow file. However, when I read that file with Arrow.Table, I do not know how to access a specific record batch, as described here: https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files

According to the docs, this should be possible, but I am not sure whether it is not implemented yet or simply not documented.

quinnj commented 1 year ago

You're right that we don't expose this very well (i.e., at all) via Arrow.Table right now, but Arrow.Stream gives you back an iterator that yields an Arrow.Table for each record batch. We could probably also expose a way via Arrow.Table to let you get the individual tables. Something to think about, or at least we could improve the docs by mentioning Arrow.Stream.
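
A minimal sketch of the Arrow.Stream approach, assuming a small made-up table with a single column `a` and a temporary file path. The `nth_batch` helper is hypothetical and not part of Arrow.jl. It still walks the stream from the start, so it is sequential access rather than true random access:

```julia
using Arrow, Tables

# Three small column tables; each partition is written as its own record batch.
parts = Tables.partitioner([(a = [1, 2],), (a = [3, 4],), (a = [5, 6],)])
path = tempname() * ".arrow"
Arrow.write(path, parts)

# Arrow.Stream yields one Arrow.Table per record batch.
for (i, batch) in enumerate(Arrow.Stream(path))
    println("batch ", i, ": ", collect(batch.a))
end

# Hypothetical helper (not an Arrow.jl API): skip the first n - 1 batches
# and return the n-th one as an Arrow.Table.
nth_batch(path, n) = first(Iterators.drop(Arrow.Stream(path), n - 1))

second = nth_batch(path, 2)
```

Reading the whole file with Arrow.Table(path) instead would present the batches as one table, which is why the per-batch boundaries are not visible there.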

okartal commented 1 year ago

According to https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-random-access-files, we need a seek method to implement random access to a batch.

Moelf commented 1 year ago

We don't have to do what the Python implementation says; that's specific to Python. A record batch is a well-defined thing in the file format, independent of which implementation we're talking about. It's purely a logical question of how we get there given the schema/metadata, and what the interface for the user should be.