apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Document ChunkedArray Abstractions #5295

Open tustvold opened 6 months ago

tustvold commented 6 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The question of why we don't have a ChunkedArray abstraction has come up a couple of times. We should document why we don't, and what the equivalent constructions are.

Describe the solution you'd like

A ChunkedArray is really just sugar over the top of Vec<ArrayRef>, and is used within arrow-cpp and pyarrow for representing large in-memory datasets.
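To make the "sugar over Vec<ArrayRef>" point concrete, here is a minimal sketch of what such a wrapper amounts to. It uses only the standard library: `ArrayRef` below is a stand-in alias for `Arc<[i64]>`, not the real arrow-rs `ArrayRef` (which is `Arc<dyn Array>`), and `ChunkedArray`, `len` and `get` are hypothetical names for illustration, not an arrow-rs API.

```rust
use std::sync::Arc;

/// Stand-in for arrow-rs's `ArrayRef` (really `Arc<dyn Array>`). This alias is
/// an assumption so the sketch compiles with the standard library alone.
type ArrayRef = Arc<[i64]>;

/// Hypothetical wrapper: nothing more than an ordered list of chunks.
struct ChunkedArray {
    chunks: Vec<ArrayRef>,
}

impl ChunkedArray {
    /// Total number of values across all chunks.
    fn len(&self) -> usize {
        self.chunks.iter().map(|c| c.len()).sum()
    }

    /// Look up a logical index by walking the chunks in order.
    fn get(&self, mut index: usize) -> Option<i64> {
        for chunk in &self.chunks {
            if index < chunk.len() {
                return Some(chunk[index]);
            }
            // Not in this chunk: shift the index into the next chunk's range.
            index -= chunk.len();
        }
        None
    }
}
```

Everything the wrapper does is a one-line fold or loop over the inner `Vec`, which is why a plain `Vec<ArrayRef>` already covers the eager case.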

Equivalent constructions in arrow-rs would be:

There are also equivalent constructions using RecordBatch instead of ArrayRef.

These abstractions are strictly more flexible than the pyarrow ChunkedArray concept, integrate better with the Rust ecosystem, and encourage users towards lazy evaluation, which has much better memory usage characteristics.
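A rough sketch of the eager versus lazy distinction, again using only the standard library. `ArrayRef` is a stand-in alias for `Arc<[i64]>` rather than the real arrow-rs type, and `lazy_chunks`, `sum_eager` and `sum_lazy` are hypothetical helpers invented for this example.

```rust
use std::sync::Arc;

/// Stand-in for arrow-rs's `ArrayRef` (really `Arc<dyn Array>`); an assumption
/// made so the sketch needs no external crates.
type ArrayRef = Arc<[i64]>;

/// Hypothetical chunk source: yields `num_chunks` chunks of `chunk_len`
/// consecutive integers each. Each chunk is only built when the iterator is polled.
fn lazy_chunks(num_chunks: usize, chunk_len: usize) -> impl Iterator<Item = ArrayRef> {
    (0..num_chunks).map(move |i| {
        let start = (i * chunk_len) as i64;
        let values: Vec<i64> = (start..start + chunk_len as i64).collect();
        ArrayRef::from(values)
    })
}

/// Eager: the caller has already materialized every chunk in memory.
fn sum_eager(chunks: &[ArrayRef]) -> i64 {
    chunks.iter().map(|c| c.iter().sum::<i64>()).sum()
}

/// Lazy: chunks are consumed one at a time and dropped after use,
/// so only a single chunk needs to be alive at any point.
fn sum_lazy(chunks: impl Iterator<Item = ArrayRef>) -> i64 {
    chunks.map(|c| c.iter().sum::<i64>()).sum()
}
```

Calling `sum_lazy(lazy_chunks(n, len))` gives the same result as collecting into a `Vec<ArrayRef>` and calling `sum_eager`, but peak memory is bounded by one chunk rather than the whole dataset. The lazy form can also always be turned into the eager one with `.collect()`, while the reverse does not recover the memory savings.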

Describe alternatives you've considered

Additional context

alamb commented 6 months ago

impl Stream: a lazy async version of a ChunkedArray. This is what DataFusion uses extensively.

In case anyone wants details, this is called RecordBatchStream:

https://docs.rs/datafusion/latest/datafusion/execution/trait.RecordBatchStream.html