Open tustvold opened 6 months ago
impl Stream<Item=ArrayRef> - a lazy async version of a ChunkedArray - this is what DataFusion uses extensively
In case anyone wants details, this is called RecordBatchStream: https://docs.rs/datafusion/latest/datafusion/execution/trait.RecordBatchStream.html
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The question has come up a couple of times as to why we don't have a ChunkedArray abstraction. We should document why we don't and what the equivalent constructions are.
Describe the solution you'd like
A ChunkedArray is really just sugar over the top of Vec<ArrayRef>, and is used within arrow-cpp and pyarrow for representing large in-memory datasets.
Equivalent constructions in arrow-rs would be:
- Vec<ArrayRef>: a fairly exact mirror of ChunkedArray, sans some ergonomic niceties like working natively in kernels
- impl Iterator<Item=ArrayRef>: a lazy version of a ChunkedArray
- impl Stream<Item=ArrayRef>: a lazy async version of a ChunkedArray - this is what DataFusion uses extensively
There are also equivalent constructions using RecordBatch instead of ArrayRef.
These abstractions are strictly more flexible than the pyarrow ChunkedArray concept, integrate better with the Rust ecosystem, and encourage users towards lazy evaluation, which has much better memory usage characteristics.
Describe alternatives you've considered
Additional context
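To make the eager vs. lazy distinction concrete, here is a minimal sketch that uses only the standard library. The ArrayRef alias below is a stand-in (an Arc around a Vec of i64), not arrow's actual ArrayRef type, and the function names eager_chunks, lazy_chunks and sum_chunks are hypothetical, chosen for illustration. It assumes chunk_size is non-zero.

```rust
use std::sync::Arc;

// Stand-in for arrow's ArrayRef (an assumption for this sketch): a cheaply
// clonable, immutable chunk of values.
type ArrayRef = Arc<Vec<i64>>;

/// Lazy form: each chunk is built only when the consumer asks for it.
/// `chunk_size` must be non-zero, since `step_by(0)` panics.
fn lazy_chunks(total: usize, chunk_size: usize) -> impl Iterator<Item = ArrayRef> {
    (0..total).step_by(chunk_size).map(move |start| {
        let end = (start + chunk_size).min(total);
        Arc::new((start..end).map(|v| v as i64).collect::<Vec<i64>>())
    })
}

/// Eager form: every chunk is materialized up front, much like a ChunkedArray.
fn eager_chunks(total: usize, chunk_size: usize) -> Vec<ArrayRef> {
    lazy_chunks(total, chunk_size).collect()
}

/// A toy "kernel" written once against the iterator form; it accepts both
/// the lazy iterator and an eager Vec via `into_iter()`.
fn sum_chunks(chunks: impl Iterator<Item = ArrayRef>) -> i64 {
    chunks.map(|chunk| chunk.iter().sum::<i64>()).sum()
}
```

Calling sum_chunks(lazy_chunks(10, 4)) holds at most one chunk in memory at a time, whereas sum_chunks(eager_chunks(10, 4).into_iter()) allocates all chunks first. The async Stream variant follows the same shape but needs an async runtime, so it is left out of this sketch.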