Open asmirnov82 opened 11 months ago
This is great. Could the DataFrame also support streaming operations (similar to how Apache Arrow Acero is architected)? I've recently been looking to implement support for Acero in the C# Arrow client: https://github.com/apache/arrow/pull/37544
@davesearle DataFrame was designed to handle situations where all the data is in memory, so streaming was not the primary goal. However, DataFrame allows conversion to a collection of Arrow RecordBatches (in most cases without a memory copy), which is why it can potentially be used as an input for Acero record_batch_source.
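As a rough illustration of the conversion mentioned above, the sketch below builds a small DataFrame and enumerates the Arrow record batches it exposes. The column names and values are made up for the example, and it assumes the Microsoft.Data.Analysis and Apache.Arrow packages are referenced. It does not show any Acero integration.

```csharp
using System;
using Apache.Arrow;
using Microsoft.Data.Analysis;

// Hypothetical sample data, used only for illustration.
var df = new DataFrame(
    new Int32DataFrameColumn("id", new[] { 1, 2, 3 }),
    new DoubleDataFrameColumn("value", new[] { 0.5, 1.5, 2.5 }));

// ToArrowRecordBatches exposes the in-memory columns as Arrow record batches.
foreach (RecordBatch batch in df.ToArrowRecordBatches())
{
    Console.WriteLine($"{batch.ColumnCount} columns, {batch.Length} rows");
}
```

Each batch could then, in principle, be handed to any consumer that accepts Arrow record batches.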
Background and motivation
The current arithmetic and computation API of the DataFrame is inconsistent and quite slow in scenarios where columns of different types are involved, as casting a column to a different type requires copying the entire column data.
Moreover, some methods of the computation API may produce inaccurate results (due to an overflow exception).
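To make the overflow concern concrete, here is a minimal sketch in plain C# (it does not use the DataFrame API and is not the library's implementation). It assumes an aggregate that accumulates in the element type and compares that with accumulating in a wider type.

```csharp
using System;

class OverflowSketch
{
    static void Main()
    {
        int[] values = { int.MaxValue, 1 };

        // Accumulating in the element type wraps around silently
        // when the arithmetic is unchecked.
        int narrow = 0;
        foreach (int v in values)
        {
            narrow = unchecked(narrow + v);
        }

        // Accumulating in a wider type keeps the exact result.
        long wide = 0;
        foreach (int v in values)
        {
            wide += v;
        }

        Console.WriteLine(narrow); // -2147483648
        Console.WriteLine(wide);   // 2147483648
    }
}
```

With checked arithmetic the first loop would throw an OverflowException instead of wrapping, so neither mode yields the correct total unless the accumulator is widened.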
Motivation of this change includes
Details of current implementation limitations
PrimitiveColumn<T> instances and their concrete aliases (like DoubleDataFrame or Int32DataFrameColumn)
For example
results in sum column containing int values.
The same code, but referring to the columns through their parent type
results in sum column containing double values.
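The original snippets for this comparison are not included in the text here. The following is a hypothetical sketch of the kind of check described, not the author's code: it adds an Int32 column to a Double column, once through the concrete types and once through the DataFrameColumn parent type, and prints the element type of each result. The column names and values are invented, and the printed types may depend on the library version.

```csharp
using System;
using Microsoft.Data.Analysis;

// Hypothetical sample columns, used only for illustration.
var ints = new Int32DataFrameColumn("Ints", new[] { 1, 2, 3 });
var doubles = new DoubleDataFrameColumn("Doubles", new[] { 0.5, 1.5, 2.5 });

// Path 1: call Add on the concrete column types.
DataFrameColumn typedSum = ints.Add(doubles);

// Path 2: same columns, referenced through the parent type.
DataFrameColumn left = ints;
DataFrameColumn right = doubles;
DataFrameColumn baseSum = left.Add(right);

// Compare the element types produced by the two paths.
Console.WriteLine(typedSum.DataType);
Console.WriteLine(baseSum.DataType);
```

If the two printed types differ, overload resolution on the static type is selecting a different code path for the same data.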
var sum = column.Sum();
var mean = column.Mean();