Dandandan opened this issue 11 months ago
Was that because this counting operation can be performed during scanning?
Looks like it's a case of aggregate pushdown. For `min()`/`max()`/`count()` aggregate functions on Parquet, it's possible to compute the result for a whole column using only the metadata, without a full scan. To do that I think `update_record_batch()` is needed, and possibly also allowing `RecordBatch` to carry more flexible payloads.
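As a rough illustration of why metadata alone is enough for `count`, here is a minimal sketch using the `parquet` crate (the helper name and file path are hypothetical, not part of this proposal): the total row count is read straight from the file footer.

```rust
use std::error::Error;
use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

// Hypothetical helper: the Parquet footer already stores the total row
// count, so a count(*)-style query can be answered without decoding
// any column data at all.
fn count_rows_from_metadata(path: &str) -> Result<i64, Box<dyn Error>> {
    let file = File::open(path)?;
    let reader = SerializedFileReader::new(file)?;
    Ok(reader.metadata().file_metadata().num_rows())
}
```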
I think using a `RecordBatch` rather than `&[ArrayRef]` makes sense.
If we are going to change the API anyway, I recommend considering changing the signature to `ColumnarValue` so it can handle either a `RecordBatch` or a `ScalarValue`.
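For reference, in current DataFusion `ColumnarValue` is an enum over an array or a scalar, which already avoids materializing constants. A minimal sketch of how a signature based on it could let a count-style consumer skip touching array data (the `num_input_rows` helper is hypothetical):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int64Array};
use datafusion::common::ScalarValue;
use datafusion::physical_plan::ColumnarValue;

// Hypothetical helper: with a ColumnarValue-based signature, a scalar
// input does not need to be expanded into an array to be "counted".
fn num_input_rows(value: &ColumnarValue, batch_rows: usize) -> usize {
    match value {
        ColumnarValue::Array(array) => array.len(),
        // A scalar logically repeats once per row in the batch.
        ColumnarValue::Scalar(_) => batch_rows,
    }
}

fn main() {
    let array: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3]));
    assert_eq!(num_input_rows(&ColumnarValue::Array(array), 3), 3);

    let scalar = ColumnarValue::Scalar(ScalarValue::Int64(Some(1)));
    assert_eq!(num_input_rows(&scalar, 8192), 8192);
}
```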
The other thing maybe we can think about while messing with the `Accumulator` trait is how we might expose `GroupsAccumulator` as well 🤔
I looked a bit more into this; it looks like currently we mostly get away with converting scalars, e.g. the count expression `count(Int64(1))`, to an array with `to_array_of_size`. This is a bit wasteful, but also not extremely bad (as long as the size is not enormous).
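To make the waste concrete, here is a small sketch (assuming a recent DataFusion where `to_array_of_size` returns a `Result`; the batch size is illustrative) showing the literal `1` being expanded into a batch-length array just so it can be counted:

```rust
use datafusion::common::ScalarValue;

fn main() -> datafusion::error::Result<()> {
    // count(Int64(1)) currently materializes the literal into a
    // full-length array before the accumulator ever sees it.
    let batch_size = 8192;
    let ones = ScalarValue::Int64(Some(1)).to_array_of_size(batch_size)?;
    assert_eq!(ones.len(), batch_size); // 8192 identical values allocated
    Ok(())
}
```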
Is your feature request related to a problem or challenge?
Currently the `CountAccumulator` implementation requires `values: &[ArrayRef]` to be passed. In order to eliminate scanning a (first) column, we need to be able to accept a `RecordBatch` or `num_rows` instead of `values: &[ArrayRef]`.

Describe the solution you'd like
Rather than changing every method to accept a `RecordBatch` (and needing to update the code), I propose adding two new methods:

`update_record_batch(&mut self, recordbatch: &RecordBatch)`
`retract_record_batch(&mut self, recordbatch: &RecordBatch)`

The default implementations of these methods can use `update_batch` and `retract_batch` (i.e. assume there is at least one column). In the aggregation code, we call `update_record_batch` / `retract_record_batch` instead; see the sketch below.
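A minimal sketch of what this could look like, assuming a simplified stand-in for the real `Accumulator` trait (the `CountAccumulator` body here is illustrative, not the actual implementation):

```rust
use arrow::array::{Array, ArrayRef};
use arrow::record_batch::RecordBatch;
use datafusion::error::Result;

// Simplified stand-in for the Accumulator trait, showing only the
// methods relevant to this proposal.
trait Accumulator {
    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()>;
    fn retract_batch(&mut self, values: &[ArrayRef]) -> Result<()>;

    // Proposed additions: default implementations fall back to the
    // column-based methods, assuming at least one column is present.
    fn update_record_batch(&mut self, batch: &RecordBatch) -> Result<()> {
        self.update_batch(batch.columns())
    }
    fn retract_record_batch(&mut self, batch: &RecordBatch) -> Result<()> {
        self.retract_batch(batch.columns())
    }
}

// A count accumulator can then override the record-batch method and use
// only the row count, never touching any column data.
struct CountAccumulator {
    count: i64,
}

impl Accumulator for CountAccumulator {
    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
        let array = &values[0];
        self.count += (array.len() - array.null_count()) as i64;
        Ok(())
    }
    fn retract_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
        let array = &values[0];
        self.count -= (array.len() - array.null_count()) as i64;
        Ok(())
    }
    fn update_record_batch(&mut self, batch: &RecordBatch) -> Result<()> {
        self.count += batch.num_rows() as i64;
        Ok(())
    }
}
```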
Describe alternatives you've considered
No response
Additional context
No response