tustvold commented 1 year ago

Currently a number of operations are implemented directly on ScalarValue, including:

Not only does this result in a huge amount of code, but these operations also don't behave the same way as their array counterparts.

For example:

These kernels largely appear to exist for the purposes of aggregation, where the aggregated types are known statically. We should replace these uses with specialization, as done in https://github.com/apache/arrow-datafusion/pull/6800#discussion_r1248104156. The remaining uses should use the new Datum abstraction (https://github.com/apache/arrow-rs/pull/4393) so that they go through the same arrow-rs kernels (https://github.com/apache/arrow-rs/pull/4465).
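To illustrate what "specialization" means here, below is a minimal std-only sketch. None of these names are the real DataFusion or arrow-rs API: `Scalar`, `add_dynamic`, `add_opt`, and `SumAccumulator` are hypothetical stand-ins. The first half shows the dynamic style, where every addition re-checks type tags at runtime; the second half shows an accumulator that is generic over the native type, so the type is resolved once at construction.

```rust
use std::ops::Add;

/// Hypothetical stand-in for a dynamically typed scalar.
/// This is NOT DataFusion's real `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(Option<i64>),
    Float64(Option<f64>),
}

/// Adds two optional values, treating `None` as "no value yet"
/// (the null-skipping behaviour a sum aggregate typically wants).
fn add_opt<T: Add<Output = T>>(a: Option<T>, b: Option<T>) -> Option<T> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x + y),
        (x, None) => x,
        (None, y) => y,
    }
}

/// Dynamic style: each call matches on both type tags, and one arm is
/// needed per supported type. This is the code the issue wants to shrink.
fn add_dynamic(lhs: &Scalar, rhs: &Scalar) -> Result<Scalar, String> {
    match (lhs, rhs) {
        (Scalar::Int64(a), Scalar::Int64(b)) => Ok(Scalar::Int64(add_opt(*a, *b))),
        (Scalar::Float64(a), Scalar::Float64(b)) => Ok(Scalar::Float64(add_opt(*a, *b))),
        _ => Err(format!("type mismatch: {lhs:?} + {rhs:?}")),
    }
}

/// Specialized style: the accumulator is generic over the native type,
/// so no per-value type dispatch happens inside the hot loop.
struct SumAccumulator<T> {
    sum: Option<T>,
}

impl<T: Copy + Add<Output = T>> SumAccumulator<T> {
    fn new() -> Self {
        Self { sum: None }
    }

    /// Folds a batch of nullable values into the running sum, skipping nulls.
    fn update_batch(&mut self, values: &[Option<T>]) {
        for v in values.iter().flatten() {
            self.sum = Some(match self.sum {
                Some(s) => s + *v,
                None => *v,
            });
        }
    }

    /// Returns the sum so far, or `None` if only nulls were seen.
    fn evaluate(&self) -> Option<T> {
        self.sum
    }
}
```

With this shape, a planner would choose `SumAccumulator::<i64>` or `SumAccumulator::<f64>` once from the input data type, and the scalar `match` disappears from the aggregation path. In a real implementation the inner loop would operate on arrow arrays rather than `&[Option<T>]`.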


#4973 tracks improving the aggregator performance.

alamb commented 9 months ago

I think we have made substantial progress on this issue -- what is left to do?

tustvold commented 9 months ago

IIRC there are some aggregates, like first and last, that are not yet specialized.
