apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Use arrow compute to determine min/max of dictionaries (possibly other arrays?) #42981

Open asfimport opened 3 years ago

asfimport commented 3 years ago

parquet::Comparator is currently used to calculate the min and max values of an array. It should be benchmarked against arrow::compute's MinMax kernel (once that kernel supports all necessary data types). The kernel should use SIMD more aggressively, which should result in better performance.

Even if there is no performance difference, the MinMax kernel should be used when computing dictionary statistics, because the current implementation requires making a copy of the dictionary values array (see ARROW-12513).
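To make the copy concern concrete, here is a minimal standard-library sketch of computing the min and max of a dictionary-encoded column without materializing the referenced values. It is not the Parquet implementation and not the arrow::compute MinMax kernel. `DictionaryMinMax` is a hypothetical helper, the dictionary and indices are assumed to be plain `std::vector`s, and nulls are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical helper (not an Arrow or Parquet API): returns {min, max} over
// the dictionary entries that are actually referenced by `indices`, or
// std::nullopt when no entry is referenced. Requires C++17.
template <typename T>
std::optional<std::pair<T, T>> DictionaryMinMax(
    const std::vector<T>& dictionary, const std::vector<int32_t>& indices) {
  // Pass 1: mark which dictionary slots are referenced. This costs one bit
  // per dictionary entry instead of a copy of the referenced values.
  std::vector<bool> referenced(dictionary.size(), false);
  for (int32_t idx : indices) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < dictionary.size());
    referenced[static_cast<std::size_t>(idx)] = true;
  }

  // Pass 2: scan the dictionary in place and fold the referenced entries
  // into a running {min, max} pair.
  std::optional<std::pair<T, T>> result;
  for (std::size_t i = 0; i < dictionary.size(); ++i) {
    if (!referenced[i]) continue;
    const T& value = dictionary[i];
    if (!result) {
      result = std::make_pair(value, value);
    } else {
      if (value < result->first) result->first = value;
      if (result->second < value) result->second = value;
    }
  }
  return result;
}
```

Unreferenced dictionary entries do not influence the result, so the statistics describe only the values present in the column. Whether a compute kernel would beat this kind of loop is exactly what the benchmark proposed above would need to show.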

Reporter: Weston Pace / @westonpace

Note: This issue was originally created as PARQUET-2068. Please see the migration documentation for further details.

asfimport commented 3 years ago

Weston Pace / @westonpace: I'm not sure if this is a better fit for the Arrow project. I filed it under Parquet simply because the current implementation lives in parquet/... and not arrow/... or parquet/arrow/...