apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.65k stars 3.55k forks

[C++] Add optional transpose map parameter to arrow::compute::DictionaryEncode #44665

Open anjakefala opened 2 weeks ago

anjakefala commented 2 weeks ago

Describe the enhancement requested

dictionary_encode takes an Array as input and converts it into a DictionaryArray. However, it does not let you set the encoding.

Users then need to call the Transpose API afterwards to get the encoding they want, a step that involves a copy.
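
To make the two-step workflow concrete, here is a minimal plain-Python sketch of "encode, then transpose". It does not call Arrow: the helpers `dictionary_encode` and `transpose`, and the assumption that the dictionary is built in first-appearance order, are illustrative stand-ins only.

```python
def dictionary_encode(values):
    """Return (indices, dictionary), building the dictionary in
    first-appearance order (an assumption made for this sketch)."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


def transpose(indices, transpose_map):
    """Remap every index through transpose_map. This builds a new
    index list, which is the extra copy mentioned above."""
    return [transpose_map[i] for i in indices]


values = ["b", "a", "b", "c"]
indices, dictionary = dictionary_encode(values)

# The dictionary order the caller actually wants (hypothetical).
target_dictionary = ["a", "b", "c"]

# transpose_map[old_index] gives the index of the same value in the target.
transpose_map = [target_dictionary.index(v) for v in dictionary]
new_indices = transpose(indices, transpose_map)
```

The second pass over the indices is the cost in question: if the encoder accepted the map (or the target dictionary) directly, it could emit `new_indices` in one pass.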

Having both of these APIs is great, but is there anything blocking us from extending dictionary_encode so that it accepts a transpose map?

Component(s)

C++

zeroshade commented 2 weeks ago

The transpose_map is typically created by a DictionaryUnifier. What kind of "encoding" are you referring to here? In general, you shouldn't need a transpose_map when you are first encoding an array.

If you have multiple arrays that you want to dictionary encode with the same dictionary, then you can just create a ChunkedArray and call dictionary_encode on that, right? You only need to transpose indices when you already have dictionaries and want to unify multiple dictionary arrays to all use the same underlying dictionary.
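
For illustration, the unification case can be sketched in plain Python. This is a conceptual model of what a dictionary unifier produces, not Arrow's DictionaryUnifier API; the `unify` helper and the sample data are hypothetical.

```python
def unify(dictionaries):
    """Merge several dictionaries into one and return, for each input,
    a transpose map from its old indices to the unified indices."""
    unified = []
    positions = {}
    transpose_maps = []
    for dictionary in dictionaries:
        transpose_map = []
        for value in dictionary:
            if value not in positions:
                positions[value] = len(unified)
                unified.append(value)
            transpose_map.append(positions[value])
        transpose_maps.append(transpose_map)
    return unified, transpose_maps


# Two arrays that were already dictionary-encoded independently.
dict_a, idx_a = ["x", "y"], [0, 1, 0]
dict_b, idx_b = ["y", "z"], [1, 0]

unified, maps = unify([dict_a, dict_b])

# Transposing rewrites each index array against the shared dictionary.
new_idx_a = [maps[0][i] for i in idx_a]
new_idx_b = [maps[1][i] for i in idx_b]
```

In this model the transpose maps only exist because the dictionaries already existed, which is why one is not normally available at first encoding.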

What's the use case in which you'd already have a transpose map to pass to dictionary_encode?