Closed alamb closed 4 months ago
I'm working on this, can you assign me @alamb ?
Done!
StringViewArray --> DictionaryArray<IndexType, LargeUtf8> will copy strings twice.
The current implementation is in https://github.com/XiangpengHao/arrow-rs/blob/view-to-dict/arrow-cast/src/cast/dictionary.rs#L259-L283, which first casts the StringViewArray to a StringArray and then inserts the values into the dictionary one by one. A better approach is probably to insert the items of the StringViewArray directly into the dictionary array.
> A better approach is probably to directly insert items in StringViewArray to dictionary array.
Yes, I agree this is likely. I suggest using https://docs.rs/arrow/latest/arrow/array/builder/struct.GenericByteDictionaryBuilder.html directly.
That will also deduplicate the values and result in the smallest dictionary possible. The downside is that inserting each value requires hashing a string.
I could imagine a faster implementation that keeps a map from u128 (aka the view) to the in-process dictionary values, which would be faster (hashing a u128 rather than a &[u8]). The downside is that this wouldn't catch values where the views pointed to different bytes that happened to have the same value (e.g. "foofoo" if there were two views, one pointing to the first "foo" and one to the second "foo").
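The trade-off between the two keying strategies can be sketched in plain Rust, without the arrow crate. This is only an illustration under stated assumptions: `make_view`, `encode_by_bytes`, and `encode_by_view` are hypothetical helpers, not arrow-rs APIs, and the view is a simplified model of the 16-byte layout (length, then either up to 12 inlined bytes, or a 4-byte prefix plus buffer index and offset). Because short values are inlined in that layout, the missed-duplicate case shows up for values longer than 12 bytes, so the sketch uses a 20-byte string.

```rust
use std::collections::HashMap;

/// Pack a simplified 16-byte view into a u128 (hypothetical helper, not the arrow-rs API).
/// Values of at most 12 bytes are inlined; longer values store a 4-byte prefix,
/// a buffer index, and an offset into that buffer.
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if data.len() <= 12 {
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        bytes[4..8].copy_from_slice(&data[0..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// A value stored in the shared data buffer: (offset, length).
type Span = (usize, usize);

/// Deduplicate by hashing the string bytes. Returns (keys, dictionary values).
fn encode_by_bytes(buffer: &[u8], spans: &[Span]) -> (Vec<usize>, Vec<Vec<u8>>) {
    let mut seen: HashMap<&[u8], usize> = HashMap::new();
    let mut values: Vec<Vec<u8>> = Vec::new();
    let mut keys = Vec::with_capacity(spans.len());
    for &(offset, len) in spans {
        let bytes = &buffer[offset..offset + len];
        let key = *seen.entry(bytes).or_insert_with(|| {
            values.push(bytes.to_vec());
            values.len() - 1
        });
        keys.push(key);
    }
    (keys, values)
}

/// Deduplicate by hashing the u128 view only. Cheaper per insert, but two views
/// that point at different offsets holding equal bytes are treated as distinct.
fn encode_by_view(buffer: &[u8], spans: &[Span]) -> (Vec<usize>, Vec<Vec<u8>>) {
    let mut seen: HashMap<u128, usize> = HashMap::new();
    let mut values: Vec<Vec<u8>> = Vec::new();
    let mut keys = Vec::with_capacity(spans.len());
    for &(offset, len) in spans {
        let bytes = &buffer[offset..offset + len];
        let view = make_view(bytes, 0, offset as u32);
        let key = *seen.entry(view).or_insert_with(|| {
            values.push(bytes.to_vec());
            values.len() - 1
        });
        keys.push(key);
    }
    (keys, values)
}

fn main() {
    // The same 20-byte string stored twice, back to back, in one buffer.
    let s = b"a-long-string-value!";
    let mut buffer = Vec::new();
    buffer.extend_from_slice(s);
    buffer.extend_from_slice(s);
    // Three rows: first copy, second copy, first copy again.
    let spans = [(0, s.len()), (s.len(), s.len()), (0, s.len())];

    let (keys, values) = encode_by_bytes(&buffer, &spans);
    println!("by bytes: keys {:?}, {} dictionary value(s)", keys, values.len());

    let (keys, values) = encode_by_view(&buffer, &spans);
    println!("by view:  keys {:?}, {} dictionary value(s)", keys, values.len());
}
```

Keying on bytes yields a single dictionary entry for all three rows, while keying on the view yields two entries because the second copy lives at a different offset. A real implementation could accept the larger dictionary or fall back to comparing bytes when the views differ.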
label_issue.py automatically added labels {'arrow'} from #5872
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is part of the larger project to implement StringViewArray -- see https://github.com/apache/arrow-rs/issues/5374
In https://github.com/apache/arrow-rs/issues/5508, @RinChanNOWWW tracked adding casting to/from StringArray 🙏 ❤️
This ticket tracks adding additional data type support for StringViewArray and ByteViewArray in the cast kernel: https://docs.rs/arrow/latest/arrow/compute/kernels/cast/index.html
Many systems (e.g. InfluxDB 3.0, Apache DataFusion Comet, and I think Coralogix) use DictionaryArrays. Thus supporting casting to/from DictionaryArray will be important to permit easy integration into downstream consumers.
Describe the solution you'd like
Specifically the following conversions should be supported in the cast kernels:
StringViewArray <--> DictionaryArray<IndexType, Utf8>
StringViewArray <--> DictionaryArray<IndexType, LargeUtf8>
And similarly for Binary:
BinaryViewArray <--> DictionaryArray<IndexType, Binary>
BinaryViewArray <--> DictionaryArray<IndexType, LargeBinary>
Notes:
DictionaryArray<IndexType, LargeUtf8> --> StringViewArray can be implemented without copying strings
StringViewArray --> DictionaryArray<IndexType, LargeUtf8> will likely require copying the strings
Describe alternatives you've considered
I think casting from Dictionary
Additional context
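The note above that DictionaryArray<IndexType, LargeUtf8> --> StringViewArray needs no string copies can be illustrated with a small plain-Rust sketch. It does not use the arrow crate: `make_view`, `resolve`, and `dictionary_to_views` are hypothetical helpers over a simplified model of the 16-byte view layout, and the sketch assumes every offset fits in a view's 32-bit offset field. The idea is that the dictionary's value buffer is reused as the view data buffer, so only one 16-byte view is written per row.

```rust
/// Pack a simplified 16-byte view into a u128 (hypothetical helper, not the arrow-rs API).
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if data.len() <= 12 {
        // Short values are inlined in the view itself.
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        bytes[4..8].copy_from_slice(&data[0..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Read back the bytes a view refers to, given the buffer it points into.
fn resolve(view: u128, buffer: &[u8]) -> Vec<u8> {
    let bytes = view.to_le_bytes();
    let len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
    if len <= 12 {
        bytes[4..4 + len].to_vec()
    } else {
        let offset =
            u32::from_le_bytes([bytes[12], bytes[13], bytes[14], bytes[15]]) as usize;
        buffer[offset..offset + len].to_vec()
    }
}

/// Build one view per dictionary key. The dictionary's value buffer is reused as
/// view data buffer 0, so no string bytes are copied into a new data buffer.
fn dictionary_to_views(values: &[u8], offsets: &[i64], keys: &[usize]) -> Vec<u128> {
    keys.iter()
        .map(|&k| {
            let start = offsets[k] as usize;
            let end = offsets[k + 1] as usize;
            // Assumption: `start` fits in 32 bits. A real implementation would have
            // to handle value buffers too large for a view's 32-bit offset.
            make_view(&values[start..end], 0, start as u32)
        })
        .collect()
}

fn main() {
    // Dictionary values laid out LargeUtf8-style: one byte buffer plus i64 offsets.
    let dict_values: [&[u8]; 2] = [b"short", b"a-much-longer-dictionary-value"];
    let mut values = Vec::new();
    let mut offsets: Vec<i64> = vec![0];
    for v in dict_values.iter() {
        values.extend_from_slice(v);
        offsets.push(values.len() as i64);
    }

    // Four rows referencing the two dictionary entries.
    let keys = [1usize, 0, 1, 1];
    let views = dictionary_to_views(&values, &offsets, &keys);
    for view in &views {
        println!("{}", String::from_utf8_lossy(&resolve(*view, &values)));
    }
}
```

Rows that share a dictionary key end up with identical views, which is why this direction is cheap, whereas the reverse direction has to gather the referenced bytes into a contiguous values buffer.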