Open alamb opened 1 week ago
BTW the https://docs.rs/arrow/latest/arrow/array/type.StringDictionaryBuilder.html structure has code to do the deduplication quickly
So one way to implement a combination of gc and deduplication would be to create a DictionaryArray with a GenericByteDictionaryBuilder and then cast back to StringViewArray.
With the code for fast DictionaryArray --> StringViewArray added in https://github.com/apache/arrow-rs/issues/5861, this would only copy the strings once (though it would build up intermediate indexes that maybe could be avoided with a direct approach)
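To make the two-step idea above concrete, here is a minimal std-only sketch. It does not use the arrow crate: `View`, `dictionary_encode`, and `dictionary_to_views` are hypothetical names for a simplified model in which a view is just an (offset, len) range into a single buffer. Real arrow views also inline short strings and carry a buffer index.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a view: a range into one contiguous data buffer.
/// (Hypothetical type, not arrow's actual view layout.)
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    offset: usize,
    len: usize,
}

/// Step 1: dictionary-encode. Each distinct string is copied exactly once into
/// `values`; `keys` holds one dictionary index per row (the intermediate indexes).
fn dictionary_encode(input: &[&str]) -> (Vec<u8>, Vec<View>, Vec<usize>) {
    let mut values: Vec<u8> = Vec::new();
    let mut entries: Vec<View> = Vec::new();
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut keys = Vec::with_capacity(input.len());
    for &s in input {
        let key = *seen.entry(s).or_insert_with(|| {
            entries.push(View { offset: values.len(), len: s.len() });
            values.extend_from_slice(s.as_bytes());
            entries.len() - 1
        });
        keys.push(key);
    }
    (values, entries, keys)
}

/// Step 2: "cast back" to views. No string bytes are copied here; each row
/// just receives the view of its dictionary entry.
fn dictionary_to_views(entries: &[View], keys: &[usize]) -> Vec<View> {
    keys.iter().map(|&k| entries[k]).collect()
}

fn main() {
    let input = ["apple", "banana", "apple", "banana", "cherry"];
    let (values, entries, keys) = dictionary_encode(&input);
    let views = dictionary_to_views(&entries, &keys);
    // Duplicate rows share a view, and the buffer holds each string once.
    println!("buffer bytes: {}, views: {:?}", values.len(), views);
}
```

The `keys` vector is the intermediate index array mentioned above; a direct approach would write the final views while hashing and skip it.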
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As part of implementing `StringView` (https://github.com/apache/arrow-rs/issues/5374), @XiangpengHao implemented `gc`, which compacts all the strings in a StringView/BinaryView into contiguous storage, in https://github.com/apache/arrow-rs/issues/5513.
However, that functionality does not deduplicate/intern the strings; it just copies them over.
Describe the solution you'd like
We should make it easy to deduplicate the strings in a StringView.
I don't think we should change `gc` to do deduplication without an explicit ask (as deduplication is expensive).
Describe alternatives you've considered
- A new function (e.g. `GenericBinaryView::dedupe`) that deduplicates such arrays (likely not moving any strings, just updating views)
- An option on `GenericBinaryView::gc` that controls the behavior (as in, a caller could also specify doing gc)
Additional context
@alexwilcoxson-rel asked in https://github.com/apache/arrow-rs/issues/5904#issuecomment-2174386654
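As a rough illustration of the `dedupe` alternative (updating views without moving any strings), here is a minimal std-only sketch. It is not arrow's API: `View` and `dedupe_views` are hypothetical, and a view is modeled as a plain (offset, len) range into one buffer.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a view: a range into one data buffer.
/// (Hypothetical type, not arrow's actual view layout.)
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    offset: usize,
    len: usize,
}

/// Point every view at the first occurrence of its bytes.
/// The buffer is left untouched, so no string data is moved or copied.
fn dedupe_views(buffer: &[u8], views: &mut [View]) {
    let mut first_seen: HashMap<&[u8], View> = HashMap::new();
    for view in views.iter_mut() {
        let bytes = &buffer[view.offset..view.offset + view.len];
        // Keep the current view if these bytes are new, else reuse the earlier one.
        *view = *first_seen.entry(bytes).or_insert(*view);
    }
}

fn main() {
    // "foo" is stored twice in the buffer.
    let buffer = b"foofoobar";
    let mut views = vec![
        View { offset: 0, len: 3 },
        View { offset: 3, len: 3 },
        View { offset: 6, len: 3 },
    ];
    dedupe_views(buffer, &mut views);
    println!("{:?}", views);
}
```

In this model the duplicate bytes stay in the buffer after deduplication; reclaiming them would take a separate compaction pass, which is why an option combining both behaviors is also listed.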