apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.31k stars 678 forks source link

Consider implementing some sort of `deduplicate` / `intern` functionality for StringView #5910

Open alamb opened 1 week ago

alamb commented 1 week ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. Part of implementing StringView https://github.com/apache/arrow-rs/issues/5374

@XiangpengHao implemented gc which compacts all the strings in a StringView/BinaryView into contiguous storage in https://github.com/apache/arrow-rs/issues/5513

However, that functionality does not deduplicate/intern the strings -- it just copies them over

Describe the solution you'd like

We should make it easy to deduplicate the strings in a StringView.

I do think we should change gc to do deduplication without an explict as (as deduplication is expensive)

Describe alternatives you've considered

  1. Do nothing (users can implement their own version of this code without any addtional apis)
  2. Add a new function (e.g. GenericBinaryView::dedupe) that deduplicated such arrays (likely not moving any strings, but just updating views)
  3. Add an argument to GenericBinaryView::gc that controlled the behavior (as in could also specify doing gc)

Additional context @alexwilcoxson-rel asked in https://github.com/apache/arrow-rs/issues/5904#issuecomment-2174386654

Can/will this incorporate deduping/interning/implicitly using the gc function that landed recently?

alamb commented 1 week ago

BTW the https://docs.rs/arrow/latest/arrow/array/type.StringDictionaryBuilder.html structure has code to do the deduplication quickly

So one way to implement a combination of gc and deduplication would be to create a DictionaryArray with a GenericByteDictionaryBuilder and then cast back to StringViewArray

With the code for fast DictionaryArray --> StringViewArray added in https://github.com/apache/arrow-rs/issues/5861, this would only copy the strings once (though it would build up intermediate indexes that maybe could be avoided with a direct approach)