apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

`cast` kernel support for `StringViewArray` and `BinaryViewArray` `<-->` `DictionaryArray` #5861

Closed alamb closed 4 months ago

alamb commented 4 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

This is part of the larger project to implement StringViewArray -- see https://github.com/apache/arrow-rs/issues/5374

In https://github.com/apache/arrow-rs/issues/5508, @RinChanNOWWW tracked adding casting to/from StringArray 🙏 ❤️

This ticket tracks adding additional data type support for StringViewArray and BinaryViewArray in the cast kernel: https://docs.rs/arrow/latest/arrow/compute/kernels/cast/index.html

Many systems (e.g. InfluxDB 3.0, Apache DataFusion Comet, and I think Coralogix) use DictionaryArrays. Supporting casts to and from DictionaryArray will therefore be important for easy integration into downstream consumers.

Describe the solution you'd like

Specifically the following conversions should be supported in the cast kernels:

And similarly for Binary:

Notes:

  1. Good test coverage is the most important part of this ticket
  2. I recommend smaller PRs if possible
  3. I think DictionaryArray<IndexType, LargeUtf8> --> StringViewArray can be implemented without copying strings
  4. I think StringViewArray --> DictionaryArray<IndexType, LargeUtf8> will likely require copying the strings
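
Note 3 relies on the fact that a view can reference bytes in an existing buffer instead of owning them. The sketch below is a simplified, standard-library-only model of that idea, not the arrow-rs implementation: `Dictionary`, `make_view`, and `dictionary_to_views` are hypothetical names, nulls are ignored, and all values are assumed to sit in a single buffer (index 0) with offsets that fit in `u32`. The 16-byte layout follows the Arrow view format: a 4-byte length, then either up to 12 inline bytes or a 4-byte prefix, buffer index, and offset.

```rust
/// Simplified model of a dictionary-encoded string column (no nulls).
struct Dictionary {
    /// One key per row, indexing into the distinct values.
    keys: Vec<usize>,
    /// Value `i` occupies `data[offsets[i]..offsets[i + 1]]`.
    offsets: Vec<usize>,
    /// Concatenated bytes of all distinct values.
    data: Vec<u8>,
}

/// Encode one 16-byte view in the Arrow view layout (little endian).
fn make_view(value: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    if value.len() <= 12 {
        // Short values are stored inline, zero padded.
        bytes[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Long values store a 4-byte prefix plus a reference into a buffer.
        bytes[4..8].copy_from_slice(&value[0..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Build one view per row. The views refer to `dict.data`, so the value
/// buffer can be shared with the output rather than copied.
fn dictionary_to_views(dict: &Dictionary) -> Vec<u128> {
    // One view per distinct value ...
    let value_views: Vec<u128> = dict
        .offsets
        .windows(2)
        .map(|w| make_view(&dict.data[w[0]..w[1]], 0, w[0] as u32))
        .collect();
    // ... then each row only copies a 16-byte view.
    dict.keys.iter().map(|&k| value_views[k]).collect()
}
```

In this model only values of 12 bytes or fewer are copied (into the view itself); longer values stay in the buffer the dictionary already owns.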

Describe alternatives you've considered

I think casting from Dictionary

Additional context

XiangpengHao commented 4 months ago

I'm working on this, can you assign me @alamb ?

alamb commented 4 months ago

> I'm working on this, can you assign me @alamb ?

Done!

XiangpengHao commented 4 months ago

StringViewArray --> DictionaryArray<IndexType, LargeUtf8> will copy strings twice. The current implementation is in https://github.com/XiangpengHao/arrow-rs/blob/view-to-dict/arrow-cast/src/cast/dictionary.rs#L259-L283, which first casts the StringViewArray to a StringArray and then inserts the values into the dictionary one by one. A better approach is probably to directly insert items in StringViewArray to dictionary array.

alamb commented 4 months ago

> A better approach is probably to directly insert items in StringViewArray to dictionary array.

Yes I agree this is likely -- I suggest using https://docs.rs/arrow/latest/arrow/array/builder/struct.GenericByteDictionaryBuilder.html directly

That will also deduplicate the values and result in the smallest dictionary possible. The downside is that inserting each value will require hashing a string.
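
The trade-off described above can be sketched with the standard library alone. This is a toy model of deduplicating dictionary encoding, not the `GenericByteDictionaryBuilder` API: `dictionary_encode` is a hypothetical name, values are plain `&str`, and nulls are ignored. Each input value is hashed by its bytes, which is what keeps the dictionary minimal and also what makes each insert cost a string hash.

```rust
use std::collections::HashMap;

/// Dictionary-encode `values`, returning (keys, distinct values).
/// Every lookup hashes the full string, so equal strings always
/// share one dictionary entry.
fn dictionary_encode<'a>(values: &[&'a str]) -> (Vec<u32>, Vec<&'a str>) {
    let mut index: HashMap<&'a str, u32> = HashMap::new();
    let mut dict: Vec<&'a str> = Vec::new();
    let mut keys = Vec::with_capacity(values.len());
    for &v in values {
        // Reuse the existing key, or append a new dictionary entry.
        let key = *index.entry(v).or_insert_with(|| {
            dict.push(v);
            (dict.len() - 1) as u32
        });
        keys.push(key);
    }
    (keys, dict)
}
```

Feeding the values of a view array straight into a loop like this avoids the intermediate StringArray, so each string is copied at most once (when it is first added to the dictionary).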

I could imagine a faster implementation that keeps a map from u128 (aka the view) to the in-progress dictionary values (hashing a u128 rather than a &[u8]). The downside is that this wouldn't catch values where the views point to different bytes that happen to have the same value (e.g. a buffer "foofoo" with two views, one pointing to the first "foo" and one to the second "foo").
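
A minimal sketch of that idea and its caveat, again using only the standard library rather than arrow-rs: `make_view` and `encode_by_view` are hypothetical helpers, and the view layout is the Arrow one (4-byte length, then up to 12 inline bytes, or a 4-byte prefix, buffer index, and offset). Because values of 12 bytes or fewer are inlined in the view, the missed-duplicate case in this model only arises for longer strings.

```rust
use std::collections::HashMap;

/// Encode one 16-byte view in the Arrow view layout (little endian).
fn make_view(value: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    if value.len() <= 12 {
        // Short values are stored inline, zero padded.
        bytes[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Long values store a prefix plus a (buffer, offset) reference.
        bytes[4..8].copy_from_slice(&value[0..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Dictionary-encode by hashing the raw view instead of the string bytes.
/// Returns (keys, distinct views). Cheap to hash, but two views that point
/// at equal bytes in different locations get separate dictionary entries.
fn encode_by_view(views: &[u128]) -> (Vec<u32>, Vec<u128>) {
    let mut index: HashMap<u128, u32> = HashMap::new();
    let mut dict = Vec::new();
    let mut keys = Vec::with_capacity(views.len());
    for &v in views {
        let key = *index.entry(v).or_insert_with(|| {
            dict.push(v);
            (dict.len() - 1) as u32
        });
        keys.push(key);
    }
    (keys, dict)
}
```

If the same long string is stored twice in a buffer, the two views differ only in their offset field, so `encode_by_view` produces two dictionary entries where byte-wise hashing would produce one.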

alamb commented 3 months ago

label_issue.py automatically added labels {'arrow'} from #5872