apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Potential performance improvements for reading Parquet to StringViewArray/BinaryViewArray #5904

Open alamb opened 2 weeks ago

alamb commented 2 weeks ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In https://github.com/apache/arrow-rs/issues/ @ariesdevil, @XiangpengHao, and I implemented pretty fast reading of Parquet data into Arrow StringViewArray.

The solution we have so far is https://github.com/apache/arrow-rs/pull/5877, which doesn't copy the string data 🎉 but does track a set of offsets which are then converted into a StringViewArray.

@ariesdevil had a more comprehensive approach in https://github.com/apache/arrow-rs/pull/5557 that built the StringViews directly from the encoded data, but it hadn't yet removed the string copies.
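For readers unfamiliar with the layout, here is a minimal sketch of what "building the views directly from the encoded data without copying" could look like. It is not the code from either PR. It assumes a PLAIN-encoded BYTE_ARRAY page body (each value is a 4-byte little-endian length followed by the bytes) and the Arrow view layout (16 bytes: a 4-byte length, then either up to 12 inline bytes, or a 4-byte prefix, a 4-byte buffer index, and a 4-byte offset). The names `make_view`, `views_from_plain`, and `view_value` are made up for illustration, and the sketch does no validation of truncated pages.

```rust
/// Values of at most this many bytes are stored inline in the view itself.
const MAX_INLINE_LEN: usize = 12;

/// Build one 16-byte view (held as a little-endian u128) for the value at
/// `data[offset..offset + len]`, where `data` is data buffer `buffer_index`.
fn make_view(data: &[u8], buffer_index: u32, offset: u32, len: u32) -> u128 {
    let start = offset as usize;
    let value = &data[start..start + len as usize];
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if value.len() <= MAX_INLINE_LEN {
        // Short value: the bytes live inside the view, padded with zeros.
        bytes[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Long value: 4-byte prefix, then a pointer (buffer index + offset).
        bytes[4..8].copy_from_slice(&value[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Walk a PLAIN-encoded page body and emit one view per value. The page
/// itself is reused as the data buffer, so long values are never copied.
fn views_from_plain(page: &[u8], buffer_index: u32) -> Vec<u128> {
    let mut views = Vec::new();
    let mut pos = 0usize;
    while pos + 4 <= page.len() {
        let len = u32::from_le_bytes([page[pos], page[pos + 1], page[pos + 2], page[pos + 3]]);
        pos += 4;
        views.push(make_view(page, buffer_index, pos as u32, len));
        pos += len as usize;
    }
    views
}

/// Resolve a view back to its bytes (only used to check the sketch).
fn view_value(view: u128, buffers: &[&[u8]]) -> Vec<u8> {
    let bytes = view.to_le_bytes();
    let len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
    if len <= MAX_INLINE_LEN {
        bytes[4..4 + len].to_vec()
    } else {
        let buffer_index = u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize;
        let offset = u32::from_le_bytes([bytes[12], bytes[13], bytes[14], bytes[15]]) as usize;
        buffers[buffer_index][offset..offset + len].to_vec()
    }
}
```

The point of the sketch is that the only per-value work is writing 16 bytes; there is no intermediate offsets array to convert afterwards, and long values stay in the page buffer.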

Describe the solution you'd like

It may be worth looking at the StringViewDecoding to see if there is more performance to be had.

Specifically, we can use the arrow_array_reader/StringViewArray and related benchmarks to profile and identify any additional potential improvements.

Describe alternatives you've considered

It may be good enough now.

Additional context

alexwilcoxson-rel commented 1 week ago

Can/will this incorporate deduping/interning/implicitly using the gc function that landed recently?

XiangpengHao commented 1 week ago

> Can/will this incorporate deduping/interning/implicitly using the gc function that landed recently?

The current gc function doesn't deduplicate strings; it only uses GenericByteViewBuilder to create a new instance of the array. I think implementing the deduplication logic would be a great addition. A straightforward approach is to use a hash table to track the location of the strings while building the GenericByteView. It is not at the top of my priority list, but I might give it a try when I have time.
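As a rough illustration of the hash table idea, here is a minimal sketch using only the standard library. `DedupViewBuilder` is a hypothetical stand-in, not arrow-rs's GenericByteViewBuilder, and it assumes a single data buffer (index 0) and the standard 16-byte view layout with values of 12 bytes or fewer stored inline.

```rust
use std::collections::HashMap;

/// Values of at most this many bytes are stored inline in the view itself.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical builder that stores each distinct long value only once.
struct DedupViewBuilder {
    /// Single data buffer holding the bytes of long values (buffer index 0).
    buffer: Vec<u8>,
    /// One 16-byte view per appended value.
    views: Vec<u128>,
    /// Long value -> offset of its first occurrence in `buffer`.
    seen: HashMap<Vec<u8>, u32>,
}

impl DedupViewBuilder {
    fn new() -> Self {
        DedupViewBuilder {
            buffer: Vec::new(),
            views: Vec::new(),
            seen: HashMap::new(),
        }
    }

    fn append(&mut self, value: &[u8]) {
        let mut bytes = [0u8; 16];
        bytes[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
        if value.len() <= MAX_INLINE_LEN {
            // Inline values carry their own bytes, so there is nothing to dedup.
            bytes[4..4 + value.len()].copy_from_slice(value);
        } else {
            // Reuse the offset of an identical value if we have seen one.
            let existing = self.seen.get(value).copied();
            let offset = match existing {
                Some(offset) => offset,
                None => {
                    let offset = self.buffer.len() as u32;
                    self.buffer.extend_from_slice(value);
                    self.seen.insert(value.to_vec(), offset);
                    offset
                }
            };
            bytes[4..8].copy_from_slice(&value[..4]);
            // bytes[8..12] stay zero: buffer index 0.
            bytes[12..16].copy_from_slice(&offset.to_le_bytes());
        }
        self.views.push(u128::from_le_bytes(bytes));
    }
}
```

Appending the same long string twice yields two identical views and grows `buffer` only once. Note that this sketch keeps a second copy of every long value as the hash map key; a real implementation would more likely hash the bytes and compare against the data buffer to avoid that overhead.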

alamb commented 1 week ago

> Can/will this incorporate deduping/interning/implicitly using the gc function that landed recently?

I filed https://github.com/apache/arrow-rs/issues/5910 to track discussion of this option.

XiangpengHao commented 4 days ago

take