Open alamb opened 1 month ago
One thing I have noticed during implementations is that some functions such as ltrim / rtrim / btrim could be more efficient if they produced Utf8View as output in addition to accepting it as input.
For example, in https://github.com/apache/datafusion/pull/11920#discussion_r1713618503 from @Kev1n8 it is actually probably a good idea to always generate StringView as output (rather than StringArray) as it could avoid a copy.
I am thinking that once the string functions support StringView as input, we can do a second pass and optimize some functions so they produce StringView as output.
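To make the "avoid a copy" point concrete, here is a minimal sketch using only the Rust standard library. `ToyView`, `ltrim_view`, and `ltrim_owned` are hypothetical names invented for illustration. This is a toy model of a view (a shared buffer plus an offset and length), not Arrow's actual StringViewArray layout and not DataFusion's implementation.

```rust
use std::sync::Arc;

/// Toy stand-in for a string view: a shared buffer plus an (offset, len) window.
/// Hypothetical type for illustration only; not Arrow's real view layout.
#[derive(Clone, Debug)]
struct ToyView {
    buffer: Arc<str>,
    offset: usize,
    len: usize,
}

impl ToyView {
    fn new(s: &str) -> Self {
        ToyView { buffer: Arc::from(s), offset: 0, len: s.len() }
    }

    fn as_str(&self) -> &str {
        &self.buffer[self.offset..self.offset + self.len]
    }

    /// ltrim that returns another view: only offset/len change, no string bytes are copied.
    fn ltrim_view(&self) -> ToyView {
        let s = self.as_str();
        let trimmed = s.trim_start_matches(' ');
        let skipped = s.len() - trimmed.len();
        ToyView {
            buffer: Arc::clone(&self.buffer), // bumps a refcount, shares the bytes
            offset: self.offset + skipped,
            len: trimmed.len(),
        }
    }

    /// ltrim that returns an owned String: the remaining bytes are copied.
    fn ltrim_owned(&self) -> String {
        self.as_str().trim_start_matches(' ').to_string()
    }
}

fn main() {
    let input = ToyView::new("   datafusion");
    let view = input.ltrim_view();
    let owned = input.ltrim_owned();

    assert_eq!(view.as_str(), "datafusion");
    assert_eq!(owned, "datafusion");
    // The view output still points at the input's buffer.
    assert!(Arc::ptr_eq(&input.buffer, &view.buffer));
    println!("{} (offset {}, len {})", view.as_str(), view.offset, view.len);
}
```

Both functions return the same text, but the view variant only adjusts the window into the existing buffer, which is the kind of saving a view-producing trim could offer.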
Inspired by @Omega359's great PR https://github.com/apache/datafusion/pull/11941, I have some suggestions on testing Utf8View support for functions:
Although most of the implementation is adapted from the existing implementation, execution takes a different path, so I think comprehensive end-to-end tests are still needed.
The good news is that sqllogictests already exist for the original string functions, so the only thing to do is to duplicate the existing tests with Utf8View.
Here are examples of how to adapt existing test cases for Utf8View input:
arrow_cast() like https://github.com/apache/datafusion/pull/11941/files#diff-51757b2b1d0a07b88551d88eabeba7f74e11b5217e44203ac7c6f613c0221196
Utf8View column like https://github.com/apache/datafusion/blob/2cf09566af7d7d5f83a8bdff5f0adda97d40deee/datafusion/sqllogictest/test_files/string_view.slt#L30-L42
We are making pretty good progress here -- just a few more functions left 🚀
Is your feature request related to a problem or challenge?
We are working to add complete StringView support in DataFusion, which permits potentially much faster processing of string data. See https://github.com/apache/datafusion/issues/10918 for more background.
Today, most DataFusion string functions support DataType::Utf8 and DataType::LargeUtf8, and when called with a StringView argument DataFusion will cast the argument back to DataType::Utf8, which is expensive.
To realize the full speed of StringView, we need to ensure that all string functions support DataType::Utf8View directly.
Describe the solution you'd like
Port all string functions
Describe alternatives you've considered
No response
Additional context
See coordination plan with @tshauck and myself here: https://github.com/apache/datafusion/pull/11787#discussion_r1702294173