Open tshauck opened 3 months ago
take
Note that I just saw that https://github.com/apache/arrow-rs/issues/6370 was merged into arrow-rs, which may be relevant to this work.
Filed upstream ticket https://github.com/apache/arrow-rs/issues/6717 to have regexp_match updated to support StringViewArray. In the meantime, we either wait for a release with that update or implement it in DF and remove it once it is available upstream.
Part of https://github.com/apache/datafusion/issues/11752 and https://github.com/apache/datafusion/issues/11790
Currently, a call to REGEXP_MATCH with a Utf8View argument induces a cast. After the change that fixes this issue, it should not.
REGEXP_MATCH is defined here: https://github.com/apache/datafusion/blob/main/datafusion/functions/src/regex/regexpmatch.rs
Casting tests are in: https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/string_view.slt
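One way to observe the cast described above is to run EXPLAIN on a REGEXP_MATCH call in datafusion-cli. This is only a sketch: the table name t, the sample value, and the pattern are invented for illustration, and the plan output is omitted because it depends on the DataFusion version.

```sql
-- Hypothetical table with a single Utf8View column (named column1 by default).
create table t as values (arrow_cast('foobar', 'Utf8View'));

-- Inspect the plan: a CAST of column1 to Utf8 around the argument means
-- REGEXP_MATCH is not consuming Utf8View natively.
explain select regexp_match(column1, '^foo') from t;
```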
Is your feature request related to a problem or challenge?
We are working to add complete StringView support in DataFusion, which permits potentially much faster processing of string data. See https://github.com/apache/datafusion/issues/10918 for more background.
Today, most DataFusion string functions support DataType::Utf8 and DataType::LargeUtf8. When called with a StringView argument, DataFusion casts the argument back to DataType::Utf8, which is expensive.
To realize the full speed of StringView, we need to ensure that all string functions support DataType::Utf8View directly.
Describe the solution you'd like
Update the function to support DataType::Utf8View directly
Describe alternatives you've considered
The typical steps are:
1. Add tests to string_view.slt to ensure the arguments are not being cast
2. Update the Signature of the function to accept Utf8View in addition to Utf8/LargeUtf8
3. Update the implementation of the function to operate on Utf8View
Example PRs
Additional context
The documentation of string functions can be found here: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#string-functions
To test a function with StringView using datafusion-cli, you can use an example like this (replacing starts_with with the relevant function).
To see if it is using Utf8View, use EXPLAIN to see the plan and verify there is no CAST. In this example, the CAST(column1@0 AS Utf8) indicates that the function is not using Utf8View natively.
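The two checks described above might look like the following datafusion-cli session. This is a minimal sketch, not the original example from the issue: the table name foo and its values are assumptions, and the query output is left out because the exact plan text varies between DataFusion versions.

```sql
-- Hypothetical table with two Utf8View columns (column1, column2 by default).
create table foo as values
  (arrow_cast('foo', 'Utf8View'), arrow_cast('bar', 'Utf8View'));

-- Run the function on Utf8View inputs
-- (replace starts_with with the function under test).
select starts_with(column1, column2) from foo;

-- Check the plan: a CAST(... AS Utf8) on either argument means the
-- function does not yet handle Utf8View natively.
explain select starts_with(column1, column2) from foo;
```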
It is also often good to test with a constant (likewise, there should be no cast):
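A constant-argument check could be sketched as follows. The table name foo_const and the literal 'fo' are assumptions made for illustration; as before, substitute the function under test for starts_with.

```sql
-- Hypothetical single-column Utf8View table.
create table foo_const as values (arrow_cast('foo', 'Utf8View'));

-- The second argument is a string constant; the plan should still
-- show no CAST of column1 to Utf8.
explain select starts_with(column1, 'fo') from foo_const;
```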