apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

[Epic] Native `StringView` support for string functions #11790

Open alamb opened 1 month ago

alamb commented 1 month ago

Is your feature request related to a problem or challenge?

We are working to add complete StringView support in DataFusion, which permits potentially much faster processing of string data. See https://github.com/apache/datafusion/issues/10918 for more background.

Today, most DataFusion string functions support only DataType::Utf8 and DataType::LargeUtf8. When one is called with a StringView argument, DataFusion casts the argument back to DataType::Utf8, which is expensive.

To realize the full speed of StringView, we need to ensure that all string functions support DataType::Utf8View directly.
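
To make the cost of that cast concrete, here is a minimal sketch in plain Rust. It does not use the arrow crate: `ViewArray`, `OffsetArray`, and `cast_to_offsets` are hypothetical stand-ins for a Utf8View array, a Utf8 array, and the cast between them. Real Arrow views are 16-byte structs that inline short strings, and that detail is left out here. The point is only that going from views back to offsets has to copy the bytes of every value.

```rust
/// Hypothetical stand-in for a Utf8View array: each value is an
/// (offset, len) view into one shared byte buffer.
struct ViewArray {
    buffer: Vec<u8>,
    views: Vec<(usize, usize)>,
}

/// Hypothetical stand-in for a Utf8 array: contiguous data plus offsets,
/// where value i occupies data[offsets[i]..offsets[i + 1]].
struct OffsetArray {
    data: Vec<u8>,
    offsets: Vec<usize>,
}

/// Models the Utf8View -> Utf8 cast: the bytes of every value are copied
/// into a new contiguous buffer, whether or not the function needs them.
fn cast_to_offsets(input: &ViewArray) -> OffsetArray {
    let mut data = Vec::new();
    let mut offsets = vec![0];
    for &(start, len) in &input.views {
        data.extend_from_slice(&input.buffer[start..start + len]);
        offsets.push(data.len());
    }
    OffsetArray { data, offsets }
}
```

A function that accepts views directly can skip this step and read each value from the shared buffer in place.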

Describe the solution you'd like

Port all string functions to support DataType::Utf8View directly

Describe alternatives you've considered

No response

Additional context

See coordination plan with @tshauck and myself here: https://github.com/apache/datafusion/pull/11787#discussion_r1702294173

alamb commented 1 month ago

One thing I have noticed during the implementations is that some functions, such as ltrim/rtrim/btrim, could be more efficient if they produced Utf8View as output in addition to accepting it as input.

For example, as discussed with @Kev1n8 in https://github.com/apache/datafusion/pull/11920#discussion_r1713618503, it is probably a good idea to always generate StringView as output (rather than StringArray), as it could avoid a copy.

I am thinking that once all the string functions support StringView as input, we can do a second pass and optimize some functions so they also produce StringView as output.
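
As a rough illustration of why view output can avoid a copy, here is a sketch in plain Rust rather than the actual DataFusion or arrow code. `View`, `ltrim_view`, and `ltrim_copy` are hypothetical names, the view is simplified to an (offset, len) pair into a shared buffer, and only ASCII spaces are trimmed.

```rust
/// Hypothetical simplified view: an (offset, len) pair into a shared buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    offset: usize,
    len: usize,
}

/// Counts the leading ASCII spaces of the value the view points at.
fn leading_spaces(buffer: &[u8], view: View) -> usize {
    buffer[view.offset..view.offset + view.len]
        .iter()
        .take_while(|&&b| b == b' ')
        .count()
}

/// ltrim with view output: only the offset and length change, and no
/// string bytes are copied.
fn ltrim_view(buffer: &[u8], view: View) -> View {
    let skipped = leading_spaces(buffer, view);
    View { offset: view.offset + skipped, len: view.len - skipped }
}

/// ltrim with owned output: the remaining bytes are copied into a new
/// allocation, which is the copy that view output would avoid.
fn ltrim_copy(buffer: &[u8], view: View) -> String {
    let skipped = leading_spaces(buffer, view);
    let start = view.offset + skipped;
    let end = view.offset + view.len;
    String::from_utf8_lossy(&buffer[start..end]).into_owned()
}
```

Both functions return the same logical string. The difference is that `ltrim_view` does a constant amount of work after the scan, while `ltrim_copy` also copies every remaining byte.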

2010YOUY01 commented 1 month ago

Inspired by @Omega359's great PR https://github.com/apache/datafusion/pull/11941, I have some suggestions on testing Utf8View support for functions:

Although most implementations are adapted from existing ones, execution takes a different path, so I think comprehensive end-to-end tests are still needed. The good news is that sqllogictests already exist for the original string functions, so the only thing left to do is duplicate the existing tests with Utf8View input.

Here are examples of how to adapt existing test cases for Utf8View input:

  1. For functions that take scalar values, use arrow_cast(), as in https://github.com/apache/datafusion/pull/11941/files#diff-51757b2b1d0a07b88551d88eabeba7f74e11b5217e44203ac7c6f613c0221196
  2. For functions that read from a table, the string column can be converted to a Utf8View column, as in https://github.com/apache/datafusion/blob/2cf09566af7d7d5f83a8bdff5f0adda97d40deee/datafusion/sqllogictest/test_files/string_view.slt#L30-L42

alamb commented 3 weeks ago

We are making pretty good progress here -- just a few more functions left 🚀