Open alamb opened 2 months ago
I would propose that this change, when made, happens only after https://github.com/apache/datafusion/issues/12119 lands.
This is probably a good exercise now
Should this be the default for all functions that handle strings now? That is, require that they accept Utf8View as input for all fields and produce Utf8View output?
I ask because that should solve issues with chained functions causing a series of casts, such as
<function 1 = Utf8 -> Utf8View>, <function 2 = Utf8 -> Utf8> ...
> Should this be the default for all functions that handle strings now? Require they can accept Utf8View as input for all fields and produce Utf8View output?
I am not sure
I think each function should produce Utf8 / Utf8View depending on what makes sense (Utf8 is more efficient if you have to rewrite all the strings anyways)
Once string functions accept either Utf8 or Utf8View, I think that would avoid the chained casting you are describing.
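To make the "accept either Utf8/Utf8View" idea concrete, here is a minimal sketch in plain Rust. Everything in it is hypothetical and for illustration only: `StrColumn`, `values`, `trim`, and `upper` are stand-ins, not DataFusion or arrow-rs APIs, and both variants hold a `Vec<String>` rather than real Arrow buffers. The point is only that each function reads whichever encoding it is given and picks its own output encoding, so no cast is needed between chained calls.

```rust
/// Hypothetical stand-in for a string column in one of the two encodings.
/// Real Arrow arrays use offset buffers (Utf8) or views (Utf8View); this
/// model stores plain Strings in both cases to stay self-contained.
#[derive(Debug, Clone, PartialEq)]
enum StrColumn {
    Utf8(Vec<String>),
    Utf8View(Vec<String>),
}

impl StrColumn {
    /// Read access that works for either encoding, so callers never cast.
    fn values(&self) -> &[String] {
        match self {
            StrColumn::Utf8(v) | StrColumn::Utf8View(v) => v.as_slice(),
        }
    }
}

/// Accepts either encoding. Trimming only narrows each string, so a view
/// output is the natural choice (this model still copies; an assumption
/// is that a real implementation could reuse the input bytes).
fn trim(input: &StrColumn) -> StrColumn {
    let out: Vec<String> = input.values().iter().map(|s| s.trim().to_string()).collect();
    StrColumn::Utf8View(out)
}

/// Accepts either encoding. Every string is rewritten, so the contiguous
/// Utf8 form is a reasonable output.
fn upper(input: &StrColumn) -> StrColumn {
    let out: Vec<String> = input.values().iter().map(|s| s.to_uppercase()).collect();
    StrColumn::Utf8(out)
}
```

With these definitions, `upper(&trim(&col))` chains directly whether `col` is `Utf8` or `Utf8View`, and each function returns the encoding that makes sense for it.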
Is your feature request related to a problem or challenge?
Part of https://github.com/apache/datafusion/issues/11752
StringView is a new arrow array type that allows for more efficient string processing -- specifically it allows string data to be adjusted without copying the underlying data
See this blog post for more details: https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/
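As a rough illustration of why a view-based layout lets a substring be taken without copying, here is a simplified model in plain Rust. It is a sketch under stated assumptions, not the arrow-rs layout: the `View` struct and its methods are hypothetical, offsets are 0-based byte positions (SQL `substr` is 1-based and character-based), and the input is assumed to be ASCII so that byte slicing is safe.

```rust
use std::sync::Arc;

/// Hypothetical, simplified model of one string-view entry: a shared
/// buffer plus an (offset, len) window into it.
#[derive(Clone, Debug)]
struct View {
    buffer: Arc<str>,
    offset: u32,
    len: u32,
}

impl View {
    /// A view covering the whole buffer.
    fn new(buffer: Arc<str>) -> Self {
        let len = buffer.len() as u32;
        View { buffer, offset: 0, len }
    }

    /// The bytes this view currently points at.
    fn as_str(&self) -> &str {
        let start = self.offset as usize;
        &self.buffer[start..start + self.len as usize]
    }

    /// Byte-based substring: only the window changes, the buffer is shared.
    /// `start` and `count` are clamped to the current view.
    fn substr(&self, start: u32, count: u32) -> View {
        let start = start.min(self.len);
        let count = count.min(self.len - start);
        View {
            buffer: Arc::clone(&self.buffer), // bumps a refcount, copies no string data
            offset: self.offset + start,
            len: count,
        }
    }
}
```

For example, `View::new(Arc::from("hello world")).substr(6, 5)` yields a view whose `as_str()` is `"world"` while still pointing at the original allocation. An offset-based layout would instead have to write the five bytes into a new contiguous buffer.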
@Kev1n8 added support for StringView to the substr function in https://github.com/apache/datafusion/pull/12044
At the moment substr produces a StringArray output when the input is StringArray, but we could actually generate a StringViewArray as output which would be more efficient in most cases (avoids copying the string values)
However, in order to avoid errors when substr is used in an expression, we need to make sure that all the rest of the String functions support StringView as input as well. Aka we should wait for the "Required for enabling StringView by default" list on https://github.com/apache/datafusion/issues/11752 to be completed
Describe the solution you'd like
substr to be StringViewArray when the input is StringArray (note for LargeStringArray we will still need to copy the data I think as StringView is limited to 2^32 bytes)
substr to use StringView internally
Describe alternatives you've considered
No response
Additional context
Note that @Kev1n8 has already added support for StringView to the substr function in https://github.com/apache/datafusion/pull/12044
They also suggested this same optimization could be applied: https://github.com/apache/datafusion/pull/12044#issuecomment-2316111793