Open HaoYang670 opened 2 years ago
Thanks @HaoYang670
I suggest:
substring_bytes_unchecked
substring_bytes
substring
Hi @alamb. After thinking twice about the renaming, I suggest we leave the substring by byte
as the implicit version, and make substring by char
as an explicit version, so that we can make minimize API change (because the current substring
is by byte
) and easily extend the function:
function name | Supported array |
---|---|
substring | StringArray(by byte)(slow), BinaryArray, FixedSizeBinary, FixedSizedListArray, ListArray |
unsafe substring | StringArray(by byte)(fast), and more if we find invalid outputs in the future |
substring_by_char | StringArray |
In this way, the back compatibility will not be broken.
Thank you for your thoughts @HaoYang670
I still think switching substring
to use chars is preferable because:
substring
substring
(by bytes) safe, the performance will regress which some people will regard as backwards incompatible as wellPerhaps other maintainers such as @nevi-me @viirya @sunchao, @tustvold and @jhorstmann have some thoughts about if/how we should change substring
to handle utf8 ?
Coming at the problem from a Rust perspective, as opposed to potentially a query engine perspective, I personally would expect it to behave similarly to the standard library. That is:
That being said I don't really understand the use-cases of substring
within a query engine context, i.e. if there is some SQL primitive it maps onto or something, and so if there is some standard practice here...
With regards to the checked bytes API, whatever it is called, there is a bit-twiddling trick you can do to make it relatively cheap - see here. Assuming the contents of the string before were valid UTF-8, it is sufficient to just verify you aren't splitting a codepoint - you do not need to revalidate the entire array.
I think @tustvold 's perspective matches @HaoYang670 's proposal https://github.com/apache/arrow-rs/issues/1531#issuecomment-1097613212 . I am convinced by @HaoYang670 's proposal
there is a bit-twiddling trick you can do to make it relatively cheap - see here.
This is a fantastic point! Maybe the cost of checking utf-8 validation can be so cheap that we don't need the unsafe substring
at all!
So I think the first step is to add the utf-8 validation checking into the current substring
function and test the performance.
Hmm, @tustvold and @HaoYang670 's proposal looks good. I think It is good idea to follow up the standard library.
Bytes by default, with validation that it isn't splitting a codepoint Additional Unsafe unchecked bytes APIs Separate APIs for operating on chars.
About the proposal, my question is, do we really need two APIs for substring
on bytes? Is the validation is necessary? As we have separate API for operating on chars, users should use it for utf8 cases, I think. What use cases it could be that substring
is called to operate on bytes, but it still needs to make sure not messing up utf8 content?
Hi @viirya. In general:
Speed: substring by byte unchecked
>= substring by byte checked
>> substring by char
Safety: substring by char
> substring by byte checked
> substring by byte unchecked
. (If the input array has an invalid utf8 string, substring by byte checked
might not find it, but substring by char
can always find it)
User-friendly: substring by char
> substring by byte checked
> substring by byte unchecked
That's why we need 3 versions of substring
.
Thanks @HaoYang670 . It is clear from the above list, substring by byte checked
is always in the middle in many aspects. So it makes me wondering if it is really needed? For safety case, we want to make sure valid utf8 result, we definitely choose substring by char
. If we don't care about it, we use substring by byte unchecked
. I'm not sure what substring by byte checked
can provide us.
We can design a situation where substring by byte checked
is the best choice:
Suppose we want to get the substrings with the byte length of the shortest string in the array. We could do in this way by using the length
kernel and the aggregation
kernel:
substring(string_array, start = 0, length = Some(min(length(string_array))))
Then, give an input string_array
like this:
let input = [
"14 bytes 🗻",
"This string has 24 bytes",
"🗻 13 bytes"
];
We will get an invalid string array if using substring by byte unchecked
, which is dangerous.
By using substring by byte checked
we can find the error as early as possible.
We don't need to implement the unsafe substring
based on the benchmark result of #1577. The utf-8 validation checking only decreases the speed by about 10%.
This is an epic issue to enhance the
substring
kernel from performance, safety and supported types. You can find more info from the discussion: https://github.com/apache/arrow-rs/pull/1529#discussion_r846762535List of individual tasks is below:
Performance
Safety
New Features