HaoYang670 commented 2 years ago

This is an epic issue to enhance the substring kernel from performance, safety and supported types. You can find more info from the discussion: https://github.com/apache/arrow-rs/pull/1529#discussion_r846762535

List of individual tasks is below:

Performance

[x] #1512
[x] #1800

Safety

[x] #1529
[x] #1577
[x] #1639

New Features

[x] #1608
[x] #1618
[x] #1656
[x] #1768
[ ] Add support for list array
[ ] #1887
[ ] Add support for map array

alamb commented 2 years ago

Thanks @HaoYang670

I suggest:

Mark the current substring function as unsafe AND RENAME TO substring_bytes_unchecked
Implement safe substring for string array AND NAME substring_bytes
Implement substring_by_char for string array AND NAME substring

HaoYang670 commented 2 years ago

Hi @alamb. After thinking twice about the renaming, I suggest we leave the substring by byte as the implicit version, and make substring by char as an explicit version, so that we can make minimize API change (because the current substring is by byte) and easily extend the function:

function name	Supported array
substring	StringArray(by byte)(slow), BinaryArray, FixedSizeBinary, FixedSizedListArray, ListArray
unsafe substring	StringArray(by byte)(fast), and more if we find invalid outputs in the future
substring_by_char	StringArray

In this way, the back compatibility will not be broken.

alamb commented 2 years ago

Thank you for your thoughts @HaoYang670

I still think switching substring to use chars is preferable because:

For ascii (single byte utf8) text, bytes and chars are equivalent
If someone has multi-byte utf8 string data the substring calculations are likely subtlety incorrect and they are in danger if creating invalid utf8 when using substring
If we make substring (by bytes) safe, the performance will regress which some people will regard as backwards incompatible as well

Perhaps other maintainers such as @nevi-me @viirya @sunchao, @tustvold and @jhorstmann have some thoughts about if/how we should change substring to handle utf8 ?

tustvold commented 2 years ago

Coming at the problem from a Rust perspective, as opposed to potentially a query engine perspective, I personally would expect it to behave similarly to the standard library. That is:

Bytes by default, with validation that it isn't splitting a codepoint
Additional Unsafe unchecked bytes APIs
Separate APIs for operating on chars.

That being said I don't really understand the use-cases of substring within a query engine context, i.e. if there is some SQL primitive it maps onto or something, and so if there is some standard practice here...

With regards to the checked bytes API, whatever it is called, there is a bit-twiddling trick you can do to make it relatively cheap - see here. Assuming the contents of the string before were valid UTF-8, it is sufficient to just verify you aren't splitting a codepoint - you do not need to revalidate the entire array.

alamb commented 2 years ago

I think @tustvold 's perspective matches @HaoYang670 's proposal https://github.com/apache/arrow-rs/issues/1531#issuecomment-1097613212 . I am convinced by @HaoYang670 's proposal

HaoYang670 commented 2 years ago

there is a bit-twiddling trick you can do to make it relatively cheap - see here.

This is a fantastic point! Maybe the cost of checking utf-8 validation can be so cheap that we don't need the unsafe substring at all!

HaoYang670 commented 2 years ago

So I think the first step is to add the utf-8 validation checking into the current substring function and test the performance.

viirya commented 2 years ago

Hmm, @tustvold and @HaoYang670 's proposal looks good. I think It is good idea to follow up the standard library.

Bytes by default, with validation that it isn't splitting a codepoint Additional Unsafe unchecked bytes APIs Separate APIs for operating on chars.

About the proposal, my question is, do we really need two APIs for substring on bytes? Is the validation is necessary? As we have separate API for operating on chars, users should use it for utf8 cases, I think. What use cases it could be that substring is called to operate on bytes, but it still needs to make sure not messing up utf8 content?

HaoYang670 commented 2 years ago

Hi @viirya. In general: Speed: substring by byte unchecked >= substring by byte checked >> substring by char Safety: substring by char > substring by byte checked > substring by byte unchecked. (If the input array has an invalid utf8 string, substring by byte checked might not find it, but substring by char can always find it) User-friendly: substring by char > substring by byte checked > substring by byte unchecked

That's why we need 3 versions of substring.

viirya commented 2 years ago

Thanks @HaoYang670 . It is clear from the above list, substring by byte checked is always in the middle in many aspects. So it makes me wondering if it is really needed? For safety case, we want to make sure valid utf8 result, we definitely choose substring by char. If we don't care about it, we use substring by byte unchecked. I'm not sure what substring by byte checked can provide us.

HaoYang670 commented 2 years ago

We can design a situation where substring by byte checked is the best choice:

Suppose we want to get the substrings with the byte length of the shortest string in the array. We could do in this way by using the length kernel and the aggregation kernel:

substring(string_array, start = 0, length = Some(min(length(string_array))))

Then, give an input string_array like this:

let input  = [
    "14 bytes  🗻",
    "This string has 24 bytes",
    "🗻 13 bytes"
];

We will get an invalid string array if using substring by byte unchecked, which is dangerous. By using substring by byte checked we can find the error as early as possible.

HaoYang670 commented 2 years ago

We don't need to implement the unsafe substring based on the benchmark result of #1577. The utf-8 validation checking only decreases the speed by about 10%.

apache / arrow-rs

Epic: enhance the `substring` kernel #1531

List of individual tasks is below:

Performance

Safety

New Features