apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.52k stars 3.53k forks source link

[C++] Better support optional start/stop in "utf8_slice_codeunits" kernel #34929

Open jorisvandenbossche opened 1 year ago

jorisvandenbossche commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

There have been various (slightly different) bugs reported about using "utf8_slice_codeunits" with optional start or stop. The stop argument is already optional and translated into the largest int to indicate to always slice until the end, but that internal "workaround" also produces some bugs in the current implementation due to integer overflows.

Potentially, we could use a different mechanism to signal a default start/stop, such as using std::optional<int64_t> instead of std::numeric_limits<int64_t>::max()

Listing the related issues:

The option class is also used for "binary_slice" kernel.

Component(s)

C++

pitrou commented 1 year ago

cc @benibus . This would slightly break the C++ API so we have to make sure this would make things better.

benibus commented 1 year ago

If we were to modify SliceOptions directly then that would also affect the ascii kernel as well. Not opposed to doing that, just wanted to give a heads-up.

In any case, I'll probably try this with a distinct options class first to see how it goes with one of the kernels. Then we can determine if the breaking change is justified.

pitrou commented 1 year ago

If we were to modify SliceOptions directly then that would also affect the ascii kernel as well. Not opposed to doing that, just wanted to give a heads-up.

I think it's fine to affect both kernels, and actually it's quite logical as well.

If std::optional had been available to us before, we would probably have used it from the start here.