apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.58k stars 3.54k forks source link

[Python][C++] pyarrow.repeat returns an invalid array when a chunked array is required. #36388

Open daniel-shields opened 1 year ago

daniel-shields commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

pyarrow.repeat silently fails and returns an invalid array when the result cannot fit in a single chunk.

import sys
import pyarrow as pa
print(f"{sys.version}")
print(f"{pa.__version__=}")

array = pa.repeat("?", 2**31 - 1)
print(array)
array = pa.repeat("?", 2**31)
print(array)

output:

3.9.12 (main, Jan  1 2020, 00:00:00) 
[GCC 8.3.0]
pa.__version__='12.0.0'
[
  "?",
  "?",
  ...
  "?",
  "?",
]
<Invalid array: Negative offsets in binary array>

Component(s)

Python

AlenkaF commented 1 year ago

I agree the code should give an error when the result doesn't fit into an array.

Arrow format represents arrays of length up to the maximum 32-bit signed integer: https://arrow.apache.org/docs/format/Columnar.html#array-lengths

and pyarrow.repeat or MakeArrayFromScalar function in C++ returns an Array, not a ChunkedArray: https://github.com/apache/arrow/blob/cd1ed18fd1e08912ea47b64edf55be9c046375c4/python/pyarrow/array.pxi#L426 https://github.com/apache/arrow/blob/cd1ed18fd1e08912ea47b64edf55be9c046375c4/r/src/scalar.cpp#L61

If the length is exceeding the limit, it should raise an informative error on the C++ side.

AlenkaF commented 1 year ago

Oh sorry, the last link to the C++ is wrong. I meant to add this:

https://github.com/apache/arrow/blob/fb8760e4d749c718d95fa784600f51e8b6fd2f43/cpp/src/arrow/array/util.cc#L851

After talking to @jorisvandenbossche about this issue I would like to add a proposed fix for it in the C++ here (contributions welcome).

The Negative offsets in binary array message is coming from CreateOffsetsBuffer:

https://github.com/apache/arrow/blob/fb8760e4d749c718d95fa784600f51e8b6fd2f43/cpp/src/arrow/array/util.cc#L808-L817

where the value_length * (the number of repetitions) exceeds the int64 limit.

We could add a check for the overflow in the RepeatedArrayFactory for binary type:

https://github.com/apache/arrow/blob/fb8760e4d749c718d95fa784600f51e8b6fd2f43/cpp/src/arrow/array/util.cc#L638-L649

using MultiplyWithOverflow, something similar to what we do here:

https://github.com/apache/arrow/blob/fb8760e4d749c718d95fa784600f51e8b6fd2f43/python/pyarrow/src/arrow/python/datetime.h#L162-L163

llama90 commented 1 year ago

@AlenkaF Hello.

I have taken an interest in this issue and upon review, I find some aspects to be ambiguous and would like to ask some questions.

It appears that in order to avoid overflow in this issue, the following two scenarios need to be considered:

  1. Initially, is the repetition count(e.g., 2**31 -1) greater than int32_t or less than 0?
  2. Does the product of the repeated value (e.g., ?) and repetition count trigger MultiplyWithOverflow?

Moreover, it seems possible to allocate the repetition count within the int64_t range, but when calling the c++ function from python code, is there a type conversion happening, leading the user to discover an error related to int32_t?

I wanted to test the cases exceeding int32_t, but I encountered an error like "mimalloc: warning: unable to allocate aligned OS memory directly, fall back to over-allocation (8598323200 bytes, address: 0x300800000, alignment: 67108864, commit: 1)" and was unable to proceed beyond this point.

Considering these points, should the range for repeat size be restricted to int32_t or should it be limited to int64_t?

AlenkaF commented 1 year ago

It appears that in order to avoid overflow in this issue, the following two scenarios need to be considered:

  1. Initially, is the repetition count(e.g., 2**31 -1) greater than int32_t or less than 0?
  2. Does the product of the repeated value (e.g., ?) and repetition count trigger MultiplyWithOverflow?

The main thing I would check is that value size multiplied with number of repetitions passes MultiplyWithOverflow when creating offsets buffer for the array with repeated values.

Moreover, it seems possible to allocate the repetition count within the int64_t range, but when calling the c++ function from python code, is there a type conversion happening, leading the user to discover an error related to int32_t?

I do not think there are any conversions happening. The int64_t range is the limit when creating the offset buffer and not a limit to the number of repetitions. Restricting the number of repetitions will not work, I think. For example if the string has a bigger value (length) then the limit for the number of repetitions will be even lower?