Open daniel-shields opened 1 year ago
I agree the code should give an error when the result doesn't fit into an array.
Arrow format represents arrays of length up to the maximum 32-bit signed integer: https://arrow.apache.org/docs/format/Columnar.html#array-lengths
and pyarrow.repeat
or MakeArrayFromScalar
function in C++ returns an Array, not a ChunkedArray:
https://github.com/apache/arrow/blob/cd1ed18fd1e08912ea47b64edf55be9c046375c4/python/pyarrow/array.pxi#L426
https://github.com/apache/arrow/blob/cd1ed18fd1e08912ea47b64edf55be9c046375c4/r/src/scalar.cpp#L61
If the length is exceeding the limit, it should raise an informative error on the C++ side.
Oh sorry, the last link to the C++ is wrong. I meant to add this:
After talking to @jorisvandenbossche about this issue I would like to add a proposed fix for it in the C++ here (contributions welcome).
The Negative offsets in binary array message is coming from CreateOffsetsBuffer
:
where the value_length
* (the number of repetitions) exceeds the int64
limit.
We could add a check for the overflow in the RepeatedArrayFactory
for binary type:
using MultiplyWithOverflow
, something similar to what we do here:
@AlenkaF Hello.
I have taken an interest in this issue and upon review, I find some aspects to be ambiguous and would like to ask some questions.
It appears that in order to avoid overflow in this issue, the following two scenarios need to be considered:
repetition count
(e.g., 2**31 -1
) greater than int32_t
or less than 0
?repeated value
(e.g., ?
) and repetition count
trigger MultiplyWithOverflow
?Moreover, it seems possible to allocate the repetition count
within the int64_t
range, but when calling the c++ function from python code, is there a type conversion happening, leading the user to discover an error related to int32_t
?
I wanted to test the cases exceeding int32_t
, but I encountered an error like "mimalloc: warning: unable to allocate aligned OS memory directly, fall back to over-allocation (8598323200 bytes, address: 0x300800000, alignment: 67108864, commit: 1)" and was unable to proceed beyond this point.
Considering these points, should the range for repeat size be restricted to int32_t
or should it be limited to int64_t
?
It appears that in order to avoid overflow in this issue, the following two scenarios need to be considered:
- Initially, is the
repetition count
(e.g.,2**31 -1
) greater thanint32_t
or less than0
?- Does the product of the
repeated value
(e.g.,?
) andrepetition count
triggerMultiplyWithOverflow
?
The main thing I would check is that value size multiplied with number of repetitions passes MultiplyWithOverflow
when creating offsets buffer for the array with repeated values.
Moreover, it seems possible to allocate the
repetition count
within theint64_t
range, but when calling the c++ function from python code, is there a type conversion happening, leading the user to discover an error related toint32_t
?
I do not think there are any conversions happening. The int64_t
range is the limit when creating the offset buffer and not a limit to the number of repetitions. Restricting the number of repetitions will not work, I think. For example if the string has a bigger value (length) then the limit for the number of repetitions will be even lower?
Describe the bug, including details regarding any error messages, version, and platform.
pyarrow.repeat silently fails and returns an invalid array when the result cannot fit in a single chunk.
output:
Component(s)
Python