apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Incorrect importing of buffer protocol objects #43792

Open kylebarron opened 1 month ago

kylebarron commented 1 month ago

Describe the bug, including details regarding any error messages, version, and platform.

pyarrow seems to be applying some upcasting when importing data via the buffer protocol. This was unexpected behavior to me and could be considered a bug.

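In the examples below, pyarrow seems to cast float32 to float64, and both uint64 and uint32 to int64, while float64 is imported unchanged.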

As expected:

import numpy as np
import pyarrow as pa

arr = np.array([1.0, 2.0, 3.0], dtype=np.float64)
pa.array(memoryview(arr))
# <pyarrow.lib.DoubleArray object at 0x1232429e0>
# [
#   1,
#   2,
#   3
# ]

Unexpected casts:

arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)
pa.array(memoryview(arr))
# <pyarrow.lib.DoubleArray object at 0x1232e0a00>
# [
#   1,
#   2,
#   3
# ]

arr = np.array([1.0, 2.0, 3.0], dtype=np.uint64)
pa.array(memoryview(arr))
# <pyarrow.lib.Int64Array object at 0x1232e26e0>
# [
#   1,
#   2,
#   3
# ]

arr = np.array([1.0, 2.0, 3.0], dtype=np.uint32)
pa.array(memoryview(arr))
# <pyarrow.lib.Int64Array object at 0x1232e1de0>
# [
#   1,
#   2,
#   3
# ]

Component(s)

Python

jorisvandenbossche commented 1 month ago

Related issue: https://github.com/apache/arrow/issues/38137 (it also shows a workaround for zero-copy converting such an object to Arrow today)

While this indeed seems unexpected, the underlying issue is that we simply don't have specific support for objects implementing the buffer protocol. We only special-case numpy arrays and pandas array-likes. As a result, we treat the memoryview as a generic Python sequence, essentially converting it to a list of Python floats (or ints) before converting to Arrow, and the original element width and signedness are lost along the way.
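To make the "generic Python sequence" point concrete, here is a minimal sketch using only the standard library. It uses `array.array("f", ...)` as a stand-in for the float32 numpy array from the report (that substitution is my assumption, chosen so the snippet needs no third-party packages). It shows that the buffer itself knows its element type, but iterating over it yields plain Python floats with no trace of the 32-bit width.

```python
import array

# Stand-in for the float32 NumPy array in the report: a stdlib buffer of C floats.
buf = array.array("f", [1.0, 2.0, 3.0])
view = memoryview(buf)

# The buffer itself still knows its element type and width.
print(view.format, view.itemsize)

# Treating the view as a generic sequence yields plain Python floats,
# which carry no record of the original 32-bit width.
items = list(view)
print(items)
print({type(x).__name__ for x in items})
```

Type inference over such a list can only see Python floats, which would explain why the result is a float64 array. The same reasoning applies to the unsigned integer cases, where the items become plain Python ints.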

We should expand pa.array() to support objects implementing the buffer protocol (e.g. by converting to a numpy array in that case and using the existing code path for numpy). I am not sure you can easily check from Python whether an object supports the buffer protocol, but given that this function lives in Cython, I assume we can use something like PyObject_CheckBuffer.
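As a rough illustration of that idea from the Python side, the sketch below checks for buffer support by attempting to create a memoryview, then routes buffer objects through numpy. Both `supports_buffer` and `buffer_to_arrow` are hypothetical helper names, not pyarrow APIs, and this is not how pa.array() is implemented. A Cython version would presumably call PyObject_CheckBuffer instead.

```python
def supports_buffer(obj) -> bool:
    """Hypothetical helper: True if obj exports the buffer protocol."""
    try:
        # memoryview() raises TypeError for objects without buffer support.
        with memoryview(obj):
            return True
    except TypeError:
        return False


def buffer_to_arrow(obj):
    """Hypothetical helper: route buffer objects through the NumPy code path."""
    # Imported lazily so supports_buffer stays usable without these packages.
    import numpy as np
    import pyarrow as pa

    if supports_buffer(obj):
        # np.asarray reads the buffer's format, so float32 stays float32.
        return pa.array(np.asarray(memoryview(obj)))
    # Anything else falls back to the existing generic conversion.
    return pa.array(obj)
```

With this, `buffer_to_arrow(memoryview(arr))` for a float32 `arr` should go through the numpy path rather than the generic sequence path. The sketch assumes a one-dimensional buffer. On Python 3.12 and later, `isinstance(obj, collections.abc.Buffer)` is another way to perform the check.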