apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Array construction from numpy array is unclear about zero copy behaviour #28415

Open asfimport opened 3 years ago

asfimport commented 3 years ago

When building an Arrow array from a numpy array, it is very confusing from the user's point of view that the result is not always a new, independent array.

Under the hood, Arrow reuses the numpy array's memory when no cast is needed:


import numpy as np
import pyarrow as pa

npa = np.array([1, 2, 3]*3)
arrow_array = pa.array(npa, type=pa.int64())
npa[npa == 2] = 10
print(arrow_array.to_pylist())
# Prints: [1, 10, 3, 1, 10, 3, 1, 10, 3]

and copies the data when a cast is involved:


npa = np.array([1, 2, 3]*3)
arrow_array = pa.array(npa, type=pa.int32())
npa[npa == 2] = 10
print(arrow_array.to_pylist())
# Prints: [1, 2, 3, 1, 2, 3, 1, 2, 3]

For non-primitive types, on the other hand, it always copies:


npa = np.array(["a", "b", "c"]*3)
arrow_array = pa.array(npa, type=pa.string())
npa[npa == "b"] = "X"
print(arrow_array.to_pylist())
# Prints: ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
# Different from numpy array that was modified

To use this safely, the user has to pay close attention and understand what is going on under the hood, which makes pyarrow hard to use.

A copy=True/False parameter should be added to pa.array, and the default should probably be copy=True, so that by default you can always create an Arrow array from a numpy one. (copy=False would probably have to throw an exception in cases where we can't guarantee zero copy, such as when building from a Python list.)

Reporter: Alessandro Molina / @amol-


Note: This issue was originally created as ARROW-12666. Please see the migration documentation for further details.

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche:

copy=False would probably have to throw an exception in some cases where we can't guarantee zero copy, like when building from a Python List

Alternatively, copy=False could make no guarantee that a copy is avoided and only try to avoid one where possible. That's basically the behaviour of the copy keyword in numpy.array(...).

On the general issue, I agree that the current behaviour is not ideal and can be confusing or have surprising effects. But I also think it's not that easy to change. I think a lot of people rely on the zero-copy behaviour to avoid unnecessary copies (e.g. if you convert to Arrow only to write the result directly to a Parquet file, you don't want to make an additional copy).
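For reference, the "copy only if needed" semantics on the numpy side can be sketched with np.asarray, which behaves this way across numpy versions. np.asarray is used here instead of numpy.array(..., copy=False) because, as far as I know, NumPy 2.0 changed copy=False in numpy.array to raise an error when a copy cannot be avoided.

```python
import numpy as np

a = np.array([1, 2, 3] * 3, dtype=np.int64)

# Same dtype: no copy is needed, so the result shares memory with a.
same = np.asarray(a, dtype=np.int64)
# Different dtype: a copy is unavoidable, so the result is independent.
cast = np.asarray(a, dtype=np.int32)

print(np.shares_memory(a, same))  # True
print(np.shares_memory(a, cast))  # False
```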

asfimport commented 1 year ago

Apache Arrow JIRA Bot: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

LucasG0 commented 9 months ago

As a first step, would it make sense to add the copy parameter with copy=False, meaning "try not to copy", as the default? The benefit is that current users relying on the zero-copy behavior would not be affected, while users struggling with the confusing behavior could find an explanation in the copy parameter's documentation and use copy=True if needed. The downside is that the default behavior remains confusing.