apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Support inferring nested ndarray with ndim > 1 #22081

Open asfimport opened 5 years ago

asfimport commented 5 years ago

Follow up work to ARROW-4350

Reporter: Wes McKinney / @wesm

Related issues:

Note: This issue was originally created as ARROW-5645. Please see the migration documentation for further details.

asfimport commented 5 years ago

Simeon H.K. Fitch: This GitHub issue describes the desired end state:

https://github.com/apache/arrow/issues/4802

This feature is important for users of PySpark who want to construct tensors to feed them to ML libraries such as Keras via pandas_udfs.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Nested Python lists are now inferred correctly, but we still lack inference for nested ndarrays with "object" dtype.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: While pa.array(..) does not yet support converting an n-dimensional array to a fixed-size list array, you can do the conversion manually:


In [39]: arr = np.arange(30).reshape(10, 3)

In [40]: arr
Out[40]: 
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
...

In [41]: pa.array(arr)
...
ArrowInvalid: only handle 1-dimensional arrays

In [42]: pa.FixedSizeListArray.from_arrays(arr.ravel(order="C"), arr.shape[1])
Out[42]: 
<pyarrow.lib.FixedSizeListArray object at 0x7f4025471040>
[
  [
    0,
    1,
    2
  ],
  [
    3,
    4,
    5
  ],
  [
    6,
    7,
    8
  ],
...

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche:

Nested Python lists are now inferred correctly, but we still lack inference for nested ndarrays with "object" dtype.

@pitrou that actually seems to work?


In [56]: arr = np.arange(30).reshape(10, 3)

In [57]: arr = np.array(pd.Series(list(arr), dtype=object))

In [58]: arr
Out[58]: 
array([array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8]),
       array([ 9, 10, 11]), array([12, 13, 14]), array([15, 16, 17]),
       array([18, 19, 20]), array([21, 22, 23]), array([24, 25, 26]),
       array([27, 28, 29])], dtype=object)

In [59]: pa.array(arr)
Out[59]: 
<pyarrow.lib.ListArray object at 0x7f406db85b20>
[
  [
    0,
    1,
    2
  ],
  [
    3,
    4,
    5
  ],
  [
    6,
    7,
    8
  ],
...

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Yes, it's quite possible that it would work now.

NickCrews commented 3 months ago

Can confirm this is still broken. The given "working" example is a 1D numpy array of dtype object, not a true 2D array of dtype int.