apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] automatic type inference for arrays of tuples #16725

Open asfimport opened 5 years ago

asfimport commented 5 years ago

Arrays of tuples can be converted to either a ListArray or a StructArray if you specify the type explicitly:


In [6]: pa.array([(1, 2), (3, 4, 5)], type=pa.list_(pa.int64())) 
Out[6]: 
<pyarrow.lib.ListArray object at 0x7f1b01a4d408>
[
  [
    1,
    2
  ],
  [
    3,
    4,
    5
  ]
]

In [7]: pa.array([(1, 2), (3, 4)], type=pa.struct([('a', pa.int64()), ('b', pa.int64())]))
Out[7]: 
<pyarrow.lib.StructArray object at 0x7f1b01a51b88>
-- is_valid: all not null
-- child 0 type: int64
  [
    1,
    3
  ]
-- child 1 type: int64
  [
    2,
    4
  ]

But the conversion fails when no type is specified:


In [8]: pa.array([(1, 2), (3, 4)])
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-8-ab2d80c7486d> in <module>
----> 1 pa.array([(1, 2), (3, 4)])

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Could not convert (1, 2) with type tuple: did not recognize Python value type when inferring an Arrow data type

Do we want to do automatic type inference for tuples as well (defaulting to the ListArray case, just as arrays of Python lists are supported)? Or was there a specific reason not to support this by default?
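Until such inference exists, one possible workaround is to turn the tuples into lists before conversion, so the data takes the list-of-lists path. The sketch below is only an illustration: `tuples_to_lists` is a hypothetical helper, not part of pyarrow, and it assumes the data consists of plain Python lists, tuples, and scalars.

```python
def tuples_to_lists(values):
    """Recursively replace tuples with lists (hypothetical helper, not a pyarrow API)."""
    if isinstance(values, (list, tuple)):
        return [tuples_to_lists(v) for v in values]
    # Scalars (including None) are returned unchanged.
    return values


data = [(1, 2), (3, 4, 5)]
normalized = tuples_to_lists(data)
print(normalized)  # [[1, 2], [3, 4, 5]]
```

Passing `normalized` to `pa.array` without a type should then behave like any other list of lists, at the cost of an extra copy of the data in Python.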

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-5287. Please see the migration documentation for further details.

asfimport commented 5 years ago

Antoine Pitrou / @pitrou: Since it's ambiguous, I'm not sure it's a good idea to support it. The working inference case for list arrays is a list of lists:


>>> pa.array([[1,2,3],[4,5]])
<pyarrow.lib.ListArray object at 0x7f114319eb38>
[
  [
    1,
    2,
    3
  ],
  [
    4,
    5
  ]
]

asfimport commented 5 years ago

Joris Van den Bossche / @jorisvandenbossche: Yes, I understand the "ambiguous" reason, but on the other hand, StructArray is not really an option as the default, since the struct field names would need to be known.
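To illustrate why the struct case needs extra information, here is a rough sketch that pairs each tuple with caller-supplied field names. `tuples_to_dicts` is a hypothetical helper, and the names `"a"` and `"b"` are taken from the struct example at the top of this issue; nothing in the tuples themselves provides them.

```python
def tuples_to_dicts(rows, names):
    """Pair each tuple with field names (hypothetical helper, not a pyarrow API)."""
    records = []
    for row in rows:
        if row is None:
            # Keep missing rows as None.
            records.append(None)
            continue
        if len(row) != len(names):
            raise ValueError(
                f"expected {len(names)} values per row, got {len(row)}: {row!r}"
            )
        records.append(dict(zip(names, row)))
    return records


records = tuples_to_dicts([(1, 2), (3, 4)], ["a", "b"])
print(records)  # [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
```

Since pyarrow infers a struct type from Python dicts, `pa.array(records)` would be expected to produce a StructArray, but only because the field names were given explicitly.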

Doing it automatically would allow saving such DataFrames to Parquet out of the box (see ARROW-4814), but of course, you can always specify the schema manually.

In general, it would be nice to have an error message that points people towards specifying a list or struct type if they have tuples as data. But I assume this is not that easy, as the error message looks like a generic one where the value and type are filled in.
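As a rough idea of what such a hint could look like on the Python side, the sketch below checks for tuples before conversion. It is only an assumption about one possible approach: `tuple_hint` and its message are hypothetical and do not reflect how pyarrow builds its error messages.

```python
def tuple_hint(values, arrow_type=None):
    """Return a hint string if tuples are passed without an explicit type.

    Hypothetical helper for illustration only; not a pyarrow API.
    """
    if arrow_type is None and any(isinstance(v, tuple) for v in values):
        return (
            "Input contains tuples but no type was given; "
            "pass type=pa.list_(...) or type=pa.struct([...]) explicitly."
        )
    return None


print(tuple_hint([(1, 2), (3, 4)]))
print(tuple_hint([[1, 2], [3, 4]]))  # None
```

A wrapper around `pa.array` could call this first and raise the hint as an error, leaving the generic message in place for all other unsupported values.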