apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.51k stars 3.53k forks source link

Allow projection of schemas/structs #38615

Open Fokko opened 12 months ago

Fokko commented 12 months ago

Describe the enhancement requested

For PyIceberg recently, concatenation of tables has been added: https://github.com/apache/arrow/pull/36846 To add new fields I concat the requested schema with the data that was loaded. However, now I'm hitting the next barrier, unable to project the schemas of nested structs.

Bit of context. For the top-level schema it is not an issue because we can select the columns that we need when reading in the table, but it doesn't allow selection of nested columns.

Selecting a subset:

➜  Desktop python3
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> 
>>> current_schema = pa.schema([pa.field("x", pa.float32()), pa.field("y", pa.float32())])
>>> tbl = pa.Table.from_pylist(
...     [
...         {"x": 52.371807, "y": 4.896029},
...         {"x": 52.387386, "y": 4.646219},
...         {"x": 52.078663, "y": 4.288788},
...     ],
...     schema=current_schema,
... )
>>> schema_with_z = pa.schema(
...     [
...         pa.field("x", pa.float32()),
...     ]
... )
>>> tbl.cast(schema_with_z)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3793, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x']

Or in a nested struct:

➜  Desktop python3
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> 
>>> current_schema = pa.schema(
...     pa.field(
...         "location",
...         pa.struct([pa.field("x", pa.float32()), pa.field("y", pa.float32())]),
...     )
... )
>>> 
>>> tbl = pa.Table.from_pylist(
...     [
...         {"location": {"x": 52.371807, "y": 4.896029}},
...         {"location": {"x": 52.387386, "y": 4.646219}},
...         {"location": {"x": 52.078663, "y": 4.288788}},
...     ],
...     schema=current_schema,
... )
>>> schema_without_x = pa.schema(
...     pa.field(
...         "location",
...         pa.struct(
...             [
...                 pa.field("x", pa.float32()),
...             ]
...         ),
...     )
... )
>>> tbl.cast(schema_without_x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3793, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x']

Any thoughts on adding this? Or can we achieve this in another way?

Component(s)

Python

lidavidm commented 12 months ago

There's the pyarrow.compute.struct_field kernel which can do these sorts of things for nested structs, but that is much lower level than what you're after here

I'm not quite sure casting is the right name to give this operation, but I think there is room to add an operation like what you're proposing (whether it's called casting or not)

Fokko commented 12 months ago

Thanks again for the quick response @lidavidm

The current operation that does this is called .select() which accepts names and indices. However, this would require traversing the schema for finding all the nested structs (and I'd rather not do it in Python). Also, if there are any valid promotions, another cast would be needed. Therefore I think it makes sense to make it part of the .cast operation.

Fokko commented 10 months ago

Still running into this. I would expect something like below to work:

In [1]: import pyarrow as pa

In [2]: current_schema = pa.schema([
   ...: ^Ipa.field("x", pa.float32()),
   ...: ^Ipa.field("y", pa.float32())
   ...: ])
   ...: 
   ...: tbl = pa.Table.from_pylist(
   ...:     [
   ...:         {"x": 52.371807, "y": 4.896029},
   ...:         {"x": 52.387386, "y": 4.646219},
   ...:         {"x": 52.078663, "y": 4.288788},
   ...:     ],
   ...:     schema=current_schema,
   ...: )
   ...: 
   ...: schema_with_z = pa.schema(
   ...:     [
   ...: ^I^Ipa.field("x", pa.float32()),
   ...: ^I^Ipa.field("y", pa.float32()),
   ...:         pa.field("x", pa.float32()),
   ...:     ]
   ...: )
   ...: 
   ...: tbl.cast(schema_with_z)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 23
      6 tbl = pa.Table.from_pylist(
      7     [
      8         {"x": 52.371807, "y": 4.896029},
   (...)
     12     schema=current_schema,
     13 )
     15 schema_with_z = pa.schema(
     16     [
     17         pa.field("x", pa.float32()), 
   (...)
     20     ]
     21 )
---> 23 tbl.cast(schema_with_z)

File /opt/homebrew/lib/python3.11/site-packages/pyarrow/table.pxi:3793, in pyarrow.lib.Table.cast()

ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x', 'y', 'x']