Open Fokko opened 12 months ago
There's the pyarrow.compute.struct_field
kernel which can do these sorts of things for nested structs, but that is much lower level than what you're after here
I'm not quite sure casting is the right name to give this operation, but I think there is room to add an operation like what you're proposing (whether it's called casting or not)
Thanks again for the quick response @lidavidm
The current operation that does this is called .select()
which accepts names and indices. However, this would require traversing the schema for finding all the nested structs (and I'd rather not do it in Python). Also, if there are any valid promotions, another cast would be needed. Therefore I think it makes sense to make it part of the .cast
operation.
Still running into this. I would expect something like below to work:
In [1]: import pyarrow as pa
In [2]: current_schema = pa.schema([
...: ^Ipa.field("x", pa.float32()),
...: ^Ipa.field("y", pa.float32())
...: ])
...:
...: tbl = pa.Table.from_pylist(
...: [
...: {"x": 52.371807, "y": 4.896029},
...: {"x": 52.387386, "y": 4.646219},
...: {"x": 52.078663, "y": 4.288788},
...: ],
...: schema=current_schema,
...: )
...:
...: schema_with_z = pa.schema(
...: [
...: ^I^Ipa.field("x", pa.float32()),
...: ^I^Ipa.field("y", pa.float32()),
...: pa.field("x", pa.float32()),
...: ]
...: )
...:
...: tbl.cast(schema_with_z)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[2], line 23
6 tbl = pa.Table.from_pylist(
7 [
8 {"x": 52.371807, "y": 4.896029},
(...)
12 schema=current_schema,
13 )
15 schema_with_z = pa.schema(
16 [
17 pa.field("x", pa.float32()),
(...)
20 ]
21 )
---> 23 tbl.cast(schema_with_z)
File /opt/homebrew/lib/python3.11/site-packages/pyarrow/table.pxi:3793, in pyarrow.lib.Table.cast()
ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x', 'y', 'x']
Describe the enhancement requested
For PyIceberg recently, concatenation of tables has been added: https://github.com/apache/arrow/pull/36846 To add new fields I concat the requested schema with the data that was loaded. However, now I'm hitting the next barrier, unable to project the schemas of nested structs.
Bit of context. For the top-level schema it is not an issue because we can select the columns that we need when reading in the table, but it doesn't allow selection of nested columns.
Selecting a subset:
Or in a nested struct:
Any thoughts on adding this? Or can we achieve this in another way?
Component(s)
Python