apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Make Table.cast(schema) more flexible regarding order of fields / missing fields? #27425

Open asfimport opened 3 years ago

asfimport commented 3 years ago

Currently, Table.cast requires a target schema with exactly the same field names in exactly the same order (the implementation simply checks self.schema.names != target_schema.names and raises if they differ). Example:


>>> table = pa.table({'a': [1, 2, 3], 'b': [.1, .2, .3]})
>>> table
pyarrow.Table
a: int64
b: double

>>> schema = pa.schema([('a', pa.int32()), ('b', pa.float32())])
>>> table.cast(schema)
pyarrow.Table
a: int32
b: float

>>> schema2 = pa.schema([('b', pa.float32()), ('a', pa.int32())])
>>> table.cast(schema2)
....
ValueError: Target schema's field names are not matching the table's field names: ['a', 'b'], ['b', 'a']

Do we want to make this more flexible? Allow a different order? (And then follow the order of the passed schema or of the original table?) Allow missing fields? (And then use the fields of the schema to "subset" the table as well?)

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-11553. Please see the migration documentation for further details.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Such options could be added to CastOptions I guess.

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: The current implementation of Table.cast is actually not using the compute cast, but iteratively casting each column (so in this case, CastOptions is not involved).

I don't know if compute::Cast is expected to support Tables?


>>> import pyarrow.compute as pc
>>> table = pa.table({'a': [1, 2, 3], 'b': [.1, .2, .3]})
>>> schema = pa.schema([('a', pa.int32()), ('b', pa.float32())])
>>> pc.cast(table, schema)
...
TypeError: DataType expected, got <class 'pyarrow.lib.Schema'>
>>> pc.cast(table, pa.float64())
Segmentation fault (core dumped)

asfimport commented 3 years ago

Antoine Pitrou / @pitrou:

> I don't know if compute::Cast is expected to support Tables?

I have no idea. If the functionality is useful, it should be available from C++ anyhow, though.

cc @bkietz

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: In compute terms, and if we allow field reordering/selection as well, that's probably becoming a projection.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Well, there's already a compute function named "project", though I have no idea whether it fits the bill.

asfimport commented 3 years ago

Ben Kietzman / @bkietz: The project function is primarily intended for use in the context of dataset expressions, since expressions (and not just calls to cast) can be provided, which are evaluated to produce the resulting columns (see ARROW-11174). It would not be very helpful for this use case, since it would only serve to collect and name columns, which is pretty trivial in Python anyway. On the other hand, it might be a useful pre-dataframe feature to expedite evaluation of a dataset expression against a table, which would allow you to write something like:


new_table = ds.project({
  'renamed_a': ds.field('a'),
  'b_float': ds.cast(ds.field('b'), to_type='float32'),
  'c_plus_d': ds.field('c') + ds.field('d'),
}).evaluate(some_table)