Antoine Pitrou / @pitrou:
Such options could be added to CastOptions, I guess.
Joris Van den Bossche / @jorisvandenbossche:
The current implementation of Table.cast is actually not using the compute cast, but iteratively casting each column (so in this case, CastOptions is not involved).
I don't know if compute::Cast is expected to support Tables?
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> table = pa.table({'a': [1, 2, 3], 'b': [.1, .2, .3]})
>>> schema = pa.schema([('a', pa.int32()), ('b', pa.float32())])
>>> pc.cast(table, schema)
...
TypeError: DataType expected, got <class 'pyarrow.lib.Schema'>
>>> pc.cast(table, pa.float64())
Segmentation fault (core dumped)
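For reference, a rough sketch of the column-by-column route that does work today (this only mirrors the behavior described above, it is not the actual Table.cast implementation; the table and schema are the ones from the snippet):

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({'a': [1, 2, 3], 'b': [.1, .2, .3]})
schema = pa.schema([('a', pa.int32()), ('b', pa.float32())])

# Cast each column individually with the compute cast and reassemble the
# table, roughly the per-column approach mentioned in the comment above.
casted = pa.table(
    {name: pc.cast(table[name], schema.field(name).type) for name in schema.names}
)
print(casted.schema)  # columns are now int32 / float32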
Antoine Pitrou / @pitrou:
> I don't know if compute::Cast is expected to support Tables?
I have no idea. If the functionality is useful, it should be available from C++ anyhow, though.
cc @bkietz
Joris Van den Bossche / @jorisvandenbossche: In compute terms, and if we allow field reordering/selection as well, that's probably becoming a projection.
Antoine Pitrou / @pitrou: Well, there's already a compute function named "project", though I have no idea whether it fits the bill.
Ben Kietzman / @bkietz:
The project function is primarily intended for use in the context of dataset expressions, since expressions (and not just calls to cast) can be provided, which are evaluated to produce the resulting columns (see ARROW-11174). It would not be very helpful for this use case, since it would only serve to collect and name columns, which is pretty trivial in Python anyway. On the other hand, it might be a useful pre-dataframe feature to expedite evaluation of a dataset expression against a table, which would allow you to write something like:
new_table = ds.project({
    'renamed_a': ds.field('a'),
    'b_float': ds.cast(ds.field('b'), to_type='float32'),
    'c_plus_d': ds.field('c') + ds.field('d'),
}).evaluate(some_table)
Joris Van den Bossche / @jorisvandenbossche (original issue description):
Currently, Table.cast requires a new schema with exactly the same names and the same order of those names (it simply does a self.schema.names != target_schema.names: raise ... check); a sketch of that behavior follows below.
Do we want to make this more flexible?
- Allow a different order? (and then follow the order of the passed schema, or of the original table?)
- Allow missing fields? (and then use the fields of the schema to "subset" the table as well?)
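A minimal sketch of that strict check (the table and column names are made up for illustration, not taken from the original report):

import pyarrow as pa

table = pa.table({'a': [1, 2, 3], 'b': [.1, .2, .3]})

# Same names in the same order: the cast succeeds
table.cast(pa.schema([('a', pa.int32()), ('b', pa.float32())]))

# Reordered (or subsetted) fields: this raises, because only the
# positional list of field names is compared
table.cast(pa.schema([('b', pa.float32()), ('a', pa.int32())]))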
Reporter: Joris Van den Bossche / @jorisvandenbossche
Note: This issue was originally created as ARROW-11553. Please see the migration documentation for further details.