Open ion-elgreco opened 1 month ago
These proposals generally sound good to me. I do think care should be taken around the first two points, since the DataFrame cache() and collect() methods shadow the underlying Rust library, and renaming those methods at the Python level would be immensely confusing for those coming from the Rust library or those seeking to better understand the Python layer.
The other suggestion I might add is to keep Datafusion.with_column() but make it a simple wrapper around Datafusion.with_columns().
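To make the wrapper idea concrete, here is a minimal sketch. It does not use datafusion-python at all: ToyFrame, Row and Expr are hypothetical stand-ins (rows are plain dicts and an "expression" is just a callable), and the keyword-argument form of with_columns is only one of the input styles the proposal mentions.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Expr = Callable[[Row], Any]  # hypothetical stand-in for a DataFusion expression


class ToyFrame:
    """Hypothetical in-memory stand-in for DataFrame (not datafusion-python code)."""

    def __init__(self, rows: List[Row]) -> None:
        # Copy the rows so every frame owns its data.
        self.rows = [dict(row) for row in rows]

    def with_columns(self, **named_exprs: Expr) -> "ToyFrame":
        """Add or replace several columns in a single pass."""
        new_rows = []
        for row in self.rows:
            updated = dict(row)
            for name, expr in named_exprs.items():
                # Evaluate against the original row, so new columns
                # cannot observe each other within the same call.
                updated[name] = expr(row)
            new_rows.append(updated)
        return ToyFrame(new_rows)

    def with_column(self, name: str, expr: Expr) -> "ToyFrame":
        """Single-column form kept for compatibility; delegates to with_columns."""
        return self.with_columns(**{name: expr})


df = ToyFrame([{"a": 1}, {"a": 2}])
doubled = df.with_column("b", lambda row: row["a"] * 2)
```

With this shape, the single-column method carries no logic of its own, so the two entry points cannot drift apart.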
Asof joins are pending: https://github.com/apache/datafusion/issues/318
Some APIs feel a bit unintuitive; I think Polars has really excelled in this area. My suggestion is that we reuse some of those APIs or take some inspiration from them. Changes I am proposing (I am happy to work on these areas, especially with datafusion-ray becoming a thing):
DataFrame.cache() -> DataFrame
===>DataFrame.collect() -> DataFrame
DataFrame.collect() -> list[pyarrow.RecordBatch]
===>DataFrame.to_batches() -> list[pyarrow.RecordBatch]
DataFrame.join
===>DataFrame.join(right: DataFrame, on: str | sequence[str] | None, left_on: str | sequence[str] | None, right_on: str | sequence[str] | None)
DataFrame.schema -> pyarrow.Schema
===>DataFrame.schema -> datafusion.Schema
Map Rust arrow types to datafusion-py types
DataFrame.with_column
===>DataFrame.with_columns
Allow multiple inputs as exprs or key value pairs
DataFrame.with_column_renamed
===>DataFrame.rename()
a simple rename is clear enough and should allow a dict as input
DataFrame.aggregate
===>DataFrame.group_by().agg()
this feels more natural coming from PySpark/Polars/Pandas
Can remove these:
DataFrame.select_columns
already covered by DataFrame.select
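For the proposed join signature above, one way the on / left_on / right_on arguments could be normalized before being handed to the engine is sketched below. resolve_join_keys and _as_list are hypothetical helper names, and the validation rules (on is exclusive with left_on/right_on, and both key lists must have the same length) are assumptions modelled on how Polars-style APIs commonly behave, not something the proposal spells out.

```python
from typing import List, Sequence, Tuple, Union

Keys = Union[str, Sequence[str], None]


def _as_list(keys: Keys) -> List[str]:
    """Normalize None, a single name, or a sequence of names to a list."""
    if keys is None:
        return []
    if isinstance(keys, str):
        return [keys]
    return list(keys)


def resolve_join_keys(
    on: Keys = None, left_on: Keys = None, right_on: Keys = None
) -> Tuple[List[str], List[str]]:
    """Return (left_keys, right_keys) for a join (hypothetical helper)."""
    if on is not None:
        if left_on is not None or right_on is not None:
            raise ValueError("pass either 'on' or 'left_on'/'right_on', not both")
        keys = _as_list(on)
        if not keys:
            raise ValueError("'on' must name at least one column")
        # Same column names are used on both sides.
        return keys, list(keys)

    left, right = _as_list(left_on), _as_list(right_on)
    if not left or not right:
        raise ValueError("provide 'on', or both 'left_on' and 'right_on'")
    if len(left) != len(right):
        raise ValueError("'left_on' and 'right_on' must have the same length")
    return left, right


same_name = resolve_join_keys(on="id")
different_names = resolve_join_keys(left_on=["a", "b"], right_on=("x", "y"))
```

Accepting a bare string as well as a sequence keeps the common single-key case short while still covering composite keys.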
Missing APIs:
DataFrame.cast
to cast a single column or multiple columns at the top level
DataFrame.drop
to drop columns, instead of writing a very verbose select
DataFrame.fill_null/fill_nan
to fill null or nan values
DataFrame.interpolate
interpolate values per column
DataFrame.head/tail
DataFrame.pivot
DataFrame.unpivot
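As an illustration of why DataFrame.drop is mostly sugar over select, the sketch below computes the column list that a "verbose select" would otherwise have to spell out by hand. columns_after_drop is a hypothetical helper, and raising on unknown column names is an assumed design choice rather than part of the proposal.

```python
from typing import Iterable, List


def columns_after_drop(schema_names: Iterable[str], *to_drop: str) -> List[str]:
    """Return the columns that remain after dropping `to_drop`, in schema order."""
    names = list(schema_names)
    missing = [column for column in to_drop if column not in names]
    if missing:
        # Assumption: fail loudly instead of silently ignoring typos.
        raise KeyError(f"cannot drop unknown columns: {missing}")
    dropped = set(to_drop)
    return [column for column in names if column not in dropped]


remaining = columns_after_drop(["id", "name", "debug_info"], "debug_info")
```

A drop method could then pass the remaining names to the existing select, so no new execution logic would be required.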
Optional but useful:
DataFrame.with_row_idx