RFC: Re-work some DataFrame APIs

ion-elgreco commented 1 month ago

Some APIs feel a bit unintuitive, and I think Polars has really excelled in this area. My suggestion is that we reuse some of those APIs or take some inspiration from them. The changes I am proposing are below (I am happy to work on these areas, especially with datafusion-ray becoming a thing):

[ ] - DataFrame.cache() -> DataFrame ===> DataFrame.collect() -> DataFrame
[ ] - DataFrame.collect() -> list[pyarrow.RecordBatch] ===> DataFrame.to_batches() -> list[pyarrow.RecordBatch]
[x] - DataFrame.join ===> DataFrame.join(right: DataFrame, on: str | sequence[str] | None, left_on: str | sequence[str] | None, right_on: str | sequence[str] | None)
[ ] - DataFrame.schema -> pyarrow.Schema ===> DataFrame.schema -> datafusion.Schema: map Rust Arrow types to datafusion-py types
[x] - DataFrame.with_column ===> DataFrame.with_columns: allow multiple inputs as exprs or key-value pairs
[x] - DataFrame.with_column_renamed ===> DataFrame.rename(): a simple rename is clear enough and should allow a dict as input
[ ] - DataFrame.aggregate ===> DataFrame.group_by().agg() this feels more natural coming from PySpark/Polars/Pandas
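
The proposed join signature accepts either `on` or the `left_on`/`right_on` pair, so the arguments need to be normalized somewhere. A minimal sketch of one way to do that follows. The helper name `resolve_join_keys` and its validation rules are assumptions for illustration (loosely modeled on how Polars treats these arguments), not the datafusion-python implementation.

```python
from __future__ import annotations

from collections.abc import Sequence


def resolve_join_keys(
    on: str | Sequence[str] | None = None,
    left_on: str | Sequence[str] | None = None,
    right_on: str | Sequence[str] | None = None,
) -> tuple[list[str], list[str]]:
    """Hypothetical helper: normalize join arguments to (left_keys, right_keys)."""

    def as_list(value: str | Sequence[str]) -> list[str]:
        # A bare string is a single column name, not a sequence of characters.
        return [value] if isinstance(value, str) else list(value)

    if on is not None:
        if left_on is not None or right_on is not None:
            raise ValueError("pass either `on` or `left_on`/`right_on`, not both")
        keys = as_list(on)
        return keys, list(keys)

    if left_on is None or right_on is None:
        raise ValueError("`left_on` and `right_on` must be given together")

    left_keys, right_keys = as_list(left_on), as_list(right_on)
    if len(left_keys) != len(right_keys):
        raise ValueError("`left_on` and `right_on` must have the same length")
    return left_keys, right_keys
```

With this shape, `on="id"` and `left_on="id", right_on="id"` resolve to the same key lists, which keeps the two calling styles interchangeable.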

Can remove these:

[x] - DataFrame.select_columns: already covered by DataFrame.select

Missing APIs:

[x] - DataFrame.cast: cast one or more columns at the top level
[x] - DataFrame.drop: drop columns, instead of writing a very verbose select
[x] - DataFrame.fill_null/fill_nan: fill null or NaN values
[ ] - DataFrame.interpolate: interpolate values per column
[ ] - Asof join missing in df api?
[x] - Join on (inequality join)
[x] - DataFrame.head/tail
[ ] - DataFrame.pivot
[ ] - DataFrame.unpivot
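
To illustrate why `DataFrame.drop` saves a verbose select: without it, the caller has to list every column that should remain. A rough sketch of that bookkeeping is below. The helper `columns_after_drop` is hypothetical and works on plain column names only. It is not part of datafusion-python.

```python
from __future__ import annotations

from collections.abc import Iterable, Sequence


def columns_after_drop(schema_names: Sequence[str], to_drop: Iterable[str]) -> list[str]:
    """Hypothetical helper: the column list a `drop` would hand to `select`."""
    dropped = set(to_drop)
    missing = dropped - set(schema_names)
    if missing:
        # Assumption: unknown columns are an error rather than silently ignored.
        raise KeyError(f"unknown columns: {sorted(missing)}")
    # Preserve the original column order for the remaining columns.
    return [name for name in schema_names if name not in dropped]
```

Whether dropping an unknown column should raise or be ignored is a design choice; the sketch raises so that typos surface early.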

Optional but useful:

[ ] - DataFrame.with_row_idx

emgeee commented 1 month ago

These proposals generally sound good to me. I do think care should be taken around the first two points, since the DataFrame cache() and collect() methods mirror those of the underlying Rust library, and renaming them at the Python level would be immensely confusing for those coming from the Rust library or those seeking to better understand the Python layer.

The other suggestion I might add is to keep DataFrame.with_column() but make it a simple wrapper around DataFrame.with_columns().
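
The wrapper idea can be sketched with a toy class. `SketchFrame` below is a hypothetical stand-in that stores expressions as plain strings. It only shows the delegation pattern and does not reflect how datafusion-python represents expressions or plans.

```python
from __future__ import annotations


class SketchFrame:
    """Toy stand-in for a DataFrame; maps a column name to an expression string."""

    def __init__(self, columns: dict[str, str] | None = None) -> None:
        self.columns = dict(columns or {})

    def with_columns(self, **named_exprs: str) -> SketchFrame:
        # Return a new frame; a repeated name overwrites the earlier column.
        return SketchFrame({**self.columns, **named_exprs})

    def with_column(self, name: str, expr: str) -> SketchFrame:
        # Kept for backward compatibility: simply delegates to with_columns.
        return self.with_columns(**{name: expr})
```

Keeping the single-column method as a thin delegate means existing callers keep working while all of the logic lives in one place.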

ion-elgreco commented 2 weeks ago

Asof joins are pending: https://github.com/apache/datafusion/issues/318

apache / datafusion-python

RFC: Re-work some DataFrame APIs #875