apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

RFC: Re-work some DataFrame APIs #875

Open ion-elgreco opened 1 month ago

ion-elgreco commented 1 month ago

Some APIs feel a bit unintuitive, and I think Polars has really excelled in this area. My suggestion is that we reuse some of those APIs or take inspiration from them. These are the changes I am proposing (I am happy to work on these areas, especially with datafusion-ray becoming a thing):

Can remove these:

Missing APIs:

Optional but useful:

emgeee commented 1 month ago

These proposals generally sound good to me. I do think care should be taken with the first two points: the DataFrame cache() and collect() methods mirror the underlying Rust library, and renaming them at the Python level would be immensely confusing for those coming from the Rust library or those seeking to better understand the Python layer.

The other suggestion I might add is to keep DataFrame.with_column() but make it a simple wrapper around DataFrame.with_columns().
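
To make the wrapper idea concrete, here is a minimal sketch in plain Python. `SketchFrame` and its `to_pydict()` helper are hypothetical stand-ins invented for illustration; this is not the datafusion-python class, and the real signatures (which take expressions rather than lists) may differ. The only point is that the single-column method delegates to the multi-column one, so the two cannot drift apart.

```python
from __future__ import annotations

from typing import Any


class SketchFrame:
    """Hypothetical stand-in for a DataFrame; NOT the datafusion-python class."""

    def __init__(self, columns: dict[str, list[Any]]) -> None:
        # Copy so that each frame owns its own column mapping.
        self._columns = dict(columns)

    def with_columns(self, **named: list[Any]) -> SketchFrame:
        """Return a new frame with every given column added or replaced."""
        return SketchFrame({**self._columns, **named})

    def with_column(self, name: str, values: list[Any]) -> SketchFrame:
        """Single-column convenience method that only delegates."""
        return self.with_columns(**{name: values})

    def to_pydict(self) -> dict[str, list[Any]]:
        """Expose the columns as a plain dict for inspection."""
        return dict(self._columns)


df = SketchFrame({"a": [1, 2, 3]})
one = df.with_column("b", [10, 20, 30])
many = df.with_columns(b=[10, 20, 30], c=[0, 0, 0])
```

Keeping `with_column()` this way would preserve the existing call sites while all of the actual logic lives in `with_columns()`.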

ion-elgreco commented 2 weeks ago

Asof joins are pending: https://github.com/apache/datafusion/issues/318