apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0
320 stars 63 forks source link

Default window frames to not match PostgreSQL #688

Open timsaucer opened 1 month ago

timsaucer commented 1 month ago

Describe the bug When no window frame is specified in the python implementation, we default to unbounded preceeding to current row. If we are to follow PostgreSQL implementation then we should set this value when order_by is specified and otherwise default to unbounded preceeding to unbounded following.

To Reproduce

from datafusion import SessionContext, WindowFrame, col, lit, functions as F
import pyarrow as pa

ctx = SessionContext()

# create a RecordBatch and a new DataFrame from it
batch = pa.RecordBatch.from_arrays(
    [pa.array([1.0, 10.0, 20.0])],
    names=["a"],
)

df = ctx.create_dataframe([[batch]])

window_frame = WindowFrame("rows", None, None)

df = df.select(col("a"), F.window("avg", [col("a")]).alias('no_frame'), F.window("avg", [col("a")], window_frame=window_frame).alias('with_frame'))

df.show()

Produces:

DataFrame()
+------+--------------------+--------------------+
| a    | no_frame           | with_frame         |
+------+--------------------+--------------------+
| 1.0  | 1.0                | 10.333333333333334 |
| 10.0 | 5.5                | 10.333333333333334 |
| 20.0 | 10.333333333333334 | 10.333333333333334 |
+------+--------------------+--------------------+

Expected behavior When order_by is not specified, default to unbounded preceeding to unbounded following.

Additional context The offending line of code appears to be here:

https://github.com/apache/datafusion-python/blob/main/src/functions.rs#L230

timsaucer commented 1 month ago

This seems like a very easy fix. I think we just match on order_by.is_some() and pick the appropriate default. To close, I think at a minimum we would want (1) a unit test to cover both options and (2) text in the window function that describes this behavior.

timsaucer commented 1 month ago

Based on recommendation on discord, we may want to use WindowFrame::new() which has the logic already for checking if order_by exists.

timsaucer commented 1 month ago

From @Michael-J-Ward

Doing a little archaelogy on that:

This is the PR where window_frame switched from None to WindowFrame::new(order_by.is_some());

The PR discussion has some talk on test regressions related to it.

https://github.com/apache/datafusion-python/pull/115/files#diff-676f1190ad1b28e57e9eac75de61cccc59420615e3da10f91a493fb48d0711ff