apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0
320 stars 63 forks source link

Display of built in window functions do not work with struct elements #647

Closed timsaucer closed 2 months ago

timsaucer commented 2 months ago

Describe the bug When attempting to call show() on a DataFrame that contains a built in window function on a column that has struct elements, it produces the error Compute error: concat requires input of at least one array. However other functions such as count() do not have issues. I have less experience with DataFusion, so I just expected count() to do a full evaluation like it does in pyspark, so it's possible that my assumption is incorrect in that having any bearing on this error.

To Reproduce This minimal example can reproduce the window function working properly on a simple element type and failing with a very simple struct.

import pyarrow as pa
from datafusion import SessionContext
import datafusion.functions as F

# taken from datafusion/tests/test_dataframe.py
def struct_df():
    ctx = SessionContext()

    # create a RecordBatch and a new DataFrame from it
    batch = pa.RecordBatch.from_arrays(
        [pa.array([{"c": 1}, {"c": 2}, {"c": 3}]), pa.array([4, 5, 6])],
        names=["a", "b"],
    )

    return ctx.create_dataframe([[batch]])

df = struct_df()

df.show()

df.select(F.col("a"), F.col("b"), F.window("lag", [F.col("b")]).alias("lag_b")).show()

print("Calling count on lag a: ", df.select(F.col("a"), F.col("b"), F.window("lag", [F.col("a")]).alias("lag_a")).count())

df.select(F.col("a"), F.col("b"), F.window("lag", [F.col("a")]).alias("lag_a")).show()

Produces the following output:

DataFrame()
+--------+---+
| a      | b |
+--------+---+
| {c: 1} | 4 |
| {c: 2} | 5 |
| {c: 3} | 6 |
+--------+---+
DataFrame()
+--------+---+-------+
| a      | b | lag_b |
+--------+---+-------+
| {c: 1} | 4 |       |
| {c: 2} | 5 | 4     |
| {c: 3} | 6 | 5     |
+--------+---+-------+
Calling count on lag a:  3
Traceback (most recent call last):
  File "/Users/tsaucer/src/arrow-datafusion-python/example_lag_struct.py", line 25, in <module>
    df.select(F.col("a"), F.col("b"), F.window("lag", [F.col("a")]).alias("lag_a")).show()
Exception: Arrow error: Compute error: concat requires input of at least one array

In searching the web there was a similar error thrown that this old MR resolved in sort operations: https://github.com/apache/arrow/pull/9275/files#diff-3ee8e6ac2472badc7bb448c360f56ed60f06a787d1f45ea589d9e213eaf2ae82

Expected behavior Calling show() on a window function with a struct column type should operate similar to simple column types.

Additional context I'm willing to work on this myself, but I'm not familiar with the internals of the plan execution. I've looked around myself to see if I can find anything obvious, but nothing is jumping out at me. If you can provide any directions or pointers, I would appreciate it.

timsaucer commented 2 months ago

Closing in favor of https://github.com/apache/datafusion/issues/10328