apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

Expose named_struct in python #692

Closed timsaucer closed 1 month ago

timsaucer commented 1 month ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, the only way to create a struct of expressions is datafusion.functions.struct, which assigns fixed field names c0, c1, and so on. These positional names are difficult to work with. The Rust implementation has a named_struct function that would serve the purpose.

Describe the solution you'd like

In an ideal world, the name of each field in a struct would come from the name of the expression that produced it. It would be great to be able to write something like:

df = df.with_column("d", F.struct(col("a"), col("b"), col("c")))

And then the struct would contain field names a, b, and c.
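To make the difference between the two naming schemes concrete, here is a minimal pure-Python sketch. It does not call any DataFusion API: the helpers `positional_field_names` and `build_struct_rows` are hypothetical names invented for this illustration, and plain dicts stand in for struct values. It uses the same three columns as the example in this issue.

```python
from __future__ import annotations

from typing import Sequence


def positional_field_names(count: int) -> list[str]:
    """Model the current behavior: fields are named c0, c1, ..."""
    return [f"c{i}" for i in range(count)]


def build_struct_rows(
    columns: dict[str, Sequence[int]], use_column_names: bool
) -> list[dict[str, int]]:
    """Build one dict per row, standing in for a struct value.

    Hypothetical helper for illustration only, not part of datafusion.
    """
    if use_column_names:
        # Requested behavior: field names come from the source columns.
        names = list(columns)
    else:
        # Current behavior: field names are positional.
        names = positional_field_names(len(columns))
    # zip(*values) transposes the column-oriented data into rows.
    return [dict(zip(names, row)) for row in zip(*columns.values())]


columns = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}

current = build_struct_rows(columns, use_column_names=False)
requested = build_struct_rows(columns, use_column_names=True)

print(current[0])    # {'c0': 1, 'c1': 4, 'c2': 7}
print(requested[0])  # {'a': 1, 'b': 4, 'c': 7}
```

The values are identical in both cases; only the keys change, which is why a downstream consumer has to know the argument order to interpret the positional form.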

From a brief look at the code, this may not be simple to implement. If it is not feasible, I would at least like the named_struct function to be exposed in the Python bindings.

Describe alternatives you've considered

None beyond the two options described above.

Additional context

Minimal example showing the current state:

from datafusion import SessionContext, col, functions as F
import pyarrow as pa

ctx = SessionContext()

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])],
    names=["a", "b", "c"],
)

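# Wrap the batch in a single partition and build a DataFrame from it.
df = ctx.create_dataframe([[batch]])

# F.struct takes only expressions, so the field names cannot be chosen here.
df = df.with_column("d", F.struct(col("a"), col("b"), col("c")))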

df.show()

This produces:

DataFrame()
+---+---+---+-----------------------+
| a | b | c | d                     |
+---+---+---+-----------------------+
| 1 | 4 | 7 | {c0: 1, c1: 4, c2: 7} |
| 2 | 5 | 8 | {c0: 2, c1: 5, c2: 8} |
| 3 | 6 | 9 | {c0: 3, c1: 6, c2: 9} |
+---+---+---+-----------------------+