apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

working with a struct field inside of a udf fails with "Exception: type_coercion caused by Error during planning: Coercion from ..." #542

Closed by dlovell 6 months ago

dlovell commented 7 months ago

Describe the bug

Working with a struct field inside of a UDF fails unless all the struct fields are of type string.

To Reproduce

import pandas as pd
import pyarrow.compute as pc
import toolz
from datafusion import (
    SessionContext,
    column,
    functions as f,
    udf,
)

def make_df(n=30):
    return pd.DataFrame(
        {
            "a": pd.date_range(start="2020-01-01", freq="M", periods=n),
            "b": range(n),
            "c": pd.Series(range(n)).add(0.1),
            "d": pd.Series(range(n)).astype(str),
        }
    )
    # ).astype(str)
    # if all struct fields are str type, the failure does not occur

field_name = "c0"
col_name = "bcd"

ctx = SessionContext()
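# Iterating over the string "bcd" yields the column names "b", "c" and "d";
# the resulting struct fields are named c0, c1 and c2.
t = ctx.from_pandas(make_df(), "t").select(
    column("a"),
    f.struct(*(column(c) for c in col_name)).alias(col_name),
)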
my_udf = udf(
    toolz.curry(pc.struct_field, indices=field_name),
    input_types=[t.schema().field(col_name).type],
    return_type=t.schema().field(col_name).type.field(field_name).type,
    volatility="volatile",
    name="extract_field",
)
ctx.register_udf(my_udf)
t.select(my_udf(column(col_name)))
"""
Exception: type_coercion
caused by
Error during planning: Coercion from [Struct([Field { name: "c0", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])] to the signature Exact([Struct([Field { name: "c0", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])]) failed.
"""

Expected behavior

I would expect no failure to occur, as is the case if you first cast all the data to type str.

Additional context

Maybe related to #541.

The reason I'm trying to pack multiple columns into a single struct column is so that I can simulate running a UDAF that accepts multiple columns, which does not currently seem to be possible.

dlovell commented 7 months ago

The underlying issue is probably https://github.com/apache/arrow-datafusion/issues/8118

dlovell commented 6 months ago

This is fixed in 34.0.0.
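If you want to guard against running on an older release, something like the following could work. It is a rough sketch and assumes that "34.0.0" refers to the version of the `datafusion` package as installed from PyPI; the helper names are hypothetical and not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: the fix ships in the "datafusion" distribution from 34.0.0 on.
FIXED_IN_MAJOR = 34


def includes_fix(version_string: str) -> bool:
    """Return True if the major version is at least FIXED_IN_MAJOR."""
    major = int(version_string.split(".")[0])
    return major >= FIXED_IN_MAJOR


def installed_datafusion_includes_fix() -> bool:
    """Check the installed datafusion package; False if it is not installed."""
    try:
        return includes_fix(version("datafusion"))
    except PackageNotFoundError:
        return False
```

Calling `installed_datafusion_includes_fix()` before registering a UDF that takes a mixed-type struct gives an early, readable signal instead of the planning error above.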