apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

to_*() on a dataframe with a struct col fails with "ArrowInvalid: Schema at index 0 was different" #541

Closed dlovell closed 6 months ago

dlovell commented 7 months ago

Describe the bug: calling any to_*() method on a DataFrame with a struct column fails unless all the struct fields are of type string.

To Reproduce: steps to reproduce the behavior:

import pandas as pd
from datafusion import (
    column,
    functions as f,
    SessionContext,
)

def make_df(n=30):
    return pd.DataFrame(
        {
            "a": pd.date_range(start="2020-01-01", freq="M", periods=n),
            "b": range(n),
            "c": pd.Series(range(n)).add(0.1),
            "d": pd.Series(range(n)).astype(str),
        }
    )
    # Ending the expression above with `).astype(str)` instead makes every
    # struct field a str, and the failure does not occur.

ctx = SessionContext()
t = ctx.from_pandas(make_df(), "t").select(
    column("a"),
    f.struct(column("b"), column("c"), column("d")).alias("bcd"),
)
# this fails, as do all invocations of to_* methods
t.to_pandas()

Expected behavior: no failure should occur, as is already the case if you first cast all the data to type str.

dlovell commented 7 months ago

The underlying issue is probably https://github.com/apache/arrow-datafusion/issues/8118

iammax commented 7 months ago

I was having a similar issue with version 31.0.0. A small example snippet:

import datafusion as dfu
ctx = dfu.SessionContext()
ctx.register_csv('delta', 'test.csv')
result = ctx.sql('SELECT col1, COUNT(DISTINCT col2) FROM delta GROUP BY col1')

This assigns a datafusion.DataFrame object to result, as expected. I can see it holds the correct values by printing it in a terminal or Jupyter. However, if I call result.to_polars() (or any other to_* method), I get the same error as in the original post:

ArrowInvalid: Schema at index 0 was different: 
col1: int64
COUNT(DISTINCT delta.col2): int64
vs
delta.col1: int64
COUNT(DISTINCT delta.col2): int64

However, it works in version 33.0.0 (I think that's the current version), so I assume there was a fix.

dlovell commented 6 months ago

This is fixed in 34.0.0