Closed AlexanderVR closed 1 year ago
@AlexanderVR I don't totally follow this-- why did the Pandas stuff come out too?
@jwills because I don't understand why it was there in the first place. Hence my question :-D
The line con.execute("create table {{...}} as select * from df") will already convert a DuckdbPyRelation, pandas DataFrame, or pyarrow Table into the duckdb table, whatever the df variable is. Even a Polars DataFrame. Isn't this all we want when materializing a python model?
The old logic first checked whether it was passed a DuckdbPyRelation. If it was, it would convert it to pandas or pyarrow, whichever was available. Then the create table {{...}} as select * from df statement takes this pyarrow Table or pandas DataFrame and converts it back into duckdb format!
Ah, I think I get you-- you're saying that there is no need for the checks, because there isn't really anything for them to do; either the conversion to a DuckDB table works or it doesn't?
Yes, that is correct: either the conversion works or it doesn't. But my larger question was why the choice of duckdbPyRelation -> pandas/pyarrow -> duckdb table instead of duckdbPyRelation -> duckdb table directly?
The former conversions will have issues with larger-than-memory datasets. They also cause issues with Union data types because duckdb will not convert Unions to/from arrow format https://github.com/duckdb/duckdb/issues/1742
yeah these are all good points and I don't have a good answer; going to merge this shortly. Thank you!
I expected this example to work, but it didn't:
def model(dbt, session):
    dbt.config(materialized = "table")
    df = dbt.ref("my_seed")
    return df
I got a Runtime Error:
Python model failed:
Invalid Input Error: Python Object "df" of type "DuckDBPyRelation" found on line "/.../tmpf2.py:76" not suitable for replacement scans.
Make sure that "df" is either a pandas.DataFrame, or pyarrow Table, Dataset, RecordBatchReader, or Scanner
My return statement needed to be this instead:
return df.df()
or this:
return df.arrow()
I wonder if this is why that isinstance(df, duckdb.DuckDBPyRelation) logic was in there?
Ah yeah I bet you’re right @dbeatty10– but it seems like we can/should just treat DuckDBPyRelations as valid objects (albeit ones that need to be handled a bit differently)
Multi-threaded writing of python models returning pyarrow.Table fails due to an upstream bug: https://github.com/duckdb/duckdb/issues/6584
pyarrow.dataset on materialization.
pyarrow.Table/pandas.DataFrame and back. This will also enable polars or any other dataframe-like models to work within dbt-duckdb.
@jwills @tomsej is there some subtle detail I'm missing about why this round-trip logic was here in the first place? Some old bug with duckdb not finding bound DuckdbPyRelation variables?
Related: https://github.com/duckdb/duckdb/issues/5038 was fixed as of duckdb 0.6.0 so might want to go this route instead of relying on duckdb's magic variable binding.