duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0

fix: handle multithreaded writes of pyarrow models #114

Closed AlexanderVR closed 1 year ago

AlexanderVR commented 1 year ago

Multi-threaded writing of Python models that return a pyarrow.Table fails due to an upstream bug: https://github.com/duckdb/duckdb/issues/6584

@jwills @tomsej is there some subtle detail I'm missing about why this round-trip logic was here in the first place? Some old bug with DuckDB not finding bound DuckDBPyRelation variables?

Related: https://github.com/duckdb/duckdb/issues/5038 was fixed as of DuckDB 0.6.0, so we might want to go this route instead of relying on DuckDB's magic variable binding.
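
For reference, the explicit alternative to the magic variable binding would be registering the Python object under a name before querying it. A minimal sketch (the view and table names here are made up for illustration):

import duckdb
import pyarrow as pa

con = duckdb.connect()
tbl = pa.table({"id": [1, 2, 3]})

# Register the Python object under an explicit name instead of relying on
# replacement scans resolving the local variable "tbl" by name.
con.register("python_df", tbl)
con.execute("create table my_model as select * from python_df")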

jwills commented 1 year ago

@AlexanderVR I don't totally follow this-- why did the Pandas stuff come out too?

AlexanderVR commented 1 year ago

@jwills because I don't understand why it was there in the first place. Hence my question :-D

The line con.execute("create table {{...}} as select * from df") will already convert a DuckDBPyRelation, pandas DataFrame, or pyarrow Table into the DuckDB table -- whatever the df variable is. Even a Polars DataFrame. Isn't this all we want when materializing a Python model?

The old logic first checked whether it was passed a DuckDBPyRelation. If it was, it would convert it to pandas or pyarrow, whichever was available. Then create table {{...}} as select * from df takes that pyarrow Table or pandas DataFrame and converts it back into DuckDB format!
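
A minimal sketch of the replacement-scan behavior being discussed, with made-up data (whether a DuckDBPyRelation is accepted on this path is exactly the question that comes up later in the thread):

import duckdb
import pandas as pd

con = duckdb.connect()

# "df" is just a local Python variable; DuckDB's replacement scans resolve the
# name in the SQL string to this object when no table called "df" exists.
df = pd.DataFrame({"id": [1, 2, 3]})
con.execute("create table my_model as select * from df")

print(con.execute("select count(*) from my_model").fetchall())  # [(3,)]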

jwills commented 1 year ago

Ah, I think I get you-- you're saying that there is no need for the checks, because there isn't really anything for them to do; either the conversion to a DuckDB table works or it doesn't?

AlexanderVR commented 1 year ago

Yes, that is correct: either the conversion works or it doesn't. But my larger question was why the choice of DuckDBPyRelation -> pandas/pyarrow -> DuckDB table instead of DuckDBPyRelation -> DuckDB table directly?

The former conversions will have issues with larger-than-memory datasets. They also cause issues with Union data types, because DuckDB will not convert Unions to/from Arrow format: https://github.com/duckdb/duckdb/issues/1742
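
For illustration, a sketch of the direct DuckDBPyRelation -> DuckDB table path using the relational API's create method (the table names are made up):

import duckdb

con = duckdb.connect()
con.execute("create table my_seed as select * from range(5) t(id)")

rel = con.sql("select id * 2 as doubled from my_seed")  # a DuckDBPyRelation

# Materialize the relation directly inside DuckDB; no pandas/pyarrow round
# trip, so memory and Arrow type-conversion limits don't come into play.
rel.create("my_model")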

jwills commented 1 year ago

Yeah, these are all good points and I don't have a good answer; going to merge this shortly. Thank you!

dbeatty10 commented 1 year ago

I expected this example to work, but it didn't:

def model(dbt, session):
    dbt.config(materialized = "table")

    df = dbt.ref("my_seed")

    return df

I got a Runtime Error:

Python model failed:
Invalid Input Error: Python Object "df" of type "DuckDBPyRelation" found on line "/.../tmpf2.py:76" not suitable for replacement scans.
Make sure that "df" is either a pandas.DataFrame, or pyarrow Table, Dataset, RecordBatchReader, or Scanner

My return statement needed to be this instead:

    return df.df()

or this:

    return df.arrow()

I wonder if this is why that isinstance(df, duckdb.DuckDBPyRelation) logic was in there?

jwills commented 1 year ago

Ah yeah, I bet you're right @dbeatty10 -- but it seems like we can/should just treat DuckDBPyRelations as valid objects (albeit ones that need to be handled a bit differently).
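
A rough sketch of what that special-casing might look like (the materialize helper and its signature are hypothetical, not the adapter's actual code):

import duckdb

def materialize(con: duckdb.DuckDBPyConnection, df, table_name: str) -> None:
    # Hypothetical helper: keep DuckDBPyRelations inside DuckDB, and let
    # pandas/pyarrow/Polars objects go through the replacement-scan path.
    if isinstance(df, duckdb.DuckDBPyRelation):
        df.create(table_name)
    else:
        con.execute(f"create table {table_name} as select * from df")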