duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0
816 stars 69 forks source link

`pyarrow` is required for Python models #155

Closed dbeatty10 closed 1 year ago

dbeatty10 commented 1 year ago

Small problem

Using a fresh install of dbt-duckdb in a virtual environment, I'm getting an error with this simple model:

models/simple_model.py

def model(dbt, session):
    dbt.config(
        materialized = "table"
    )

    df = dbt.ref("my_seed")

    return df.df()
20:34:13  Runtime Error in model my_python_model (models/my_python_model.py)
20:34:13    Python model failed:
20:34:13    local variable 'pyarrow' referenced before assignment
20:34:13  

It's easily solved if I install pyarrow:

python -m pip install pyarrow

Some ideas for solutions

What do you think about including pyarrow in install_requires ?

Or bringing back an error message similar to this?

raise Exception("No pyarrow library available.")

Even better if the error message tells a simple fella like me what to copy-n-paste to get things on the right track:

raise Exception("No pyarrow library available. Try `python -m pip install pyarrow`")
jwills commented 1 year ago

Hrm that feels wrong; I must have inadvertently let a pyarrow dep slip in to the system. Will take a look when I’m back at the keyboard.

dbeatty10 commented 1 year ago

I'm guessing it was always expected that a user would need either pandas or pyarrow for dbt python models, but it looks like all the associated logic and error messages were removed in https://github.com/jwills/dbt-duckdb/pull/114.

pyarrow dependency

If a user doesn't have pyarrow installed, I believe they will still hit this line of code (which requires pyarrow):

        if isinstance(df, pyarrow.Table):

Method to reproduce

  1. Create models/simple_model.py as described above
  2. Uninstall pyarrow from your environment. e.g.,
    python -m pip uninstall pyarrow
  3. Run dbt
    dbt run
jwills commented 1 year ago

yup, I think you're right; like, the old logic was also wrong in that it didn't handle all of the other DataFrame-like things that DuckDB can consume (and tbh, trying to keep up with all of them-- Dataset, RecordBatchReader, Scanner, Polars DataFrames, etc.-- is a fool's errand), but we clearly erred too far on the side of laissez-faire, do-whatever-you-want w/no good error messages here.

Gonna noodle a bit on how best to clean it up but am very open to suggestions here; I also have some ideas related to #143 percolating here.

jwills commented 1 year ago

@dbeatty10 is there any world in which the import dependencies that a python model needed would become a compile-time thing? Like I'm kind of wondering if there is a world where I can infer at compile time what modules you're importing and/or which implicit deps you have based on your Python model code (pandas, pyarrow, polars, etc.) and then I can verify that those deps are present in your environment Like I wonder if I could just do that and see how it looked.