duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0

python packages for models dbt: no module named 'pandas' #229

Closed ReneTC closed 11 months ago

ReneTC commented 11 months ago

I'm trying to follow a Python model guide so I can write a transformation in Python instead of SQL.

For this reason I made a very simple model in Python:

import pandas as pd


def model(dbt, session):
    dbt.config(packages=["pandas"])
    data = dbt.ref("train_test_split")
    data['test_column'] = 1

    return data

However, when I run dbt:run I get:

Python model failed:
No module named 'pandas'

I see this is because the Python interpreter being used is the one in transformers/dbt-duckdb/venv/bin/python. Is it possible to install pandas there somehow? I also read that it's possible to use module_paths, but I couldn't get that to work, and I worry about reproducibility for my shared project with that solution.

I would love your input on this.
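A quick way to confirm the diagnosis above is to ask the interpreter itself where it lives and whether it can see the package. The following is only a minimal sketch using the standard library; the helper name describe_environment is made up for illustration and is not part of dbt, dbt-duckdb, or Meltano.

```python
import importlib.util
import sys


def describe_environment(package: str) -> dict:
    """Report which interpreter is running and whether `package` is importable.

    `describe_environment` is a hypothetical helper for illustration only.
    """
    # find_spec returns None when a top-level module cannot be located.
    spec = importlib.util.find_spec(package)
    return {
        "interpreter": sys.executable,
        "package": package,
        "importable": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    for key, value in describe_environment("pandas").items():
        print(f"{key}: {value}")
```

Running this with the interpreter at transformers/dbt-duckdb/venv/bin/python should show whether pandas is visible from the virtual environment that dbt actually runs in.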

jwills commented 11 months ago

Ah, that is a Meltano question-- I don't know how to install other Python packages into the virtual environment Meltano creates.

To make it work in the current system, you could have a local plugin that ensures that pandas (and anything else you need) is installed in the venv at the very start of the dbt-duckdb run. I'm thinking something like this (generated with some help from GPT-4):

Create a directory in the dbt project named "python_modules", add it to the module_paths argument in the profile, and define a module inside of it named e.g. "do_install.py" that looks like this:

import subprocess
import sys

from duckdb import DuckDBPyConnection

from dbt.adapters.duckdb.plugins import BasePlugin

class Plugin(BasePlugin):
    def _install(self, package: str):
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

    def configure_connection(self, conn: DuckDBPyConnection):
        self._install("pandas")

...and then add do_install to the list of plugins in the profile configuration.
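One possible refinement of the plugin idea is to skip the pip call when the package is already importable, so that each run does not invoke pip again. The sketch below is an assumption-laden illustration, not part of dbt-duckdb: the names pip_install and ensure_installed are hypothetical, and the check assumes the import name matches the pip package name (true for pandas, not for every package).

```python
import importlib.util
import subprocess
import sys
from typing import Callable, Iterable, List


def pip_install(package: str) -> None:
    """Install `package` into the environment of the running interpreter."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])


def ensure_installed(
    packages: Iterable[str],
    install: Callable[[str], None] = pip_install,
) -> List[str]:
    """Install only the packages that cannot be imported yet.

    Hypothetical helper: assumes import name == pip package name.
    Returns the list of packages that needed installing.
    """
    installed = []
    for package in packages:
        # find_spec returns None when the top-level module is not found.
        if importlib.util.find_spec(package) is None:
            install(package)
            installed.append(package)
    return installed
```

Inside the plugin above, configure_connection could then call ensure_installed(["pandas"]) in place of self._install("pandas"). The install parameter exists only so the function can be exercised without touching pip.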

ReneTC commented 11 months ago

Thanks @jwills, I appreciate your help a lot. I've been trying to follow along, but I'm not sure I'm doing it right (I still get the error).

I think I need to tell dbt to execute the updates you suggested? Do you know how you would run dbt with your new changes?

Added this to my dbt profiles:

meltano:
  module_paths: ${MELTANO_PROJECT_ROOT}/transform/python_modules
  plugins: do_install

MartinMikkelsen commented 11 months ago

You can add packages to the pip_url in your meltano.yml, which could fix your issue. Something like this:

  transformers:
  - name: dbt-duckdb
    variant: jwills
    pip_url: dbt-core dbt-duckdb pandas
    config:
      path: <your-path>

ReneTC commented 11 months ago

Works like a charm, @MartinMikkelsen. Thanks!