duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0
816 stars 69 forks source link

Possible stale DuckDBPyConnection? #203

Closed ghukill closed 1 year ago

ghukill commented 1 year ago

I've hesitated submitting an issue, because I think my use case is pretty unusual, but figured it couldn't hurt; please feel free to disregard and/or close this out.

My context: running DBT transformations with an ephemeral DuckDB database, using the fairly new programmatic invocation.

The issue: the first invocation works perfectly, with a dynamic filepath for path in the profiles.yml. But the second invocation from the same python process, does not work. The error I've been investigating is no binding for the catalog:

DbtRuntimeError: Runtime Error
  Binder Error: Catalog "tmp_bc05271b-d5e8-4d03-a17a-cd2ac9058e35" does not exist!

The temporary DuckDB database are named via UUIDs and are cleaned up after each DBT run. Therefore a duckdb database file is removed after a successful run. I have confirmed this. But on the second invocation -- from the same python process -- I can see after the error that a DuckDB database file using the original UUID has been recreated.

I should note: if restarting the script / shell / python process, it runs fine each time. It's specifically only when the DBTRunner is invoked twice, sequentially from the same process.

My hunch is that a thread specific DuckDBPyConnection might be getting reused that has the first invocation's context? But while inspecting objects and threads in a debugging context, having a hard time finding the original UUID anywhere that might suggest why or how it's still getting used. Everywhere, even in debuggin DBT output, I'm seeing the correct second invocation UUID.

Here is a small subset snippet of my code if helpful...

profiles.yml:

dbt_core:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "{{ var('tmp_duck_db_filepath') }}"
      schema: dbt_core
      threads: 4

invoking DBT:

self.dbt = dbtRunner()
cli_args = [
    "build",
    "--vars",
    json.dumps(
        {
            "tmp_duck_db": self.duckdb_database,
            "tmp_duck_db_filepath": self.duckdb_filepath,
        }
    ),
]
cmd_result: dbtRunnerResult = self.dbt.invoke(
    cli_args,
    log_level="debug",
    project_dir=self.dbt_project_dir,
    profiles_dir=self.dbt_project_dir,
)

Again, I realize it's unusual to a) use the relatively new DBTRunner approach, and b) invoke more than once from the same calling python process, but figured couldn't hurt to share.

And all said, thanks for a great library and adapter! I'm still able to use for extremely handy and good use.

jwills commented 1 year ago

Hey @ghukill, thanks so much for the nice note and bug report! To work around this, you're going to need to add a couple of lines in-between the dbt.invoke calls in order to force resetting the DuckDB connection in between runs.

To start, add this import to the file where you're calling self.dbt.invoke:

from dbt.adapters.duckdb.connections import DuckDBConnectionManager

...and then before you call self.dbt.invoke, you'll need to call this line:

DuckDBConnectionManager.close_all_connections()

I'll see if I can figure out a fix in the meantime so that this workaround isn't necessary.

ghukill commented 1 year ago

Works a peach! Thanks @jwills!

I had sniffed around this at one point, but had incorrectly thought I needed to pass a DBT context to DuckDBConnectionManager to have it correctly identify and close connections. But this works great; able to run multiple, sequential transformations from the same process now. Thanks a bunch!