duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0

Feature Request: Support User Defined Functions #170

Closed amiller-gh closed 1 year ago

amiller-gh commented 1 year ago

As of v0.8.0, duckdb supports a Python Functions API: https://duckdb.org/docs/api/python/function

It would be great to have a supported method for registering user defined functions in dbt-duckdb!

Currently, it's very difficult to register custom functions with dbt-duckdb, because everything runs in a separate thread and each model run gets its own connection. (Please correct me if I'm wrong!)

I've been able to proof-of-concept function registration by using dbt-core to drive our project's DBT runs, and wrapping the existing DuckDBConnectionManager.open method. This functionally gives us an on_connection_open hook to register our custom functions (example below).

Obviously this is not ideal :)

Assuming first-class UDF support would be a valuable feature, I see at least three API design options:

Thoughts? Happy to help implement if this is a wanted addition.

from typing import cast

from dbt.adapters.duckdb.credentials import DuckDBCredentials
from dbt.adapters.duckdb.environments import Environment
from duckdb import DuckDBPyConnection
from duckdb.typing import VARCHAR

def custom_function() -> str:
    return "HELLO DUCKDB"

def register() -> None:
    def initialize_db(cls: Environment, credentials: DuckDBCredentials) -> DuckDBPyConnection:
        # Call through to the original implementation, then register our
        # custom functions on the freshly opened connection.
        connection = old_initialize_db(credentials)
        connection.create_function("custom_function", custom_function, [], VARCHAR)
        return cast(DuckDBPyConnection, connection)

    # Monkeypatch Environment.initialize_db so every new connection picks
    # the custom functions up.
    old_initialize_db = Environment.initialize_db
    Environment.initialize_db = initialize_db  # type: ignore
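
For reference, in our runner the patch gets applied before invoking dbt-core; a sketch, assuming dbt-core 1.5's programmatic dbtRunner entry point:

# Hypothetical usage in a custom runner: patch first, then run dbt.
from dbt.cli.main import dbtRunner

register()
dbtRunner().invoke(["run"])
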
jwills commented 1 year ago

Hey @amiller-gh, pleasure to meet you and thanks for raising this issue. I agree that this is something that we need to provide good support for in dbt-duckdb.

The need for this UDF support came up for me last week when I was taking another crack at #22 to create a mode that would allow dbt-duckdb to emulate another dbt adapter (in my case, dbt-snowflake ) by using sqlglot to transpile any Snowflake-specific SQL into DuckDB-friendly SQL, and then using custom types and UDFs to handle the rest of the functionality from Snowflake that sqlglot does not (can not? should not?) translate into DuckDB. I just pushed my work here onto a branch: https://github.com/jwills/dbt-duckdb/compare/jwills_snowflake_compat?expand=1 (it's not ready yet and there are other problems with this approach that may lead me to take a different tack, but it at least demonstrates most of the major touch points in the code that need to be changed in order to be able to make this work.)

Leaving that stuff aside, the place that I think we should add the support here would be in the initialize_db method defined here: https://github.com/jwills/dbt-duckdb/blob/master/dbt/adapters/duckdb/environments/__init__.py#L53 The good news is that we only need to set this up once per connection; when you run multiple threads in dbt-duckdb, each thread uses a "cursor" (which in DuckDB is just another instance of the same PyConnection class that is returned from duckdb.connect) that is derived from the parent PyConnection instance returned by duckdb.connect and inherits most of its settings, including any attached fsspec filesystems and additional databases. The Python UDFs get inherited as well, so we only need to declare them once, inside of that initialize_db function, for them to be picked up by all of the threads dbt uses. (I verified this in my Snowflake compatibility testing and had a conversation about it in the python channel of the DuckDB Discord.)
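
Here's a minimal standalone illustration of that inheritance behavior, using just the duckdb Python API (nothing dbt-specific):

import duckdb
from duckdb.typing import VARCHAR

def greet() -> str:
    return "HELLO DUCKDB"

# Register the UDF once on the parent connection...
con = duckdb.connect(":memory:")
con.create_function("greet", greet, [], VARCHAR)

# ...and it is visible from a cursor derived from that connection, which is
# exactly how dbt's worker threads see it.
cur = con.cursor()
print(cur.sql("SELECT greet()").fetchone())  # ('HELLO DUCKDB',)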

So given that, the question is just how we should express where the dbt-duckdb code should look for the UDFs to create on the connection object.

We could re-use the DuckDBCredentials class to indicate e.g. a list of modules that have a well-known function defined on them (maybe something like register_udfs(conn: PyConnection)) -- which sounds closest to the macros/*.sql option you described above.

Or we could go all out with a plugin system à la pluggy for defining and registering the UDFs in the modules to include (which would be sort of a hybrid of options 1 and 2 you mentioned?)
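
To make that first option concrete, a module listed in the profile might look something like this (the module name and entry point are just placeholders):

# file: my_project_udfs.py -- referenced from profiles.yml in this sketch
from duckdb import DuckDBPyConnection
from duckdb.typing import VARCHAR

def register_udfs(conn: DuckDBPyConnection) -> None:
    # dbt-duckdb would call this once per parent connection, inside
    # initialize_db, before any worker threads are spawned.
    def greet() -> str:
        return "HELLO DUCKDB"

    conn.create_function("greet", greet, [], VARCHAR)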

I think that one prereq for this is to bite the bullet and support the ability to add a local directory inside of the dbt-duckdb project as a component of sys.path by default (maybe modules, as a sibling of models and macros? But that might be annoying for tab-completion purposes). That would solve the common problem of having some code that you want to re-use across several of your Python models, where packaging it up into a custom Python module that you need to distribute/install separately from your dbt project is too much of a PITA.
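
That piece could be as simple as this sketch (the directory name and lookup logic are placeholders):

import os
import sys

def add_project_modules_to_path(project_root: str) -> None:
    # Hypothetical: prepend <project>/modules to sys.path so code there is
    # importable from Python models and from UDF registration hooks.
    modules_dir = os.path.join(project_root, "modules")
    if os.path.isdir(modules_dir) and modules_dir not in sys.path:
        sys.path.insert(0, modules_dir)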

amiller-gh commented 1 year ago

Edit: Sorry in advance for the novel – I like API design!

Ah very cool re: initialize_db and the PyConnection inheritance – just updated my monkeypatch to use that instead!

Agree with you on the need to bite the bullet and add local directory support of some kind. I've actually already extended my monkeypatched function registration POC to include super simple automatic function file discovery!

A programmatic API is very helpful for module maintainers, or people like me who manage their own DBT runner by calling dbt-core directly, but for your average project maintainer who is doing everything through the CLI, the programmatic API doesn't actually help much. Any implementation that would expose the programmatic API to CLI users very quickly scope-creeps into becoming that on-disk implementation! May as well start there from the get-go.

(Conveniently, any API that's file-system based will also cover most programmatic integration requirements :)

I apologize in advance: my background is in the JavaScript tooling ecosystem, so that's where most of my design inspiration will come from. But I do really like the decorators-as-a-hook API model that tools like pluggy bring – it actually works similarly to some other libraries I maintain in the JS world. Actually, the combination of decorators, lifecycle hooks, and file-system based discovery reminds me a lot of Ember.js' addon system – it's super extensible, easy to integrate with, and scales nicely. It's also easy to add guardrails to keep users from shooting themselves in the foot.

If you do end up going with the on-disk, decorator API route, you can actually do some pretty interesting stuff to keep the feature well-scoped without coding the library into a corner. E.g.:

# file: /dbt/duckdb/custom_function.py

from dbt.adapters.duckdb import hooks
from duckdb import DuckDBPyConnection
from duckdb.typing import BIGINT, VARCHAR

# Functions decorated with hooks.create_function will be auto-discovered by dbt-duckdb at runtime.
# The hook API footprint mirrors the native duckdb create_function, allowing full configuration.
# Optional function type arguments can be provided if mypy is not used, just like the library.
@hooks.create_function('my-custom-func', [BIGINT], VARCHAR, exception_handling='return_null')
def my_custom_func(x: int) -> str:
    return 'HELLO DUCKDB'

# With a decorator library, you can start by rolling out a single hook: create_function.
# As additional use cases come up, you can easily extend the hooks library as needed.
@hooks.after_connect()
def on_connect(connection: DuckDBPyConnection) -> DuckDBPyConnection:
    return connection
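
Under the hood, the hooks module could be little more than a decorator-backed registry that initialize_db drains -- a rough sketch, all names hypothetical:

# Decorators record registrations; the adapter replays them once the
# parent connection exists.
_CREATE_FUNCTION_HOOKS = []

def create_function(name, parameters=None, return_type=None, **kwargs):
    def decorator(fn):
        _CREATE_FUNCTION_HOOKS.append((name, fn, parameters, return_type, kwargs))
        return fn
    return decorator

def apply_hooks(conn):
    # Called by initialize_db right after duckdb.connect().
    for name, fn, parameters, return_type, kwargs in _CREATE_FUNCTION_HOOKS:
        conn.create_function(name, fn, parameters, return_type, **kwargs)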

Alternatively, a whole lot of tooling in the frontend build ecosystem is actually moving toward a "code-as-config" model, which may apply nicely here (e.g. Rollup, Vite, Broccoli.js, Parcel, etc.). This means they basically just say "screw it" to trying to interpret what features a user might want out of a config, and instead just give them a single well-named script file to let them do whatever they want with the pipeline. Now, there's normally some amount of additional add-on / middleware API layered in here to keep the wheels on, but for the most part they forgo auto-discovery or overcomplicated APIs in the name of simplicity and flexibility. It's much simpler to maintain, but a whole lot harder to limit the user's control! In this case, that may mean an API more like:

# file: /duckdb.config.py

from duckdb import DuckDBPyConnection

def config(connection: DuckDBPyConnection) -> DuckDBPyConnection:
    # User can do whatever the heck they want here.
    # Maybe instead of only passing the connection, you pass Environment, or dbt, or whatever.
    return connection
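
And on the adapter side, loading that file is only a few lines -- a hypothetical sketch:

# How the adapter could discover and invoke duckdb.config.py.
import importlib.util
from duckdb import DuckDBPyConnection

def apply_user_config(conn: DuckDBPyConnection, path: str = "duckdb.config.py") -> DuckDBPyConnection:
    spec = importlib.util.spec_from_file_location("duckdb_config", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.config(conn)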

So, I suppose I have three questions to add to the growing API design pile that came from my rambling above:

  1. Do you have any concerns with exposing the raw duckdb connection (or more) to consumers? Technically, the goal here is to allow registration of UDFs, and a whole host of shenanigans can happen if you just expose the entire connection to make that happen. I personally find it easier to open API access up later than it is to roll back an over-permissive API change. Can totally see the argument to the contrary though – I know as a user I'd love carte-blanche access to the connection, even if I break it ;)

  2. If you go forward with the auto-discovered, on-disk API, do you want to claim a namespace that the main DBT project will never use under sys.path for all dbt-duckdb specific components? I could totally see DBT releasing a feature called modules or functions down the line, and stomping all over that folder name in sys.path. It may be better to have everything in a folder called {sys.path}/duckdb, or even have the folder be a sibling to sys.path, to future-proof the feature-set.

  3. How do you imagine these functions will interact with DBT packages? There are a lot of namespacing and execution-order issues to figure out there... I think there's certainly value in first releasing a version of the hooks system that only discovers hooks in the local project, but whatever the design is will likely need to take package scope names into account, right?

jwills commented 1 year ago

@amiller-gh thank you for this, and apologies that it took me a couple of days to digest it and come back to this issue with some thoughts/responses to your questions.

  1. I am temperamentally much more inclined to give users direct access to/control over as much of the logic as possible, so a configuration-as-code approach is the one that is most appealing to me. I also think it's the most practical; as the DuckDB folks keep adding new functionality at a tremendous pace (including, duh, user-defined functions) it's the easiest way for the project to keep up with the needs of the most advanced users w/o needlessly overcomplicating the lives of folks who just want to get work done.

  2. I share your concerns about dbt coming along and stomping on whatever on-disk auto-discovery system I come up with, and (related to 3) I'm worried about the complexity of having multiple modules modify the DuckDBPyConnection in an undefined order. As a concrete example, I'm thinking really hard about updating PR #22 to add a Snowflake emulation mode to dbt-duckdb, but doing that well is likely going to require defining a whole bunch of UDFs that implement functions from Snowflake that aren't transpiled by sqlglot. So immediately, I could be in a situation where the emulation mode would want to add a bunch of UDFs and the user could want to add a bunch of (potentially conflicting) UDFs, and we would want to have a clear way to indicate the relative priority of the implementations in a way that the user can control. Again, opting for config-as-code in item 1 helps solve part of the problem here, but I think we're going to need some general way -- likely in the profiles.yml config -- to specify a list of modules that need to be loaded at startup time (see the sketch after this list).

  3. There are several places where I think plugins make sense in the context of dbt-duckdb in addition to the connection-type events, including: (a) the file formats we support for external models (currently CSV and Parquet, but it was recently made evident that you can use the new spatial stuff to read and write Excel files), and (b) the ability to work with sources/seeds that require a little bit of code execution in order to work correctly, as I started working on with the plugins constructs in the 1.5.x line of dbt-duckdb but haven't really advertised/documented yet because I'm not sure I have it right yet.
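
For the startup-time module loading from item 2, I'm imagining something like this sketch (module names and the register_udfs entry point are placeholders carried over from the discussion above):

import importlib
from typing import List
from duckdb import DuckDBPyConnection

def load_udf_modules(conn: DuckDBPyConnection, module_names: List[str]) -> None:
    # Modules are applied in list order, so the user controls relative
    # priority; duckdb complains if a UDF name is registered twice, so an
    # override would likely need conn.remove_function(name) first.
    for name in module_names:
        module = importlib.import_module(name)
        module.register_udfs(conn)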

In my ideal world, there would be a unified way to handle all of these different plugin-like constructs inside of dbt-duckdb that had the same basic setup for discoverability and implementation.

jwills commented 1 year ago

Came up with a relatively painless revamp of the plugin setup I created for loading external data and generalized it to support configuring the main DuckDBPyConnection; I updated the test_plugins with an example of it in action that I think will work for my purposes (and hopefully yours too!) -- please take a look and let me know what you think!
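
Roughly, a user plugin under this setup would look something like the following simplified sketch (see the test_plugins example for the real thing; the class and hook names here may not match exactly):

# Hypothetical user plugin; names are assumptions, not the confirmed API.
from dbt.adapters.duckdb.plugins import BasePlugin
from duckdb import DuckDBPyConnection
from duckdb.typing import VARCHAR

class Plugin(BasePlugin):
    def configure_connection(self, conn: DuckDBPyConnection) -> None:
        # Register UDFs once on the parent connection; the worker-thread
        # cursors inherit them, per the discussion above.
        def greet() -> str:
            return "HELLO DUCKDB"

        conn.create_function("greet", greet, [], VARCHAR)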