duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0

Create a simple plugin system for loading data from external sources #141

Closed by jwills 1 year ago

jwills commented 1 year ago

Fixes #137 and addresses some longstanding roadmap items, especially making it easier to use dbt-duckdb with data in Iceberg/Delta tables.

The idea here is that there are certain kinds of external sources (e.g., Excel files, Google Sheets, or Iceberg tables) where we need to execute a little bit of custom Python code to transform the data that the source contains into something that DuckDB can consume as a table (like a pandas/polars DataFrame or a PyArrow table/dataset).
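To be clear about what "consume as a table" means on the DuckDB side, registering one of those objects is essentially a one-liner. A minimal illustration (the table and column names here are made up):

```python
import duckdb
import pandas as pd

# Stand-in for data pulled from an external source
# (an Excel file, a Google Sheet, an Iceberg table, ...)
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

con = duckdb.connect()
con.register("my_source", df)  # DuckDB now sees the DataFrame as a table
print(con.execute("SELECT count(*) FROM my_source").fetchall())
```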

One way to include data from these sources in a dbt-duckdb project is to use Python models, but I've found in my own work that this is fairly tedious, because it involves:

1) a lot of trivial, boilerplate Python that isn't much fun to write (see the sketch below), and
2) some tricky and/or non-obvious code to let the Python models access and operate on shared resources (e.g., a Google client or an Iceberg catalog), which requires configuration info that really belongs in the profiles.yml file that is designed for exactly that sort of thing.
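To make (1) concrete, here is roughly what that boilerplate looks like as a dbt-duckdb Python model; the path and config key below are placeholders:

```python
import pandas as pd

def model(dbt, session):
    # Every Excel/Sheets/Iceberg source needs a near-identical model like
    # this, and any shared configuration (credentials, catalogs) has to be
    # threaded through vars or environment variables instead of living in
    # profiles.yml where it belongs.
    path = dbt.config.get("excel_path")  # "excel_path" is a placeholder key
    return pd.read_excel(path)
```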

The proposal is to create a plugin system for dbt-duckdb (which is, somewhat awkwardly, itself a plugin for dbt-core) that lets us define the code we need to extract data from these external systems in a way that makes it accessible to DuckDB and our downstream transformation logic. To support this, we use special meta tags on the source (à la the external_location trick we use now for CSV/Parquet files) as arguments to the plugin's load function, which returns a data object that DuckDB knows how to convert into a table for use by the rest of the pipeline.
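As a rough illustration of the shape this could take (the class name, method signatures, and meta keys below are illustrative, not the final API):

```python
import pandas as pd

class ExcelPlugin:
    # Illustrative sketch of a source plugin; not the final API.

    def __init__(self, plugin_config: dict):
        # One-time setup from profiles.yml (credentials, catalog info, ...)
        self.config = plugin_config

    def load(self, source_config: dict):
        # source_config carries the meta tags from the source definition.
        # Return anything DuckDB can register as a table: a pandas/polars
        # DataFrame, a PyArrow table/dataset, etc.
        return pd.read_excel(source_config["location"])
```

The corresponding source definition would then carry the plugin name and its arguments in the source's meta tags (e.g., keys like `plugin: excel` and `location: ...`, again illustrative).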

You can think of these plugins as a way to support "elT" use cases, as opposed to dbt's standard "ELT" workloads where you really do want a high-powered extract/load system at your disposal: cases where the extraction/loading work is simple and safe enough that a small Python script is all we want or need.

jwills commented 1 year ago

@JCZuurmond curious for your take here, since you were my inspiration for this. ;-)

buremba commented 1 year ago

I wonder how useful it would be to combine this with Singer taps to support a number of different sources out of the box. It seems feasible by mocking singer.write_schema and singer.write_records and converting the data to a DataFrame internally, but I haven't tried it myself.
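A rough sketch of that mocking idea (untested, as noted; `run_tap_sync` is a hypothetical stand-in for whatever entry point a given tap exposes):

```python
import pandas as pd
import singer

# Capture what the tap would normally stream to stdout
_records = []

def _capture_schema(stream_name, schema, key_properties, **kwargs):
    pass  # the schema could be used to build dtypes; ignored in this sketch

def _capture_records(stream_name, records, **kwargs):
    _records.extend(records)

# Monkeypatch the singer writers before running the tap's sync
singer.write_schema = _capture_schema
singer.write_records = _capture_records

# run_tap_sync(tap_config)  # hypothetical: invoke the tap here

df = pd.DataFrame(_records)  # something DuckDB can consume directly
```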

jwills commented 1 year ago

> I wonder how useful it would be to combine this with Singer taps to support a number of different sources out of the box. It seems feasible by mocking singer.write_schema and singer.write_records and converting the data to a DataFrame internally, but I haven't tried it myself.

Yeah, this is a great question, because I've been wondering how far I can push this pattern. Right now, I think these plugins run single-threaded as part of the graph-compilation piece of dbt execution, so I didn't want to make it too easy to inject arbitrarily slow or complicated work here (e.g., pulling a lot of data from the GitHub API). That kind of extraction really should be done outside of dbt-duckdb using one of the many excellent EL frameworks out there, which can run multiple threads/processes, restart from failure, be aware of things like API rate limits, etc.