dbt-labs / spark-utils

Utility functions for dbt projects running on Spark
https://hub.getdbt.com/fishtown-analytics/spark_utils/latest/
Apache License 2.0

How to support `dbt-databricks`, too? #23

Closed · jtcohen6 closed this issue 2 years ago

jtcohen6 commented 2 years ago

See: Slack thread

The macros in this package presume that you're using an adapter named spark, and so they're prefixed with spark__ accordingly. But what about users of the dbt-databricks adapter?

There are a few options here, which I'll discuss in more detail below. (cc @superdupershant)

Adapter inheritance

dbt will consider "parent adapter" macros as valid dispatch candidates, after searching for the current adapter name (docs).
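To make that concrete: a dispatched macro in dbt_utils looks roughly like this (a sketch for illustration, using datediff as the example):

{% macro datediff(first_date, second_date, datepart) %}
  {# dispatch resolves to <adapter>__datediff, falling back to parent adapter prefixes, then default__ #}
  {{ return(adapter.dispatch('datediff', 'dbt_utils')(first_date, second_date, datepart)) }}
{% endmacro %}

If dbt-databricks declared dbt-spark as its parent, dispatch on a Databricks target would search for databricks__datediff first, then fall back to spark__datediff (e.g. the one in this package), then default__datediff.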

If possible, this is by far the most expedient solution: it enables the most code reuse and requires the least code duplication. Everything in this package (and in other Spark-compatible packages, including many of Fivetran's) can stay exactly the same, and will just work with dbt-databricks, too.

How would this look in practice? The dbt-databricks adapter could add an upstream dependency on the dbt-spark adapter, similar to how the dbt-redshift plugin declares a dependency on the dbt-postgres adapter, the dbt-synapse plugin declares a dependency on the dbt-sqlserver adapter, etc.

Note: Because dbt-spark keeps its tricky dependencies in extras, and otherwise requires only dbt-core and sqlparams, I believe that dbt-databricks could require dbt-spark (with no extras), and it would still not require installing PyHive/pyodbc/etc.

This is definitely the approach I'd recommend. It also lays the foundation for a more symbiotic relationship between dbt-spark and dbt-databricks going forward. The latter can import functionality from the former, the former can upstream cleaner approaches from the latter, and we can collectively minimize the footprint of duplicative code.

Wrapper macros

If adapter inheritance wouldn't work, then for every spark__some_macro(), we need a macro named databricks__some_macro(), too. It could be a full copy-paste of the same code, or (better) it could be a "wrapper" macro that redirects to the spark__ version:

{% macro databricks__datediff(first_date, second_date, datepart) %}
  {{ return(spark_utils.spark__datediff(first_date, second_date, datepart)) }}
{% endmacro %}

This works a lot like macro dispatch; it just has to be spelled out manually each time.

Where could these "wrapper" macros live?

  1. Inside the spark_utils package, below each spark__ macro, we add a databricks__ macro that calls it with the exact same arguments. This is how tsql-utils supported dbt-synapse before "parent" dispatch was introduced in dbt-core v0.20 (https://github.com/dbt-msft/tsql-utils/pull/64).
  2. Inside a separate package, databricks_utils, that shims spark_utils (and thereby, in turn, shims dbt_utils); see the dispatch sketch after this list.
  3. Inside the dbt-databricks adapter itself. Adapter plugins can define any macros they want! If dbt-databricks were to include databricks__datediff() within its own codebase, then I think users would just need to add this config:
    # dbt_project.yml
    dispatch:
      - macro_namespace: dbt_utils
        search_order: ['dbt', 'dbt_utils']
        # i.e. 'dbt' instead of 'spark_utils';
        # 'dbt' includes macros in the 'dbt-databricks' plugin, too
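For the first two options, users would still need a dispatch config in their own projects pointing at the shim. A sketch for option 2, assuming a hypothetical databricks_utils package:

# dbt_project.yml
dispatch:
  - macro_namespace: dbt_utils
    # search the (hypothetical) shim first, then spark_utils, then dbt_utils itself
    search_order: ['databricks_utils', 'spark_utils', 'dbt_utils']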

Why not pursue the third option? Packages like spark_utils exist because the pace of package development differs from that of dbt-core development. (The macros in this package may be affected by changes in dbt_utils, but they're less likely to be affected by changes in dbt-core.) I also believe the maintenance expectations differ: the macros in this package are tested and useful, but not as thoroughly tested as the functionality in dbt-spark, and not as essential to basic dbt functionality as the macros in the adapter plugin (materializations, create table, etc.).

tfayyaz commented 2 years ago

@jtcohen6 just wanted to check: is this issue now resolved, and was the adapter inheritance option used?

jtcohen6 commented 2 years ago

@tfayyaz Correct! Closing :)