dbt-labs / dbt-core


dbt extensions / dbt jinja extensions #3830

Closed: jpeak-intellify closed this issue 2 years ago

jpeak-intellify commented 2 years ago

Describe the feature

As a dbt user, I should be able to install extensions through pip and have the dbt CLI autoload them, similar to pytest, allowing control of the event hooks and the Jinja environment for custom behaviour.

Describe alternatives you've considered

I am trying to automate overriding the generate_database_name and generate_schema_name macros, but even if I install a dbt package with my custom macro logic, I still need to manually shim these macros in each repo.
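
For context, a minimal sketch of the per-repo shim being described, assuming the custom logic lives in a package installed as my_package_with_custom_macro_logic (a hypothetical name). Every consuming repo has to repeat this root-project override by hand:

-- macros/generate_schema_name.sql (repeated in every repo)
{% macro generate_schema_name(custom_schema_name, node) -%}
    {# delegate to the packaged implementation; package name is hypothetical #}
    {{ return(my_package_with_custom_macro_logic.generate_schema_name(custom_schema_name, node)) }}
{%- endmacro %}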

Additional context

N/A

Who will this benefit?

Power users whose architecture or naming conventions are dictated by business users and have outrun what dbt can achieve out of the box (OOB).

Ideally, allowing the Jinja environment to be extended would be the end goal here. It looks like the do statement from jinja2.ext.do is already in use: https://github.com/dbt-labs/dbt/blob/develop/core/dbt/clients/jinja.py#L489
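
For illustration, the do statement that extension provides is already usable in dbt-Jinja today, e.g. for calling an expression purely for its side effect:

{% set columns = [] %}
{% do columns.append('order_id') %}  {# expression statement courtesy of jinja2.ext.do #}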

My ideal UX/DX would be:

This leverages prior art for plugin design. It is minimally invasive (leaving plugins for dbt event hooks as a separate scope of work), and it leverages the Jinja extension community to define the scope of what can be done. Authoring a Python extension also lets you maintain that code with Python tooling, which beats maintaining Jinja macros in a dbt package.

Are you interested in contributing this feature?

Happy to help contribute this feature. Wanted to discuss scope of work before starting work though to fit in the roadmap.

For reference, I was a maintainer of the vcrpy API testing library for 6 months in 2019.

jtcohen6 commented 2 years ago

@jpeak-intellify Thanks for opening! It's a really solid thought, and I hope we can kick off some cool discussion.

Big idea

As we continue to solidify the foundations of dbt-core, I'm also interested in making it much more pluggable. We already have some thinking in this vein, ranging from the quite broad to the pretty specific.

So, why not the Jinja context, too?

I think there are two different ways to plug into / extend the syntax and functionality available in the dbt-Jinja templating context: turning on built-in extensions that ship with jinja2, or installing custom user-defined extensions.

Are you thinking more about turning on extensions that are installed along with jinja2, or installing custom extensions? If the extensions are all built-ins, I'm thinking that the right "on/off" switch could even be an environment variable (!), rather than another python program/plugin. If you're talking about custom user-defined extensions, though, the need for a python plugin makes sense.
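
To make the built-in case concrete: a hypothetical sketch, assuming jinja2.ext.loopcontrols (a standard jinja2 extension that dbt does not currently enable) were switched on, which would make {% break %} and {% continue %} available in templates:

{# hypothetical: assumes jinja2.ext.loopcontrols is enabled, and column_names is set earlier #}
{% for col in column_names %}
    {% if col.startswith('_') %}{% continue %}{% endif %}
    {{ log("keeping column: " ~ col, info=true) }}
{% endfor %}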

My concerns:

Specific use case

I am trying to automate overriding the generate_database_name and generate_schema_name macros, but even if I install a dbt package with my custom macro logic, I still need to manually shim these macros in each repo.

This is more specific to dbt functionality. I totally hear you, and I think there's a pretty neat way we can enable this!

There's a detailed proposal (https://github.com/dbt-labs/dbt/issues/3456) to specify dbt as the macro_namespace of all globally defined macros. This would allow you to do things like set this one-time config in each project:

# dbt_project.yml
dispatch:
  - macro_namespace: dbt
    search_order: ['current_root_project', 'my_package_with_custom_macro_logic', 'dbt']

Whenever dbt goes to resolve a "global" macro, it would look for candidate implementations (prefixed adapter__ or default__) first in current_root_project, then in my_package_with_custom_macro_logic, then (finally) in the dbt repo.

Your issue prompted me to actually give this a go, which I did over in https://github.com/dbt-labs/dbt/pull/3851. I think it might... just... work :)

Now, to work for a specific global macro, that macro needs to be dispatched, and the generate_x_name macros aren't today. (Historically, we've tended to dispatch macros that require different syntax across database adapter plugins.) If we dispatch those, with the explicit macro_namespace='dbt', and your package reimplements default__generate_schema_name, and you specify that config in your root project file, it should just work—it did when I tested locally just now.
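
To make that concrete, here's a sketch (not the final implementation) of the two halves, assuming the global macro gets dispatched with macro_namespace='dbt' as proposed:

-- in dbt-core's global project: the routing macro users call
{% macro generate_schema_name(custom_schema_name, node) -%}
    {{ return(adapter.dispatch('generate_schema_name', 'dbt')(custom_schema_name, node)) }}
{%- endmacro %}

-- in your package: the candidate implementation dbt resolves to,
-- given a search_order that ranks your package above 'dbt'
{% macro default__generate_schema_name(custom_schema_name, node) -%}
    {# your custom naming logic goes here #}
    ...
{%- endmacro %}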

Having said all that, I'm actually quite curious to hear about how you were planning to accomplish the above with Jinja extensions.

jpeak-intellify commented 2 years ago

Thanks @jtcohen6 for the comprehensive context catchup.

Big Ideas

It seems like, overall, we are on the same page about a larger scope of work around the extensibility of dbt-core, which is why I referenced pytest as prior art. dbt (in my mind) sits in a similar space: it codifies the backbone of a common process.

dbt seems to have solved the problem for 80-90% of people, but you've given the enterprises that have been data warehousing for decades a teaser of a better future: one where the sophistication of the tool matches the complexity of the data modelling. Today they are not matched, and in honesty that gap is a bit too enterprise-specific.

It was really awesome reading through the list of extension points (which look like the hooks I was trying to find) that allow enterprises to handle their overly specific quirks.

Specific use case

My specific problem is:

  1. How do I codify our custom naming conventions for databases and schemas? A reusable dbt package.
  2. How do I reliably get our macros preferred over the default global macros? 🤷

My poor assumption

I think I naively went down this route assuming I was constrained to the Jinja2 environment, and blindly suggested gaining access to modify that environment.

After looking at some Jinja extensions I am not confident it would achieve what I am after, and if it did, I'm not confident the person following me would be able to maintain that extension. So that was a bad idea in hindsight...

Custom macro namespace precedence:

The dispatch & search_order issue and PRs you linked seem more in line with the specific problem I am trying to solve.

Minor concern:

My only concern is that it uses adapter.dispatch(...), and it is now dispatching for more reasons than just adapters. That refactor/rename can be a separate scope of work, though.

Ideal outcome:

My ideal outcome is that we have our own BigCorp flavoured dbt package dbt-bigcorp-macros that we reference internally like:

packages.yml

packages:
  - git: "git@github.com:bigcorp/dbt-bigcorp-macros.git"
    revision: vX.Y.Z

Then it sounds like we would just create a dispatch namespace for them:

bigcorp-macros.sql

{% macro bigcorp__generate_schema_name(custom_schema_name, node) -%}
    ...
{% endmacro %}

{% macro bigcorp__generate_database_name(custom_database_name=none, node=none) -%}
    ...
{% endmacro %}

Then if I understand correctly, opting into custom ordering would add the following to dbt_project.yml:

dbt_project.yml

dispatch:
  - macro_namespace: dbt
    search_order: ['my_root_project', 'bigcorp', 'dbt']

This would solve a whole class of problems for us, getting predictable behaviour when adopting our own flavour of conventions that deviate from the default dbt training modules.

Next steps?

jtcohen6 commented 2 years ago

@jpeak-intellify Thanks for the detailed follow-up, and for engaging with the multiplicity of thoughts I threw your way.

I think you've understood the dispatch strategy quite well; I just want to clarify a few points. You're right that the primary historical purpose of adapter.dispatch has been to handle cross-database functionality. The advent of dispatch in v0.18 (replacing adapter_macro) added a critical capability: dispatching across packages (macro namespaces). We initially intended this to solve a highly specific problem: extending dbt_utils macros to arbitrarily many database adapters, without requiring a PR into the main package for each one. But the capability has really expanded and matured since then: You define a "routing" macro that users call. Then, statically at parse time (!), on the basis of your project-level dispatch config and your current database adapter, dbt will figure out which macro they actually want to use.

So the syntax of dispatch retains both parts of the equation:

  1. The project-level dispatch config that defines search_order, allowing you to rank installed packages above other installed packages, or above the global project
  2. Adapter-specific viability, in the form of the macro prefix and its ranked preference order (e.g. redshift__, then postgres__, then default__)

The workflow we've been describing primarily depends on the first piece, but you'll still need to prefix your macros with an adapter signifier (almost certainly default__). So it's exactly the code you shared above, except:

-- dbt_bigcorp_macros/macros.sql

{% macro default__generate_schema_name(custom_schema_name, node) -%}
    ...
{% endmacro %}

{% macro default__generate_database_name(custom_database_name=none, node=none) -%}
    ...
{% endmacro %}

The bigcorp__ prefix would make sense if you were using a custom dbt-bigcorp database adapter. When it's just a matter of preferring the candidate macros defined in the dbt-bigcorp-macros package, that's the role of the project config's search_order.

I get that the default__ prefix may feel unnecessary in this case. But imagine if dbt-bigcorp-macros wanted to be usable by projects running on Apache Spark or Trino or UnknownAdapter, and do different things on each. Then you could define:

-- dbt_bigcorp_macros/macros.sql

{% macro spark__generate_schema_name(custom_schema_name, node) -%}
    ...
{% endmacro %}

{% macro trino__generate_schema_name(custom_schema_name, node) -%}
    ...
{% endmacro %}

{% macro default__generate_schema_name(custom_schema_name, node) -%}
    ...
{% endmacro %}

Following your search_order, dbt would always prefer a macro in your package. It would pick dbt_bigcorp_macros.spark__generate_schema_name on Spark, dbt_bigcorp_macros.trino__generate_schema_name on Trino, and dbt_bigcorp_macros.default__generate_schema_name when running anywhere else.
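
For completeness, the root-project config this resolution assumes. Note that search_order entries are package project names, so here I'm assuming the package names itself dbt_bigcorp_macros in its own dbt_project.yml:

# dbt_project.yml
dispatch:
  - macro_namespace: dbt
    search_order: ['my_root_project', 'dbt_bigcorp_macros', 'dbt']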

So, if that all still sounds okay to you, I'm going to:

jpeak-intellify commented 2 years ago

Thanks for the clarifications about the use of dispatched namespaces.

It makes sense that the macros are resolved statically at parse time.