Store the compiled code from hooks, or statistics on it

matt-winkler commented 5 months ago

Describe the feature

As a project admin, I want to be able to analyze the code written in hooks on dbt models. dbt hooks can execute arbitrary SQL statements. The goal of capturing this data is to be able to catch situations in which business logic is embedded in hooks instead of being coded in dbt models themselves.

Describe alternatives you've considered

Linters might be able to achieve something similar. Scanning SQL Fluff's dbt templater I didn't see any obvious controls for hooks.
If storing the SQL code itself is problematic because of practical, security reasons, it might be useful to compute summary statistics e.g. total number of lines

Additional context

By having additional tools to align the dbt codebase, dbt users can better leverage additional features like restarting jobs from the point of failure.

Who will this benefit?

Groups migrating from legacy SQL systems to dbt.

Are you interested in contributing this feature?

Yes

dave-connors-3 commented 3 months ago

million dollar matt! love this idea -- we rely on the graph variable for all the unpacking of project data, and I am not positive that hooks get written there. I can do some digging, but we may be hamstrung here!

dave-connors-3 commented 3 months ago

I may have spoken too soon -- looks like we have uncompiled code in the node config!

what kind of analysis do you anticipate wanting to do on the sql in hooks?

matt-winkler commented 3 months ago

Nice! Thank you for researching. Off the top:

num_lines (INT)
calls_other_macros (bool)
contains_self_reference (bool) --> to indicate the hook processes data in the table managed by the dbt model itself
performs_insert (bool)
performs_delete (bool)
performs_update (bool)
performs_truncate (bool)
performs_create_table (bool)
calls_stored_procs (bool)

algarbar commented 3 months ago

(+1) - Thanks @matt-winkler for thinking of this idea!

TL;DR: This is a great! Baking guardrails into dbt for hooks will yield a number of benefits such as proper coding, code visibility, and contract effectiveness.

Re-Framing

Perhaps we reframe the feature to befit dbt's larger, overall view of code compilation/visibility and model contracts. We can do this while still reserving actionable progress to the feature's focus of intent- hooks.

The feature, as I understand it, is not limited to project admins alone. The goal appears to be to gaurdrail developers to patterns aligned to a dbt ethos. Any code can go into a hook. But, it probably shouldn't.

RE Code compilation/visibility:

Code compilation of hook components does not alter the DAG in ways perhaps intended by the developer. Dependencies can be created within hooks that are left unexpressed. That's bad.

Furthermore, compiled code is presented without principal components of execution. An update statement in a hook, a workhorse macro, ... many components are left out of visiblity in terms of what will be executed (I.E. The compiled and parsed code). dbt simply doesn't know what will be executed in such code. However, dbt could surface visbility that {some} code will be executed.

RE Model Contracts

You can't have a contract if there is work on its assets outside visibility!

Matt's listed examples illustrate how use of hooks can unintentionally (or intentionally) obfuscate code and dilute contact fulfillment.

@dave-connors-3:

what kind of analysis do you anticipate wanting to do on the sql in hooks?

Analysis would extend to each of the model contract focus domains. Quality: If a hook has a DML (which is effectively a hidden model), does it have a test? Observability: How does code in a hook impact the DAG? etc...

Example of dbt's model contracts focus domains include:

Data/Code Quality
Data Observability
Data Profiling
Data Freshness
Consumption
Documentation:
- Service Level Agreements
- Pricing
- Field Level Documentation
- System Location (HANA/TD/BQ/SF)

Outcome

Building guardrails will entail surfacing warnings/errors related to hook usage. Enhancing how the node config is parsed, compiled, and presented will advance dbt's mission (E.G. in their contracts) and community packages (E.G. project evaluator) in their scope/effectiveness.

dave-connors-3 commented 2 months ago

@matt-winkler @algarbar i spent some time on this today, and there are a couple caveats to what we have available that make me wonder if it's worth investing in!

1. the graph object only contains the raw sql in post hooks.

If a hook calls a macro, all we'll see in the graph is {{ my_macro() }} -- I presume this dilutes a good amount of the potential value here since my assumption is that a lot of these "shadow models" are parametrized and packed into a macro .

2. we already collect hard coded references

Because we parse the entirety of the raw sql in our fct_hard_coded_references model, if the user passes the hook into the {{ config() }} macro, we would detect those references in our regex logic.

Given these 2 things, i'm wondering what we should do next! open to thoughts and suggestions

matt-winkler commented 2 months ago

Hey @dave-connors-3 I'm not sure I agree with the assumption in point 1 above. A major use case for this is identifying when migrations to dbt from other transformation tools go awry, or aren't done according to dbt best practices. Many of those other tools don't include macros like dbt conceives of them at all. It does make sense that e.g. permissions grants via macros might not be as inspectable with this, but that's a different use case than what I (at least) considered for this.

Good point on hard coded references. Do you know if there's a way we can identify those hard coded references are created VIA a hook specifically? More curious on this one.

dbt-labs / dbt-project-evaluator