dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-278] [Feature] Allow meta property to be accessible at compile time #4776

Open DVAlexHiggs opened 2 years ago

DVAlexHiggs commented 2 years ago

Is there an existing feature request for this?

Describe the Feature

It would be great if the config.meta property was populated at compile time. This would allow users to use the meta dictionary to generate SQL for their dbt models at scale using the meta dictionary as parameters. For highly parameterised SQL via macros, such as in dbtvault, the following approaches are currently possible:

All three of these approaches (using fromyaml, declaring parameters in the model, and global variables) work, but aren't particularly scalable or easy to maintain.

Some downsides of the above approaches:

I requested a similar feature here, and I believe allowing config and meta to be declared in property (.yml) files was one of the original answers to this idea. Whilst this is a fantastic feature and has ensured dbt has come on in leaps and bounds with metadata management, I believe it's not quite there yet.

I recently tried using meta to replace the previously described approaches (fromyaml, params in model, global variables) and unfortunately came across the issue that we couldn't access the data at compile time. The snippets below demonstrate this:

-- my_stage.sql
SELECT 1 AS "MY_TEST_COLUMN"
-- my_table.sql

{{ config(materialized='table') }}

{% do log('meta: ' ~ config.get('meta'), true) %}
{% do log('param_1: ' ~ config.get('meta')['param_1'], true) %}
{% do log('param_2: ' ~ config.get('meta')['param_2'], true) %}
{% do log('param_3: ' ~ config.get('meta')['param_3'], true) %}
{% do log('param_4: ' ~ config.get('meta')['param_4'], true) %}
{% do log('source_model_name: ' ~ config.get('meta')['source_model_name'], true) %}

{% set param_1 = config.get('meta')['param_1'] -%}
{% set param_2 = config.get('meta')['param_2'] -%}
{% set param_3 = config.get('meta')['param_3'] -%}
{% set param_4 = config.get('meta')['param_4'] -%}
{% set source_model_name = config.get('meta')['source_model_name'] -%}

SELECT 1 AS "{{ param_1 }}"
       , 2 AS "{{ param_2 }}"
       , 3 AS "{{ param_3 }}"
       , 4 AS "{{ param_4 }}"
FROM {{ ref(source_model_name) }}
# properties.yml
version: 2

models:
  - name: my_table
    meta:
      param_1: COLUMN_1
      param_2: COLUMN_2
      param_3: COLUMN_3
      param_4: COLUMN_4
      source_model_name: my_stage

Expected/Desired outcome

-- my_table.sql (compiled)

SELECT 1 AS "COLUMN_1"
       , 2 AS "COLUMN_2"
       , 3 AS "COLUMN_3"
       , 4 AS "COLUMN_4"
FROM MY_DATABASE.MY_SCHEMA.my_stage

Current outcome (dbt 1.0.3)

00:10:38  meta:
00:10:38  param_1:
00:10:38  param_2:
00:10:38  param_3:
00:10:38  param_4:
00:10:38  source_model_name:
00:10:38 Encountered an error:
Compilation Error in model my_table (models/my_table.sql)
  The name argument to ref() must be a string, got <class 'jinja2.runtime.Undefined'>

I realise the original intent was probably that metadata would be static data to have alongside models, similar to tags and documentation. However, there is a definite use case for it to drive SQL generation as well.

Thank you!

Describe alternatives you've considered

As described above.

Who will this benefit?

Anything else?

gshank commented 2 years ago

I think the problem isn't so much that meta isn't available at compile time, it's that the config from properties.yml isn't available when the ref is processed. A dbt project is parsed in the order: 1) macros, 2) models, 3) schema files. When the model is "parsed", we extract the ref, source, and config calls and use them to set information in the nodes. It doesn't even help to put the meta config in the model, because that data is extracted at the same time as the ref. The model that is ref'd must resolve to a string at parse time, or we cannot construct the DAG properly.
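To picture why the ref target must be a literal string at parse time, here is a minimal Python sketch (illustrative only, with hypothetical names; this is not dbt's actual implementation): refs are extracted from the raw template text by static analysis, before schema-file configs (and therefore meta) have been merged into the node, so a ref whose argument is a variable yields no literal and no DAG edge can be built.

```python
import re

# Hypothetical sketch of parse-time ref extraction (not the real dbt code):
# ref targets are pulled from the raw template text before rendering,
# so only literal string arguments can be resolved into DAG edges.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")

def extract_refs(raw_sql: str) -> list:
    """Return the ref targets that can be resolved statically."""
    return REF_PATTERN.findall(raw_sql)

static_model = "SELECT * FROM {{ ref('my_stage') }}"
dynamic_model = "SELECT * FROM {{ ref(source_model_name) }}"

print(extract_refs(static_model))   # literal argument: an edge can be built
print(extract_refs(dynamic_model))  # nothing extracted: ref fails later
```

The dynamic case is exactly the `ref(source_model_name)` pattern from the issue: at parse time the variable is Undefined, so nothing static can be extracted.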

We do have discussions on a regular basis about whether there are advantages to tweaking the order that things are processed, or about moving some flavors of setting to a different place. But these are all pretty complicated and delicate things to change since they are so central to the way that dbt works.

Let us know if you have other ideas about how to make this easier to work with.

DVAlexHiggs commented 2 years ago

Apologies, accidentally closed.

Thanks for your response. Back in the old days we could scope variables specifically to models, and these variables were provided in the dbt_project.yml and used the directory structure to assign the variables to a specific model.

This was good but it quickly became hard to manage as the dbt_project.yml became larger and larger for even the smallest projects.

We should not revert to that, but I'm wondering if model-scoped variables, provided in properties.yml, would offer the same capability whilst being vastly superior in terms of scaling. It would bring all of the advantages of properties.yml, such as arbitrary nesting and allowing a user to split the variables across multiple files. Example:

# properties.yml
version: 2

models:
  - name: my_table
    variables:
      param_1: COLUMN_1
      param_2: COLUMN_2
      param_3: COLUMN_3
      param_4: COLUMN_4
      source_model_name: my_stage
-- my_table.sql

{{ config(materialized='table') }}

SELECT 1 AS "{{ var('param_1') }}"
       , 2 AS "{{ var('param_2') }}"
       , 3 AS "{{ var('param_3') }}"
       , 4 AS "{{ var('param_4') }}"
FROM {{ ref(var('source_model_name')) }}

Perhaps then, processing variables at the same stage as macros, or just before or after them (but always before models), would be a minimal solution that wouldn't have an undesirable impact on the fundamental operation of dbt as a whole?

Really eager to find a solution to this, as it will have huge implications, as described in my initial proposal.

EDIT:

I've thought about this a little more, and the downside I can see is that it's perhaps not intuitive, as these parameters are fundamentally for configuring the models rather than being variables per se.
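The lookup the proposal implies could be sketched like this in Python (names are illustrative, not a dbt API): a model-scoped variable declared in properties.yml would shadow a project-level one, and both would be fixed before any model is parsed.

```python
# Hypothetical sketch of the proposed model-scoped variable resolution.
# Project-level vars come from dbt_project.yml; model-scoped vars would
# come from a `variables:` block in properties.yml (an assumed syntax).
project_vars = {"param_1": "DEFAULT_1", "source_model_name": "default_stage"}

model_vars = {
    "my_table": {"param_1": "COLUMN_1", "source_model_name": "my_stage"},
}

def var(name, model):
    """Resolve var(name) for a model: model scope first, then project scope."""
    scoped = model_vars.get(model, {})
    if name in scoped:
        return scoped[name]
    return project_vars[name]

print(var("param_1", "my_table"))         # model scope wins: COLUMN_1
print(var("source_model_name", "other"))  # project fallback: default_stage
```

Because both dictionaries would be frozen before parsing begins, `ref(var('source_model_name'))` could resolve to a literal string at parse time.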

scajanus commented 2 years ago

@DVAlexHiggs does replacing e.g.:

{% set source_model_name = config.get('meta')['source_model_name'] -%} with {% set source_model_name = model.config.meta.source_model_name -%}

work for you as of now? Otherwise can you provide a minimal working example of a project, and I could check if your issue is the same I ran across.

Edit: This is actually advised against here. The workaround would be to use the execute flag as advised, but that requires you to add -- depends_on: ref(... back to the referenced model, which is exactly what you were hoping to avoid.

DVAlexHiggs commented 2 years ago

Hi @scajanus. Yes, the advice against this is the reason I had not explored that option.

I can confirm it does not work reliably, due to the reasons mentioned in the docs you linked (incomplete graph).

There must be a way we can get this feature implemented in a dbt-esque way.

DVAlexHiggs commented 2 years ago

Putting the meta in the model's config block also doesn't seem to work as desired:

https://github.com/Datavault-UK/dbtvault/issues/106#issuecomment-1078984150

@gshank Odd that using a config block in the model doesn't work either; is the config processed after the model? That doesn't seem to make sense to me.

scajanus commented 2 years ago

@DVAlexHiggs my understanding is that none of the config is treated as final until the full graph is parsed, i.e. there is no (pre)parser that would check if a part of the config is already safe to use -- as in the model case.

The easiest way forward would seem to be allowing users to define globals (or template/model locals) that cannot be changed by the graph/template parsing. Jinja has the Global Namespace and Template Globals for this.

Otherwise you would need to define the order in which each of the tags is expanded and do multiple passes of parsing, which seems nasty, as you would also need to check whether config is mutated and determine whether that is allowed, and what should happen then. If I've understood the discussion here correctly, that is :)

Either your config is fixed before parsing the graph (as it used to be?), or you allow the graph to change the config. If you allow the graph to change the config, you cannot use it to alter the way the graph is parsed (and expect consistent results). So if you want to use some variables during the parsing, they need to be immutable by the parsing, as the current variables are. We seem to be looking for a way to have variables/config that is model-specific but immutable.
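The "model-specific but immutable" requirement can be pictured with a small Python sketch (illustrative semantics only, not a dbt proposal implementation): values are frozen before graph parsing begins, so parsing can read them but cannot mutate them, unlike config, which the graph may rewrite.

```python
from types import MappingProxyType

# Sketch of "model-specific but immutable" parameters: frozen before
# parsing starts, readable during parsing, mutation rejected.
def freeze(params):
    return MappingProxyType(dict(params))

model_params = freeze({"source_model_name": "my_stage"})

print(model_params["source_model_name"])  # readable while parsing the graph

try:
    model_params["source_model_name"] = "other"  # any mutation fails
except TypeError as exc:
    print("immutable:", exc)
```

Since parsing cannot alter these values, using them inside ref() would give consistent results across passes, which is exactly the property config lacks.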

I went through some of the related issues/comments from before (https://github.com/dbt-labs/dbt-core/issues/2377#issuecomment-621906565, https://github.com/dbt-labs/dbt-core/issues/2401, https://github.com/dbt-labs/dbt-core/issues/1790), but none of those change the fact that variables are package-global and config can be changed by the graph. They just allow new locations for setting them. Consequently, the feature requested here should not apply to config, a subset of config ('meta'), or variables, but should be its own entity.

DVAlexHiggs commented 2 years ago

@scajanus Thank you for looking into this! This makes sense and helps me to understand why it's not a simple feature to implement. Any chance we can explore this further @gshank ?

DVAlexHiggs commented 2 years ago

I've found a semi-working solution for this, though it's a bit messy due to the need for if execute and -- depends_on.

Primarily, the need for if execute severely limits this approach, as users would not be able to compile without running, or use other dbt features that don't rely on dbt run.

# properties.yml 
version: 2

models:
  - name: my_meta_table
    meta:
      param_1: COLUMN_1
      param_2: COLUMN_2
      param_3: COLUMN_3
      param_4: COLUMN_4
      source_model_name: v_stg_orders
-- my_meta_table.sql
{{ config(materialized='table') }}

{% do log('meta: ' ~ config.get('meta'), true) %}
{% do log('param_1: ' ~ config.get('meta')['param_1'], true) %}
{% do log('param_2: ' ~ config.get('meta')['param_2'], true) %}
{% do log('param_3: ' ~ config.get('meta')['param_3'], true) %}
{% do log('param_4: ' ~ config.get('meta')['param_4'], true) %}
{% do log('source_model_name: ' ~ config.get('meta')['source_model_name'], true) %}

{% set param_1 = config.get('meta')['param_1'] -%}
{% set param_2 = config.get('meta')['param_2'] -%}
{% set param_3 = config.get('meta')['param_3'] -%}
{% set param_4 = config.get('meta')['param_4'] -%}
{% set source_model_name = config.get('meta')['source_model_name'] -%}

-- depends_on: {{ ref('v_stg_orders') }}

{% if execute %}

SELECT 1 AS "{{ param_1 }}"
       , 2 AS "{{ param_2 }}"
       , 3 AS "{{ param_3 }}"
       , 4 AS "{{ param_4 }}"

FROM {{ ref(source_model_name) }}
{% endif %}

Logs:

08:56:46  Running with dbt=1.0.6
08:56:46  meta: 
08:56:46  param_1: 
08:56:46  param_2: 
08:56:46  param_3: 
08:56:46  param_4: 
08:56:46  source_model_name: 
08:56:46  Found 32 models, 0 tests, 0 snapshots, 0 analyses, 497 macros, 0 operations, 0 seed files, 8 sources, 0 exposures, 0 metrics
08:56:46  
08:56:48  Concurrency: 4 threads (target='dev')
08:56:48  
08:56:48  1 of 1 START table model ALEX_HIGGS.my_meta_table............................... [RUN]
08:56:48  meta: {'param_1': 'COLUMN_1', 'param_2': 'COLUMN_2', 'param_3': 'COLUMN_3', 'param_4': 'COLUMN_4', 'source_model_name': 'v_stg_orders'}
08:56:48  param_1: COLUMN_1
08:56:48  param_2: COLUMN_2
08:56:48  param_3: COLUMN_3
08:56:48  param_4: COLUMN_4
08:56:48  source_model_name: v_stg_orders
08:56:52  1 of 1 OK created table model ALEX_HIGGS.my_meta_table.......................... [SUCCESS 1 in 3.66s]
08:56:52  
08:56:52  Finished running 1 table model in 5.63s.
08:56:52  
08:56:52  Completed successfully
08:56:52  
08:56:52  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Compiled SQL:

-- depends_on: SANDBOX.ALEX_HIGGS.v_stg_orders

SELECT 1 AS "COLUMN_1"
       , 2 AS "COLUMN_2"
       , 3 AS "COLUMN_3"
       , 4 AS "COLUMN_4"

FROM SANDBOX.ALEX_HIGGS.v_stg_orders
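The two log blocks above come from dbt rendering the model twice: once at parse time with execute false and an unpopulated config (blank values), and once at execution time with the merged config. A rough Python sketch of that two-pass behaviour (illustrative only, not dbt internals):

```python
# Illustrative sketch of the two renders behind the logs above: the parse
# pass sees an empty meta and skips the body guarded by {% if execute %};
# the execution pass sees the merged config and renders the full SQL.
def render(execute, meta):
    log = "meta: " + meta.get("param_1", "")  # parse pass logs blanks
    if not execute:
        return log                            # guarded body is skipped
    return log + "\nSELECT 1 AS " + meta["param_1"]

print(render(execute=False, meta={}))                       # parse pass
print(render(execute=True, meta={"param_1": "COLUMN_1"}))   # execution pass
```

This is why the 08:56:46 log lines are empty while the 08:56:48 lines show the full meta dictionary, and why commands that only parse or compile never see the populated values.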
rumbin commented 2 years ago

I can only second this feature request.

We are looking for a solution that would allow us to reuse as much as possible from existing marts' models for new marts that are schematically identical but differ only by some filter or source model names. Thus, we seek to inject the mart's name into the models and into the ref() functions therein.

However, extracting the schema name of the current model is harder than I had expected:

I am mentioning this use case and the not-working solutions at this point, as I feel that the proposed solutions above fail for the same reason.

On a side note, I also want to mention that folder-level variables might serve some of the abovementioned use cases, as long as they would be available at compile time: https://github.com/dbt-labs/dbt-core/issues/4938#issuecomment-1191121432

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

rumbin commented 1 year ago

This is too useful to be closed due to staleness...

DVAlexHiggs commented 1 year ago

Yes this should remain open

Turbine1991 commented 1 year ago

I'm confused as to why schema files are parsed after the model.

Furthermore, I can print the variable, just not use it, which also feels odd.

jolsby commented 1 year ago

This would be a powerful addition to dbt, and be highly valuable. ⬆

datacom-bozhu commented 11 months ago

Voting for this as well

kylienhu commented 8 months ago

Voting for this

darkcofy commented 6 months ago

voting for this

github-actions[bot] commented 2 days ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.