dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.97k stars 1.63k forks source link

Allow declaration of vars in schema.yml #2377

Closed DVAlexHiggs closed 4 years ago

DVAlexHiggs commented 4 years ago

Describe the feature

It would be excellent if it were possible to provide vars to models in schema.yml files like so:

version: 2
models:
  - name: <model name>
    description: <description>
    docs:
      show: <true|false>
    meta:
      key1: <value1>
      key2: <value2>
    vars:                     // Like this
      key1: <value1>
      key2: <value2>
    tests:
      - <test>
      - ...
    columns:
      - name: <column name>
        quote: <true|false>
        tags: <tags>
        description: <column description>
        tests:
          - <test>
          - ...
      - ...

Additional context

I'm suggesting this feature because the dbt_project.yml file gets considerably large and hard to manage at scale, when needing to provide variables to multiple models. It would be a lot easier to manage if there were a way to split the variables into sub-files, and I thought the schema.yml files would be a perfect candidate, as they already exist and are designed to hold other information related to models.

Who will this benefit?

This would mainly benefit larger companies who have created libraries of macros/packages with arguments whose values differ between models, and would prefer not to fill their model files with the setting of variables, but find it too difficult to manage if it's all in the dbt_project.yml

drewbanin commented 4 years ago

hey @DVAlexHiggs - cool idea! Unfortunately, I don't think this is something we're going to add support for.

If you check out #2312 , you can see that we very recently made a ton of changes around variable scoping. Historically, variables were scoped at the folder-level. This meant that you could do something like:

models:
  my_project:
    path1:
      vars: {myvar: abc123}
    path2:
      vars: {myvar: def456}

In this setup, a model in models/path1 would get a variable called myvar with the value abc123 whereas a model in models/path2 would get a value of def456. This made it really tricky to correctly build compilation contexts for different node types. Imagine you have a schema.yml file in models/path2 which describes a model in models/path1 -- what do you think the myvar value should evaluate to in that context?

To resolve inconsistencies like this, we've added a vars: block to the dbt_project.yml file which cannot be scoped to a particular path. It looks like:

vars:
  myvar: abc123

models:
  ...

Note though that variables can still be scoped to packages:

vars:
  package1:
    myvar: abc123
  package2:
    myvar: abc123

I just wanted to give some background on how we're thinking about variables, and some of the recent changes that impact variable usage. Given these changes, I don't think we'll be able to add model-scoped variables like you've described here. I do however think your use case is a good one, and I can imagine two paths forward:

  1. Supplying configurations in schema.yml files

This would look like:

models:
  - name: my_model
     config:
       key1: <value1>
       key2: <value1>

We could then provide a way to tap into these configuration values in the model .sql file context, eg:

select {{ config.get('key1') }} from ....
  1. Allow variables to be defined outside of the dbt_project.yml file

See #1790 for more info

Curious to hear what you think about these changes and options

DVAlexHiggs commented 4 years ago

Hey @drewbanin thanks for your response. I think both of these solutions have their merits. Solution 1 is more dbt-esque and provides more functionality to the current schema.yml, and keeps all metadata/information about models in one place.

I think solution 2 is a powerful alternative with a lot of versatility, but perhaps more care than usual would need to be taken with directory structures and organisation of the YAML files. I think personally I would prefer solution 1.

DVAlexHiggs commented 4 years ago

HI, any news on this? Really excited if this is being considered. I'm happy to contribute it myself as well, though I would need a point in the right direction to start with. Perhaps we could talk about this?

drewbanin commented 4 years ago

Ok - thanks for the additional context for this one on Slack @DVAlexHiggs.

To recap: You're leveraging the datavault package which relies on vars to configure models pks, natural keys, load timestamps, etc. I do think the dbt_project.yml file was a poor place for configs like that historically, so in light of the vars changes in 0.17.0, we should figure out a better path forwards.

I really like the idea of allowing model configs in .yml files. We have a dedicated issue for this over here: https://github.com/fishtown-analytics/dbt/issues/2401. I think #2401 was inspired by this very issue :)

2401 outlines a very similar syntax to what you described above, except the key will be config, not vars:

version: 2
models:
  - name: <model name>
    description: <description>
    docs:
      show: <true|false>
    meta:
      key1: <value1>
      key2: <value2>
    config:                     // Like this
      key1: <value1>
      key2: <value2>

We'll need to think about the order of parsing. I think a sensible order of operations is:

  1. Apply dbt_project.yml configs
  2. Apply schema.yml file configs
  3. Apply .sql file configs

If we make the configs from the dbt_project.yml file and schema.yml files available in the .sql file context, then you could simply use config.get() instead of var() to capture these config values. This should address your use case with the added benefit of deflating the size of your dbt_project.yml file which is, i bet, huge for large dbt projects on datavault :)

balmasi commented 4 years ago

@drewbanin I think the order you've specified makes sense. This will be a huge help for uber-templated models

DVAlexHiggs commented 4 years ago

Yes the order makes sense, and I agree that config is a better way forward

jtcohen6 commented 4 years ago

Closing in favor of #2401