dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
10.03k stars, 1.64k forks

[Feature] Allow mapping to be used in addition to sequence in YAML to define model columns #10351

Open gastlich opened 5 months ago

gastlich commented 5 months ago

Is this your first time submitting a feature request?

Describe the feature

Currently, columns within models: have to be defined as a sequence, as follows:

models:
  - name: model_one
    columns:
      - name: first_column
      - name: second_column
        description: This is a column
  - name: model_two
    columns:
      - name: first_column
      - name: second_column

My proposal is to allow columns to be defined as a mapping, in addition to the already supported sequence:

models:
  - name: model_one
    columns:
      first_column: null
      second_column:
        description: This is a column
  - name: model_two
    columns:
      first_column: null
      second_column: null

This feature request is driven by a few factors.

  1. Flexibility: The mapping format allows more flexibility in defining columns. We can use YAML's native mapping features, such as the merge key (https://yaml.org/type/merge.html). On the other hand, nothing forces us to use it: columns can still be defined as a simple sequence.

  2. Readability: Because it lets us apply the DRY principle, the mapping format is more readable and less bloated than the sequence format. You don't have to repeat the same columns in multiple places. In our case, we produce two types of marts, the latest "state" and the "history". The "history" mart has the same columns as the "state" mart, plus some additional ones. The mapping format would allow us to define the common columns once and then add the extra columns for the "history" mart:

columns_mart__loans: &columns_mart__loans
  source_system:
    description: asdf2
    tests:
      - not_null
  source_system_id:
    tests:
      - unique
      - not_null

models:
  - name: mart__loans_history
    columns:
      <<: *columns_mart__loans
      valid_from:
        description: asdf5
        tests:
          - not_null
      valid_to:
        description: asdf6

  - name: mart__loans
    columns:
      <<: *columns_mart__loans

Overall, I believe that allowing columns to be defined as a mapping in addition to a sequence would make dbt's YAML files easier to read and maintain.

I am not aware of any internal design decisions within dbt that would make this feature impossible to implement. The change itself should be relatively simple: check the data type of the columns key, then process it accordingly in a generator that yields sequence items.
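To make the idea concrete, here is a minimal sketch of such a generator in plain Python. It is not dbt's actual parsing code: the function name `iter_columns` and the input shapes are assumptions made for illustration, and the dictionaries stand in for whatever a YAML loader that resolves the merge key would hand back.

```python
from collections.abc import Mapping
from typing import Any, Dict, Iterator


def iter_columns(columns: Any) -> Iterator[Dict[str, Any]]:
    """Yield column entries in the sequence-item shape ({"name": ..., ...}).

    Hypothetical helper, not part of dbt: accepts either the existing
    sequence form or the proposed mapping form of `columns`.
    """
    if columns is None:
        return
    if isinstance(columns, Mapping):
        for name, props in columns.items():
            # `first_column: null` parses to None, so treat it as "no properties".
            entry: Dict[str, Any] = {"name": name}
            entry.update(props or {})
            entry["name"] = name  # the mapping key always wins as the column name
            yield entry
    elif isinstance(columns, list):
        # Existing behaviour: items are already in the expected shape.
        yield from columns
    else:
        raise TypeError(f"columns must be a sequence or a mapping, got {type(columns).__name__}")


# Stand-in for the parsed `<<: *columns_mart__loans` example: shared columns
# first, followed by the columns specific to the history mart.
common = {
    "source_system": {"description": "asdf2", "tests": ["not_null"]},
    "source_system_id": {"tests": ["unique", "not_null"]},
}
history = {**common, "valid_from": {"description": "asdf5", "tests": ["not_null"]}, "valid_to": {"description": "asdf6"}}

history_names = [column["name"] for column in iter_columns(history)]
```

Because the generator always yields items in the existing sequence shape, any downstream code that consumes columns would not need to know which form the user wrote.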

Describe alternatives you've considered

YAML Limitations

As we know, YAML doesn't support flattening merged sequences, making it unsuitable for defining columns. (Reference: YAML Issue #35)

Additionally, YAMLScript is still in its early stages of development, so it may not be suitable for immediate use. (Reference: YAML Issue #48)

DBT's Built-in Feature

I believe DBT should avoid implementing too many YAML-specific features to prevent reinventing the wheel. Outsourcing more features allows DBT to focus on data transformation.

Custom Solution

The same reasoning applies here.

Who will this benefit?

This feature will benefit all dbt users who deal with large models that are exposed in multiple flavours, such as the state and history models in our case. It will also benefit users who want to define columns more flexibly by using YAML's native features, such as merge.

Are you interested in contributing this feature?

Yes

Anything else?

No response

AnithaG-Oak commented 3 months ago

+1 upvote

This would definitely help keep the YAML DRY. We have a few large models in our setup, and some of them share the same columns. The growing size of the YAML, with its repeated columns, has become a maintenance concern. Changing the list to a dict seems like a simple yet smart fix that would let us use the YAML merge feature.