datacontract / datacontract-cli

CLI to manage your datacontract.yaml files
https://cli.datacontract.com
Other
428 stars 85 forks source link

export --format dbt #37

Closed simonharrer closed 6 months ago

simonharrer commented 7 months ago

This issue is about exporting a data contract to the dbt format.

adaminsta commented 7 months ago

What is the status on DBT export? @simonharrer. My team use DBT+Snowflake are looking into a few alternatives for data contract implementations like this cli, paypal etc

simonharrer commented 7 months ago

@adaminsta I've started with https://github.com/datacontract/cli/pull/38 and want to release a first version of the dbt export on Friday afternoon.

Would love your thoughts and ideas on the dbt export. Any expectations already in your mind?

adaminsta commented 7 months ago

We want to build a dbt model (in Snowflake) on top of the contract. From the datacontract we are creating a dbt source, dbt schema (with an owner, contract, etc), and the staging model.

Can elaborate more later.

simonharrer commented 7 months ago

We currently have the following export behaviour implemented in the draft PR:

datacontract.yaml

dataContractSpecification: 0.9.2
id: orders-unit-test
info:
  title: Orders Unit Test
  version: 1.0.0
models:
  orders:
    fields:
      order_id:
        type: varchar
        unique: true
        required: true
      order_total:
        type: bigint
        required: true

datacontract export --format dbt datacontract.yaml

version: 2
models:
  - name: orders    
    config:
      materialized: table
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: text
        constraints:
          - type: not_null
          - type: unique
      - name: order_total
        data_type: integer
        constraints:
          - type: not_null    
simonharrer commented 7 months ago

Just merged a first implementation of the dbt export. It currently has only a mapping for snowflake but already supports all fields in the model. Does this already help you? What is missing for you @adaminsta ?

adaminsta commented 7 months ago

Our idea is to use this tool to automatically create the resources and tables/models needed with CI/CD, based on the contract.

Getting the dbt schema is very useful. To make it even more powerful: 1) Generate a dbt source with the model name as table, based on the snowflake database and schema e.g:

models:
  - name: orders    
    config:
      materialized: table
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: text
        constraints:
          - type: not_null
          - type: unique
[...]
servers:
  endpoint:
    type: snowflake
    account: xx0000
    database: MY_DB
    schema: MY_SCHEMA

provides

sources:
  - name: contract_name
    database: my_db  
    schema: my_schema  
    tables:
      - name: orders

2) Generate a staging model In our automation process we'd also like to push a PR to our DBT repo in order to remove the bottle of the data engineer. Therefore, it would be useful to also generate a staging model that selects all columns from the source.

{{ config(
    materialized="table"
) }}

select 
    order_id
    [...]
from {{ source('contract_name', 'orders')}}
simonharrer commented 7 months ago

Thanks for sharing your ideas. I think your feature ideas should be included in the tool. Your examples help to drive this! Thank you.

adaminsta commented 7 months ago

(Issue created)

Also, this is not related to dbt format but we are consuming this from a pub-sub topic and putting it into snowflake, but there is no pubsub option for servers. Need something like this: source: type: pubsub project: gcp_project_name name: my_topic

pluttgens commented 6 months ago

Hello everyone, stumbled upon this issue as I have the same need as @adaminsta as far as I understand.

Would it make sense that the generated source includes more info from the contract, such as the column names and metadata ? That would greatly reduce the documentation work and have the information propagated to dbt generated docs as well.

I'd rather have it generate too much (source yaml, staging model yaml and sql selecting all the fields as @adaminsta suggested) and remove what I do not actually need than having to add what I'm missing.

adaminsta commented 6 months ago

@simonharrer I've got some more issues when trying to test and export to DBT. Maybe we could have a quick call? Sent you an email as well from instabox email

simonharrer commented 6 months ago

I did not receive an email. Did you send one to simon.harrer@innoq.com ?

simonharrer commented 6 months ago

Easiest would be if you would join the data contract slack: https://datacontract.com/slack

There, we could easily set up a call.

simonharrer commented 6 months ago

I've just added the dbt-sources export in a very early implementation: https://github.com/datacontract/cli/commit/ac19d27f2b7f31cb2ab9971338bb25918c2665fb

How to use:

datacontract export --format dbt-sources

and this would result in:

version: 2
sources:
  - name: orders-unit-test
    description: The orders data contract  
    database: my-database
    schema: my-schema  
    tables:
      - name: orders 
        description: The orders model
        columns:
          - name: order_id
            tests:
              - dbt_expectations.dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: VARCHAR
              - not_null
              - unique
              - dbt_expectations.expect_column_value_lengths_to_be_between:
                  min_value: 8
                  max_value: 10
              - dbt_expectations.expect_column_values_to_match_regex:
                  regex: ^B[0-9]+$      
            meta:
              classification: sensitive
              pii: true
            tags:
              - order_id
          - name: order_total
            description: The order_total field
            tests:
              - dbt_expectations.dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: NUMBER
              - not_null
              - dbt_expectations.expect_column_values_to_be_between:
                   min_value: 0
                   max_value: 1000000
          - name: order_status
            tests:
              - dbt_expectations.dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: TEXT
              - not_null
              - accepted_values:
                  values:
                    - 'pending'
                    - 'shipped'
                    - 'delivered'

Please have a look. Feedback is appreciated! :-)

simonharrer commented 6 months ago

And we've added support for dbt-staging-sql export: https://github.com/datacontract/cli/commit/7cea40e43b0231acd95a55cbb8269ed2a0d7f4b7

How to use:

datacontract export --format dbt-staging-sql

And this would result in

select 
    order_id,
    order_total,
    order_status
from {{ source('orders-unit-test', 'orders') }}
simonharrer commented 6 months ago

@adaminsta and @pluttgens would love your feedback on the two features that we just implemented.

simonharrer commented 6 months ago

We now have the three dbt export features: for dbt models, dbt sources, and dbt staging sql. Create a new issue when you want to improve on the export features, or add a new dbt export feature.