dbt-labs / docs.getdbt.com

The code behind docs.getdbt.com
https://docs.getdbt.com/
Apache License 2.0
113 stars 863 forks source link

Recipe for write-audit-publish / blue-green deploy #4648

Open dbeatty10 opened 6 months ago

dbeatty10 commented 6 months ago

Contact Details

dbeatty10

I have read the dbt Developer Blog contribution guidelines.

Which of these best describes you?

What is the topic of your post?

We talked about blue/green deployments in this Developer Blog post:

But we haven't (yet!) provided some hands-on recipes for blue/green deployments that utilize dbt clone; we just linked here instead.

Link to an initial outline.

Wanna run your data tests before exposing your models to downstream consumers?

Then a blue/green deploy (also known as the write-audit-publish paradigm) is what you are looking for, and dbt clone is one way to accomplish it.

A blue/green deployment using dbt clone can build your models in a "blue" environment, run the data tests against it, and then surface the output in a "green" environment.

Recipes

See below for some (rough) recipes we can play around with which are all variants on a similar theme.

Basically, we can use a regular dbt build that targets a (stand-alone and immutable) staging environment, and then use dbt clone to update production references to point back to it.

I don't know which is "best" -- these are just a handful of different approaches that I've tried with toy projects. They would need to more fully assessed and put through their paces before applying to production workloads.

Recipe 1

This recipe assumes that the name of your target directory appears within clean-targets. The --full-refresh is crucial to overwrite any pre-existing objects in the green environment.

dbt clean
dbt build --target blue && \
cp target/manifest.json . && \
dbt clone --target green --state . --full-refresh

Recipe 2

Second verse, similar to the first. Only real change is using an entire folder for storing the state and using a different method to clean out that state directory in between executions.

rm -rf state-blue || true
dbt build --target blue && \
cp -r target state-blue && \
dbt clone --target green --state state-blue --full-refresh

Recipe 3

Largely cosmetic change that just switches the order of some of the commands.

dbt build --target blue && \
rm -rf state-blue || true && \
cp -rf target state-blue && \
dbt clone --target green --state state-blue --full-refresh

Recipe 4

This time we've included a helpful error message when any of the data tests don't pass (or the build fails for some other reason). Also experiments with alternative bash syntax for forcing a successful command, no matter what.

dbt build --target blue || { echo "Halting due to failed build"; false; } && \
{ rm -rf state-blue; true; } && \
cp -r target state-blue && \
dbt clone --target green --state state-blue --full-refresh

Project files

The models act as the initial "Write".

models/table_1.sql

{{ config(materialized="table") }}

select
    null as id,
    {{ dbt.string_literal(target.type) }} as target_type,
    {{ dbt.string_literal(this.schema) }} as this_schema,
    {{ dbt.string_literal(this.name) }} as this_name

models/view_1.sql

{{ config(materialized="view") }}

select
    {{ dbt.string_literal(target.type) }} as target_type,
    {{ dbt.string_literal(this.schema) }} as this_schema,
    {{ dbt.string_literal(this.name) }} as this_name

from {{ ref("table_1") }}

models/view_2.sql

{{ config(materialized="view", schema="my_custom_schema") }}

select
    {{ dbt.string_literal(target.type) }} as target_type,
    {{ dbt.string_literal(this.schema) }} as this_schema,
    {{ dbt.string_literal(this.name) }} as this_name

from {{ ref("view_1") }}

The data tests act as the "Audit".

models/_models.sql

models:
  - name: table_1
    columns:
      - name: id
        tests:
          - not_null

The "Publish" piece is purely through the dbt clone command (which uses the target definition in profiles.yml).

profiles.yml

postgres:
  target: blue

  outputs:
    blue:
      type: postgres
      dbname: 'postgres'
      host: 'localhost'
      port: 5432
      schema: dbt_blue_{{ modules.datetime.datetime.now().timestamp() | int }}

    green:
      type: postgres
      dbname: 'postgres'
      host: 'localhost'
      port: 5432
      schema: dbt_green

And generating the relevant database object names is via a custom generate_schema_name macro (and supporting macros).

macros/get_custom_schema.sql

{% macro check_if_production() -%}
    {%- set is_production = target.name == 'green' -%}
    {% do return(is_production) %}
{%- endmacro %}

{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}

    {%- if custom_schema_name is none -%}
        {{ generate_schema_name_not_custom(custom_schema_name, node, default_schema) }}
    {%- else -%}
        {{ generate_schema_name_custom(custom_schema_name, node, default_schema) }}
    {%- endif -%}

{%- endmacro %}

{% macro generate_schema_name_not_custom(custom_schema_name, node, default_schema) -%}
    {{ default_schema }}
{%- endmacro %}

{% macro generate_schema_name_custom(custom_schema_name, node, default_schema) -%}
    {%- set is_production = check_if_production() -%}

    {%- if is_production -%}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}

{%- endmacro %}

Outstanding Problems

History

The general pattern has probably been practiced in various forms for longer than we'll ever know. But it's been popularized in the collective consciousness under the monikers "blue/green" and "W.A.P.".

Blue/green deployments

Write-audit-publish

dbt

Gilding the lily

Someone with the last name name Blue that was an adherent of WAP at Netflix:

bbrewington commented 6 months ago

This is awesome! Look forward to referencing the post once it's written, audited, and published

Would be cool to include how to do this with both dbt Core & Cloud. Which might involve explaining all the different places target.name and env_var('DBT_TARGET') can be defined in Cloud - and if it makes sense to split this out as a separate post or docs page, glad to write it up!

Locations resolving to env_var('DBT_TARGET')

Locations resolving to target.name

dbeatty10 commented 6 months ago

This is awesome! Look forward to referencing the post once it's written, audited, and published

See what you did there ❤️

Your cross-walk for places to define target.name and env_var('DBT_TARGET') is golden 🤩

segoldma commented 2 months ago

Curious if there is a preferred way to do this in dbt cloud: