Recipe for write-audit-publish / blue-green deploy

dbeatty10 commented 6 months ago

Contact Details

dbeatty10

I have read the dbt Developer Blog contribution guidelines.

[X] I have read the dbt Developer Blog contribution guidelines.

Which of these best describes you?

[ ] I am a dbt Community member or partner contributing to the Developer Blog
[X] I work for dbt Labs and am creating this issue for a community or marketing approved piece.

What is the topic of your post?

We talked about blue/green deployments in this Developer Blog post:

To defer or to clone, that is the question

But we haven't (yet!) provided some hands-on recipes for blue/green deployments that utilize dbt clone; we just linked here instead.

Link to an initial outline.

Wanna run your data tests before exposing your models to downstream consumers?

Then a blue/green deploy (also known as the write-audit-publish paradigm) is what you are looking for, and dbt clone is one way to accomplish it.

A blue/green deployment using dbt clone can build your models in a "blue" environment, run the data tests against it, and then surface the output in a "green" environment.

Recipes

See below for some (rough) recipes we can play around with which are all variants on a similar theme.

Basically, we can use a regular dbt build that targets a (stand-alone and immutable) staging environment, and then use dbt clone to update production references to point back to it.

I don't know which is "best" -- these are just a handful of different approaches that I've tried with toy projects. They would need to more fully assessed and put through their paces before applying to production workloads.

Recipe 1

This recipe assumes that the name of your target directory appears within clean-targets. The --full-refresh is crucial to overwrite any pre-existing objects in the green environment.

dbt clean
dbt build --target blue && \
cp target/manifest.json . && \
dbt clone --target green --state . --full-refresh

Recipe 2

Second verse, similar to the first. Only real change is using an entire folder for storing the state and using a different method to clean out that state directory in between executions.

rm -rf state-blue || true
dbt build --target blue && \
cp -r target state-blue && \
dbt clone --target green --state state-blue --full-refresh

Recipe 3

Largely cosmetic change that just switches the order of some of the commands.

dbt build --target blue && \
rm -rf state-blue || true && \
cp -rf target state-blue && \
dbt clone --target green --state state-blue --full-refresh

Recipe 4

This time we've included a helpful error message when any of the data tests don't pass (or the build fails for some other reason). Also experiments with alternative bash syntax for forcing a successful command, no matter what.

dbt build --target blue || { echo "Halting due to failed build"; false; } && \
{ rm -rf state-blue; true; } && \
cp -r target state-blue && \
dbt clone --target green --state state-blue --full-refresh

Project files

The models act as the initial "Write".

models/table_1.sql

{{ config(materialized="table") }}

select
    null as id,
    {{ dbt.string_literal(target.type) }} as target_type,
    {{ dbt.string_literal(this.schema) }} as this_schema,
    {{ dbt.string_literal(this.name) }} as this_name

models/view_1.sql

{{ config(materialized="view") }}

select
    {{ dbt.string_literal(target.type) }} as target_type,
    {{ dbt.string_literal(this.schema) }} as this_schema,
    {{ dbt.string_literal(this.name) }} as this_name

from {{ ref("table_1") }}

models/view_2.sql

{{ config(materialized="view", schema="my_custom_schema") }}

select
    {{ dbt.string_literal(target.type) }} as target_type,
    {{ dbt.string_literal(this.schema) }} as this_schema,
    {{ dbt.string_literal(this.name) }} as this_name

from {{ ref("view_1") }}

The data tests act as the "Audit".

models/_models.sql

models:
  - name: table_1
    columns:
      - name: id
        tests:
          - not_null

The "Publish" piece is purely through the dbt clone command (which uses the target definition in profiles.yml).

profiles.yml

postgres:
  target: blue

  outputs:
    blue:
      type: postgres
      dbname: 'postgres'
      host: 'localhost'
      port: 5432
      schema: dbt_blue_{{ modules.datetime.datetime.now().timestamp() | int }}

    green:
      type: postgres
      dbname: 'postgres'
      host: 'localhost'
      port: 5432
      schema: dbt_green

And generating the relevant database object names is via a custom generate_schema_name macro (and supporting macros).

macros/get_custom_schema.sql

{% macro check_if_production() -%}
    {%- set is_production = target.name == 'green' -%}
    {% do return(is_production) %}
{%- endmacro %}

{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}

    {%- if custom_schema_name is none -%}
        {{ generate_schema_name_not_custom(custom_schema_name, node, default_schema) }}
    {%- else -%}
        {{ generate_schema_name_custom(custom_schema_name, node, default_schema) }}
    {%- endif -%}

{%- endmacro %}

{% macro generate_schema_name_not_custom(custom_schema_name, node, default_schema) -%}
    {{ default_schema }}
{%- endmacro %}

{% macro generate_schema_name_custom(custom_schema_name, node, default_schema) -%}
    {%- set is_production = check_if_production() -%}

    {%- if is_production -%}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}

{%- endmacro %}

Outstanding Problems

After each deploy to green, we've got (at least) 2x the number of database objects versus just the blue environment.
And every time we do a blue/green deployment, we've got an old schema no longer in use that will need to be cleaned up due to generating brand-new objects each deploy rather than re-using the previous objects in-place.
I didn't test out incremental types of resources like snapshots, incremental models, or materialized views.
- I'm guessing these could have trouble if we keep the non-repeatable dynamic timestamp-based blue schema in profiles.yml, but will work just fine if we use a repeatable and constant schema name like dbt_blue

History

The general pattern has probably been practiced in various forms for longer than we'll ever know. But it's been popularized in the collective consciousness under the monikers "blue/green" and "W.A.P.".

Blue/green deployments

Write-audit-publish

https://www.youtube.com/watch?v=fXHdeBnpXrg
- https://www.slideshare.net/MichelleFUfford/scaling-data-quality-netflix-76917740

dbt

Gilding the lily

Someone with the last name name Blue that was an adherent of WAP at Netflix:

https://www.mail-archive.com/dev@iceberg.apache.org/msg00738.html

bbrewington commented 6 months ago

This is awesome! Look forward to referencing the post once it's written, audited, and published

Would be cool to include how to do this with both dbt Core & Cloud. Which might involve explaining all the different places target.name and env_var('DBT_TARGET') can be defined in Cloud - and if it makes sense to split this out as a separate post or docs page, glad to write it up!

Locations resolving to env_var('DBT_TARGET')

Environment Variables
Job > Advanced Settings > Environment variables > Override

Locations resolving to target.name

Profile > Credentials (for project) > Target name
profiles.yml in repo
Job > Target name

dbeatty10 commented 6 months ago

This is awesome! Look forward to referencing the post once it's written, audited, and published

See what you did there ❤️

Your cross-walk for places to define target.name and env_var('DBT_TARGET') is golden 🤩

segoldma commented 2 months ago

Curious if there is a preferred way to do this in dbt cloud:

without overriding ref.sql
without chaining two separate jobs together (given the need to build to one target and clone to another)

dbt-labs / docs.getdbt.com