dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[Feature] No more jinja block for snapshots - new snapshot design ideas #10246

Closed graciegoheen closed 6 days ago

graciegoheen commented 4 months ago


Current State

To configure a snapshot currently, you must nest your configuration and SQL within a snapshot jinja block like so:

{% snapshot orders_snapshot %}

{{
    config(
      target_database='analytics',
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

Why? (you might ask)

The story begins…

Snapshots are a really ancient dbt feature -- implemented as dbt archive in https://github.com/dbt-labs/dbt-core/pull/183 and first released in 0.5.1, just two days shy of dbt's 6 month anniversary.

In those early days, there were no snapshot blocks or snapshots/*.sql files.

Instead, they were originally declared within dbt_project.yml like this:

archive:
    - source_schema: synced_production_data  # schema to look for tables in (declared below)
      target_schema: dbt_archive             # where to archive the data to
      tables:
        - source_table: some_other_table
          target_table: some_other_table_archive
          updated_at: "updatedAt"
          unique_key: "id"

A glow up

https://github.com/dbt-labs/dbt-core/issues/1175 and https://github.com/dbt-labs/dbt-core/pull/1361 allowed snapshots to escape YAML Land and become:

select statements, defined within a snapshot block in a .sql file

{% snapshot your_snapshot_name_here %}

{{ config(
    target_database='<optional database name>',
    target_schema='<schema name>',
    target_table='<table name>',
    strategy='check',
    unique_key='id',
    check_cols=['object_status', 'object_name'],
) }}

-- your select statement here

{% endsnapshot %}

At the time the thought was, “we should/will reimplement all the resources like this” (so that you could define multiple “model blocks” in a single file).

It turns out that defining multiple resources in one file comes with real downsides, and so in the lead-up to v1.0 this rework wasn't a priority; in the end, it was decided that it wasn't really even desirable.

Future State

WE ARE GOING WITH OPTION 1

Option 1: Snapshots are just yml configs, they contain no logic (like exposures, sources, tests, etc.)

# snapshots/my_snapshots.yml
snapshots:
  - name: orders_snapshot
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      database: analytics
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
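
With either design, downstream models keep referencing the snapshot through `ref`, exactly as they do today. A minimal sketch (the staging model name is hypothetical; `dbt_valid_from` / `dbt_valid_to` are the standard snapshot meta columns):

```sql
-- models/staging/stg_orders_history.sql (hypothetical model name)
-- the snapshot is selected like any other upstream node
select
    id,
    dbt_valid_from,
    dbt_valid_to
from {{ ref('orders_snapshot') }}
```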

Option 2: Snapshots are just models, they are a materialization (like incremental, view, table, etc.)

# models/my_snapshot.sql

{{
    config(
      target_database='analytics',
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at',
      materialized='snapshot'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

Option 3: Snapshots are just sql files in the snapshots folder, but they don’t use jinja blocks (one .sql file per snapshot)

# snapshots/my_snapshot.sql

{{
    config(
      target_database='analytics',
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

Which is best?

| | Option 1: Snapshots are just yml configs | Option 2: Snapshots are just models | Option 3: Snapshots are just sql files |
|---|:---:|:---:|:---:|
| enforce best practice that snapshot is just `select *` | X | | |
| can have SQL logic being snapshotted | X (snapshot an ephemeral model with logic) | X | X |
| easy migration path for users (does this matter if we keep prior art for backwards compatibility?) | | X | X |
| mapping new to old (define multiple snapshots in 1 file, etc.) | X | | |
| minimal consequences of other things we'd need to change / consider (commands, configs, etc.) | X | | X |
| define your snapshots in your models directory | X | X | |

Notes

Related issues

https://github.com/dbt-labs/dbt-core/issues/4761 https://github.com/dbt-labs/dbt-core/issues/9033

christineberger commented 3 months ago

Personally I like option 2! As far as the questions on that one, here are my thoughts:

tommyh commented 1 month ago

how strongly do we believe in the “snapshots should only be select *” best practice

IMHO, this is the most important question of this epic, and I think you can't truly answer the questions like "dbt run vs dbt snapshot" without answering the "best practice" question.

I believe there are 2 very very different types of snapshots:

1) If I take a snapshot of a raw source model, that data is still "raw" in nature. I don't want it exposed in the ANALYTICS database at all because it hasn't been cleaned. I believe that if you are snapshotting data from RAW the target should be RAW_SNAPSHOTS.

If I take a snapshot of a dbt model dim_users or top_users_by_profile_type so that I can have history of a "business model", I do want those written to ANALYTICS database.

2) Snapshotting a raw source model is "essentially" getting Fivetran History Mode for a data source where FHM is too expensive or not possible. While there are a few differences between dbt snapshots and FHM, the biggest one is the granularity of the Type2 data captured. Let's imagine I have Fivetran syncing postgres data every 15 minutes, but I have dbt cloud doing an analytics dbt run every hour.

For a snapshot of a raw source model, I would want those snapshots to run after every fivetran sync (every 15 minutes or so).

For a snapshot of a dbt model, I would want those snapshots to run during every analytics dbt run (every hour or so).

3) This brings me to the biggest difference: I want "raw snapshots" to run at the beginning of my DAG, so that they can be referenced by staging models, but I want "dbt model snapshots" to run at the end of my DAG, so that I can have history data on top_users_by_profile_type.

4) Then there is the question of "should my snapshot models have a developer schema?". In general, I think the answer is "yes", because I don't want things which developers are doing in the dbt Cloud IDE to affect production.
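
One way to get a developer-schema prefix today, assuming snapshots resolve their schema through `generate_schema_name` (as with the `schema` config, rather than the legacy `target_schema`, which is applied literally), is to override that macro in the project; this is a sketch of the documented pattern, and the `prod` target name is an assumption:

```sql
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- elif target.name == 'prod' -%}
        {# in production, use the configured schema as-is #}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {# in dev, prefix with the developer's schema #}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}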

While this pushes me towards the snapshot target should have the developer schema prefix, even this behavior varies drastically between (what I consider) the 2 different types of snapshots:

With all of the above ramblings, there is a very strong overlap with orchestration and data mesh... which I do not take lightly. The thing I'm curious about: if the "sensible defaults" for "raw snapshots" vs "model snapshots" were clear enough:

NOTE: It is possible to achieve all of the above with the current snapshots implementation through very careful use of tags and folder structure, but the result is not only fairly brittle but also very confusing to understand.
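
For illustration, the tags-and-folders workaround could look roughly like this (project, folder, and tag names are all hypothetical):

```yaml
# dbt_project.yml
snapshots:
  my_project:
    raw:                        # snapshots/raw/ -> run after every Fivetran sync
      +tags: ["raw_snapshot"]
    marts:                      # snapshots/marts/ -> run with the hourly dbt run
      +tags: ["model_snapshot"]
```

Each job then selects its own subset, e.g. `dbt snapshot --select tag:raw_snapshot` on the 15-minute schedule and `dbt snapshot --select tag:model_snapshot` alongside the hourly run.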

FishtownBuildBot commented 6 days ago

Opened a new issue in dbt-labs/docs.getdbt.com: https://github.com/dbt-labs/docs.getdbt.com/issues/6122

peterallenwebb commented 6 days ago

This is now done and merged. I updated Option 1 in the issue description to match the implementation. Most properties remain in the config section under the snapshot. This matches the way those same properties are set when defining snapshots via the existing SQL method.