dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[Feature] No more jinja block for snapshots - new snapshot design ideas #10246

Closed graciegoheen closed 6 days ago

graciegoheen commented 4 months ago


Current State

To configure a snapshot currently, you must nest your configuration and SQL within a snapshot jinja block like so:

{% snapshot orders_snapshot %}

{{
    config(
      target_database='analytics',
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

Why? (you might ask)

The story begins…

Snapshots are a really ancient dbt feature -- implemented as dbt archive in https://github.com/dbt-labs/dbt-core/pull/183 and first released in 0.5.1, just two days shy of dbt's 6 month anniversary.

In those early days, there were no snapshot blocks or snapshots/*.sql files.

Instead, they were originally declared within dbt_project.yml like this:

archive:
    - source_schema: synced_production_data  # schema to look for tables in (declared below)
      target_schema: dbt_archive             # where to archive the data to
      tables:
        - source_table: some_other_table
          target_table: some_other_table_archive
          updated_at: "updatedAt"
          unique_key: "id"

A glow up

https://github.com/dbt-labs/dbt-core/issues/1175 and https://github.com/dbt-labs/dbt-core/pull/1361 allowed snapshots to escape YAML Land and become:

select statements, defined within a snapshot block in a .sql file

{% snapshot your_snapshot_name_here %}

{{ config(
    target_database='<optional database name>',
    target_schema='<schema name>',
    target_table='<table name>',
    strategy='check',
    unique_key='id',
    check_cols=['object_status', 'object_name'],
) }}

-- your select statement here

{% endsnapshot %}

At the time the thought was, “we should/will reimplement all the resources like this” (so that you could define multiple “model blocks” in a single file).

It turns out that defining multiple resources in one file comes with real downsides, and so in the lead-up to v1.0 this rework wasn't a priority; in the end, it was decided that it wasn't really even desirable.

Future State

WE ARE GOING WITH OPTION 1

Option 1: Snapshots are just yml configs, they contain no logic (like exposures, sources, tests, etc.)

# snapshots/my_snapshots.yml
snapshots:
  - name: orders_snapshot
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      database: analytics
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
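
With either design, downstream models keep referencing the snapshot through `ref`, exactly as they do today. A minimal sketch (the staging model name is hypothetical; `dbt_valid_from` / `dbt_valid_to` are the standard snapshot meta columns):

```sql
-- models/staging/stg_orders_history.sql (hypothetical model name)
-- the snapshot is selected like any other upstream node
select
    id,
    dbt_valid_from,
    dbt_valid_to
from {{ ref('orders_snapshot') }}
```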

Option 2: Snapshots are just models, they are a materialization (like incremental, view, table, etc.)

# models/my_snapshot.sql

{{
    config(
      target_database='analytics',
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at',
      materialized='snapshot'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

Option 3: Snapshots are just sql files in the snapshots folder, but they don’t use jinja blocks (one .sql file per snapshot)

# snapshots/my_snapshot.sql

{{
    config(
      target_database='analytics',
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

Which is best?

| | Option 1: Snapshots are just yml configs | Option 2: Snapshots are just models | Option 3: Snapshots are just sql files |
|---|:---:|:---:|:---:|
| enforce best practice that snapshot is just `select *` | X | | |
| can have SQL logic being snapshotted | X (snapshot an ephemeral model with logic) | X | X |
| easy migration path for users (does this matter if we keep prior art for backwards compatibility?) | | X | X |
| mapping new to old (define multiple snapshots in 1 file, etc.) | X | | |
| minimal consequences of other things we'd need to change / consider (commands, configs, etc.) | X | | X |
| define your snapshots in your models directory | X | X | |

Notes

Related issues

https://github.com/dbt-labs/dbt-core/issues/4761 https://github.com/dbt-labs/dbt-core/issues/9033

christineberger commented 3 months ago

Personally I like option 2! As far as the questions on that one, here are my thoughts:

tommyh commented 1 month ago

how strongly do we believe in the “snapshots should only be select *” best practice

IMHO, this is the most important question of this epic, and I think you can't truly answer the questions like "dbt run vs dbt snapshot" without answering the "best practice" question.

I believe there are 2 very very different types of snapshots:

1) If I take a snapshot of a raw source model, that data is still "raw" in nature. I don't want it exposed in the ANALYTICS database at all because it hasn't been cleaned. I believe that if you are snapshotting data from RAW the target should be RAW_SNAPSHOTS.

If I take a snapshot of a dbt model dim_users or top_users_by_profile_type so that I can have history of a "business model", I do want those written to ANALYTICS database.

2) Snapshotting a raw source model is "essentially" getting Fivetran History Mode for a data source where FHM is too expensive or not possible. While there are a few differences between dbt snapshots and FHM, the biggest one is the granularity of the Type2 data captured. Let's imagine I have Fivetran syncing postgres data every 15 minutes, but I have dbt cloud doing an analytics dbt run every hour.

For a snapshot of a raw source model, I would want those snapshots to run after every fivetran sync (every 15 minutes or so).

For a snapshot of a dbt model, I would want those snapshots to run during every analytics dbt run (every hour or so).

3) This brings me to the biggest difference: I want "raw snapshots" to run at the beginning of my DAG, so that they can be referenced by staging models, but I want "dbt model snapshots" to run at the end of my DAG, so that I can have history data on top_users_by_profile_type.

4) Then there is the question of "should my snapshot models have a developer schema?". In general, I think the answer is "yes", because I don't want things which developers are doing in the dbt Cloud IDE to affect production.
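
One way to get a developer-schema prefix today, assuming snapshots resolve their schema through `generate_schema_name` (as with the `schema` config, rather than the legacy `target_schema`, which is applied literally), is to override that macro in the project; this is a sketch of the documented pattern, and the `prod` target name is an assumption:

```sql
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- elif target.name == 'prod' -%}
        {# in production, use the configured schema as-is #}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {# in dev, prefix with the developer's schema #}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}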

While this pushes me towards the snapshot target should have the developer schema prefix, even this behavior varies drastically between (what I consider) the 2 different types of snapshots:

With all of the above ramblings, there is a very strong overlap with orchestration and data mesh... which I do not take lightly. The thing I'm curious about: if the "sensible defaults" for "raw snapshots" vs "model snapshots" were clear enough:

NOTE: It is possible to achieve all of the above with the current snapshots implementation through very careful use of tags and folder structure, but the result is not only fairly brittle but also very confusing to understand.
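
For illustration, the tags-and-folders workaround could look roughly like this (project, folder, and tag names are all hypothetical):

```yaml
# dbt_project.yml
snapshots:
  my_project:
    raw:                        # snapshots/raw/ -> run after every Fivetran sync
      +tags: ["raw_snapshot"]
    marts:                      # snapshots/marts/ -> run with the hourly dbt run
      +tags: ["model_snapshot"]
```

Each job then selects its own subset, e.g. `dbt snapshot --select tag:raw_snapshot` on the 15-minute schedule and `dbt snapshot --select tag:model_snapshot` alongside the hourly run.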

FishtownBuildBot commented 6 days ago

Opened a new issue in dbt-labs/docs.getdbt.com: https://github.com/dbt-labs/docs.getdbt.com/issues/6122

peterallenwebb commented 6 days ago

This is now done and merged. I updated Option 1 in the issue description to match the implementation. Most properties remain in the config section under the snapshot. This matches the way those same properties are set when defining snapshots via the existing SQL method.