Closed graciegoheen closed 6 days ago
Personally I like option 2! As far as the questions on that one, here are my thoughts:
if snapshots are just models, would they now be included in dbt run? vs. dbt snapshot?
No, they still function like snapshots because the nature of snapshots needs to stay intentional. With an incremental (as long as you set it up correctly), you are only adding the delta or updating, which will happen at some point or another anyway. However, a snapshot is literally just like a picture! Folks tend to want the "clean" version many more times than they want the blurry shots (i.e., once a day).
how would we handle snapshot-paths? would you be able to put them in your models folder? or only snapshots folder?
I have had customers who wanted to configure snapshot models within the models folder. I personally don't like the thought of too much flexibility on that, i.e., you could have snapshot sql files right next to your every-day any-time models. However, I do like the idea of having them next to the models in a defined way like models > snapshots or even models > subfolder > snapshots. Is it possible for this to be flexible yet restricted to a folder called 'snapshots', somewhat like we do with tests > generic?
what would happen to their resource_type - model or snapshot (incremental is model)? if we went with model, that would be an id change, adjustment to selector syntax for dbt build, DBT_EXCLUDE_RESOURCE_TYPE wouldn’t work for excluding snapshots, etc.
I think it would still be a snapshot because of what I said for bullet 1. Snapshots are actually "building" a source of data that otherwise wouldn't exist, whereas with an incremental (even though you can configure it like a snapshot), that use case is generally more forgiving as the data we recommend already has history that can always be referred back to. I think it makes sense that we stay somewhat protective over why we don't define it as any old model.
how strongly do we believe in the “snapshots should only be select *” best practice
IMHO, this is the most important question of this epic, and I think you can't truly answer the questions like "dbt run vs dbt snapshot" without answering the "best practice" question.
I believe there are 2 very very different types of snapshots:
1) If I take a snapshot of a raw source model, that data is still "raw" in nature. I don't want it exposed in the ANALYTICS database at all because it hasn't been cleaned. I believe that if you are snapshotting data from RAW, the target should be RAW_SNAPSHOTS.
If I take a snapshot of a dbt model dim_users or top_users_by_profile_type so that I can have history of a "business model", I do want those written to the ANALYTICS database.
2) Snapshotting a raw source model is "essentially" getting Fivetran History Mode for a data source where FHM is too expensive or not possible. While there are a few differences between dbt snapshots and FHM, the biggest one is the granularity of the Type2 data captured. Let's imagine I have Fivetran syncing postgres data every 15 minutes, but I have dbt cloud doing an analytics dbt run every hour.
For a snapshot of a raw source model, I would want those snapshots to run after every fivetran sync (every 15 minutes or so).
For a snapshot of a dbt model, I would want those snapshots to run during every analytics dbt run (every hour or so).
3) This brings me to the biggest difference: I want "raw snapshots" to run at the beginning of my DAG, so that they can be referenced by staging models, but I want "dbt model snapshots" to run at the end of my DAG, so that I can have history data on top_users_by_profile_type.
4) Then there is the question of "should my snapshot models have a developer schema?". In general, I think the answer is "yes", because I don't want things which developers are doing in dbt cloud ide to affect production.
If a developer runs dbt snapshot in dbt cloud ide AND everything is exactly the same, while this won't "break things" per se, it will mean that snapshot model will have been run more recently than the other snapshot models, which is risky.
If a developer runs dbt snapshot in dbt cloud ide for a top_users_by_profile_type and this updates the ANALYTICS target, that is horrible.
While this pushes me towards the snapshot target having the developer schema prefix, even this behavior varies drastically between (what I consider) the 2 different types of snapshots:
snapshot of a raw source: if I want to run a model which has a snapshot of a raw source in its upstream, then I have to manually do a dbt snapshot before running my model, because a dbt run +my_model won't run the snapshot which is backing the staging model. This is error prone and frustrating. In the same way that I can do a dbt run without messing with the RAW database, I should be able to do a dbt run without messing with the RAW_SNAPSHOTS database. In this case, using the "shared" snapshot data (dbt_prod) is what I want.
snapshot of dbt model: If I have a snapshot model on top_users_by_profile_type, I want to be able to run my snapshot over and over again in my developer target schema, so that I know it works.
With all of the above ramblings, there is a very strong overlap with orchestration and data mesh, which I do not take lightly. The thing I'm curious about: if the "sensible defaults" for "raw snapshots" vs "model snapshots" were clear enough:
dbt build
NOTE: It is possible to solve all of the above with the current snapshots implementation with very careful use of tags and folder structure, but it's not only fairly brittle but also very confusing to understand.
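As a rough sketch of the tags-and-folders workaround described in the note above, a dbt_project.yml could split the two kinds of snapshots by subfolder. The project name, folder names, and tag names here are hypothetical, and the database names simply reuse the RAW_SNAPSHOTS and ANALYTICS examples from this comment; this is one possible layout, not a recommended one.

```yaml
# dbt_project.yml (sketch only; my_project, raw/, marts/ and the tag names are hypothetical)
snapshots:
  my_project:
    raw:
      # snapshots/raw/: snapshots taken directly on raw sources
      +tags: ["raw_snapshot"]
      +target_database: raw_snapshots
    marts:
      # snapshots/marts/: snapshots taken on finished dbt models
      +tags: ["model_snapshot"]
      +target_database: analytics
```

With a layout like this, a job that follows each Fivetran sync could run dbt snapshot --select tag:raw_snapshot, while the hourly analytics job could run dbt snapshot --select tag:model_snapshot after its dbt run. The brittleness mentioned above comes from having to keep folders, tags, and job commands in step by hand.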
Opened a new issue in dbt-labs/docs.getdbt.com: https://github.com/dbt-labs/docs.getdbt.com/issues/6122
This is now done and merged. I updated Option 1 in the issue description to match the implementation. Most properties remain in the config section under the snapshot. This matches the way those same properties are set when defining snapshots via the existing SQL method.
Is this your first time submitting a feature request?
Current State
To configure a snapshot currently, you must nest your configuration and SQL within a snapshot jinja block, that is, between {% snapshot snapshot_name %} and {% endsnapshot %} tags.
Why? (you might ask)
The story begins…
Snapshots are a really ancient dbt feature -- implemented as dbt archive in https://github.com/dbt-labs/dbt-core/pull/183 and first released in 0.5.1, just two days shy of dbt's 6 month anniversary.
There were no snapshot blocks and snapshots/*.sql files in these early days.
Instead, they were originally declared within dbt_project.yml.
A glow up
https://github.com/dbt-labs/dbt-core/issues/1175 and https://github.com/dbt-labs/dbt-core/pull/1361 allowed snapshots to escape YAML Land and become:
At the time the thought was, “we should/will reimplement all the resources like this” (so that you could define multiple “model blocks” in a single file).
Turns out that defining multiple resources in one file makes
and so in the leadup to v1.0, it wasn’t a priority to do this rework, and we finally decided it wasn’t really even desirable.
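For illustration, the early YAML-only declaration in dbt_project.yml mentioned above looked roughly like the following. The key names are recalled from the pre-0.13 archive documentation, and the schema and table names are placeholders, so treat this as a sketch rather than an exact copy of the original example.

```yaml
# dbt_project.yml, early "dbt archive" style (sketch; schema and table names are placeholders)
archive:
  - source_schema: production_data   # schema holding the tables to archive
    target_schema: dbt_archive       # schema where the archived history is written
    tables:
      - source_table: users
        target_table: users_archived
        updated_at: updated_at       # column used to detect changed rows
        unique_key: id               # primary key of the source table
```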
Future State
WE ARE GOING WITH OPTION 1
Option 1: Snapshots are just yml configs, they contain no logic (like exposures, sources, tests, etc.)
select * from {{ source('jaffle_shop', 'orders') }}
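A minimal sketch of what an Option 1 snapshot could look like in YAML, assuming a relation key stands in for the select * query and that the remaining properties sit under config, as the comment about the merged implementation describes. The file name, snapshot name, and column names are hypothetical, and the relation key itself is an assumption not spelled out in this thread.

```yaml
# snapshots/orders_snapshot.yml (hypothetical file and snapshot names)
snapshots:
  - name: orders_snapshot
    # assumed key: stands in for select * from {{ source('jaffle_shop', 'orders') }}
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots        # hypothetical target schema
      unique_key: id           # hypothetical primary key column
      strategy: timestamp
      updated_at: updated_at   # hypothetical column tracking last modification
```

The config keys mirror the ones set in the existing SQL method, which is why a snapshot that only selects everything from one relation needs no SQL file at all under this option.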
Option 2: Snapshots are just models, they are a materialization (like incremental, view, table, etc.)
if snapshots are just models, would they now be included in dbt run? vs. dbt snapshot?
how would we handle snapshot-paths? would you be able to put them in your models folder? or only snapshots folder?
what would happen to their resource_type - model or snapshot (incremental is model)? if we went with model, that would be an id change, adjustment to selector syntax for dbt build, DBT_EXCLUDE_RESOURCE_TYPE wouldn’t work for excluding snapshots, etc.
materialized='snapshot' or materialized='scd'
Option 3: Snapshots are just sql files in the snapshots folder, but they don’t use jinja blocks (one .sql file per snapshot)
Which is best?
how strongly do we believe in the “snapshots should only be select *” best practice
Notes
Related issues
https://github.com/dbt-labs/dbt-core/issues/4761
https://github.com/dbt-labs/dbt-core/issues/9033