dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.58k stars 1.59k forks source link

[CT-2468] [Feature] For versioned models, automatically create view/clone of latest version in unsuffixed database location #7442

Open jtcohen6 opened 1 year ago

jtcohen6 commented 1 year ago

Thanks @dbeatty10 @joellabes for working through this with me!

As the producer of a versioned model, I want the unsuffixed database relation (<db>.<schema>.<model_name>) to always point to the latest version of that model. This provides downstream consumers with the same optionality that's available to ref: "pin" to a specific version (suffix), or "unpin" (no suffix) and always get the latest.

dbt should handle this for me, by creating an additional "pointer" view (select * from <latest>) or table clone (depending on the relation type and data platform support) whenever I run the latest version of my model.

Implementation options

I'm not in love with any of these options:

  1. Create a new node in the manifest with this logic, materialized as a view (or clone?). Not sure if/how we'd be able to set configurations on that additional node/object (e.g. persisting descriptions).
  2. Update the view, table, and incremental materializations to produce an additional relation, if the model is versioned (and this is the latest version)
  3. Provide users with a macro that they can "set and forget" as a model post-hook

Of those, (3) is the least magical, and it's what we'll propose to users in the short term (v1.5). We still don't like:

jtcohen6 commented 1 year ago

We have three implementation options. Each has trade-offs around user limitations and complexity of implementation.

Next step: Let's spike each of the options, and make a call.

nicohein commented 1 year ago

Hi All,

I would like to propose a fourth option, that is a little bit more verbose but adds flexibility:

Option 4

The main model itself remains unchanged. It is expected to be implemented in dim_customers.sql. The materialization with the version number of the main model becomes optional. If you want to have this materialization, then you create an item in the versions.

- models:
  - name: dim_customers
     # it essentially becomes obsolete to specify the main/latest version
    config:
      materialized: table
      meta:
        key: value
    columns:
      - name: customer_id
        description: This is the primary key
        data_type: int
      - name: ...
    versions:
      - version: 1
        implemented_in: dim_customers_OLD
        deprecation_date: '2022-01-01'
        config:
          meta: {} #remove the meta from v2
        columns:
          include: ...
          exclude: ...
      - version: 2
        implemented_in: dim_customers
      - version: 3
        description: A dramatic reimagining of our customers model with no column overlap? What are they thinking?!
        implemented_in: dim_customers_NEW_BUT_NOT_READY
        # deprecation_date is optional and not included for this version
        columns:
          - name: totally_new_column_1
          - name: totally_new_column_2
          - name: totally_new_column_3
          ...

This would result in a second table being created called dim_customers_v2. If the developer instead choses to materialize _v2 as a view thats possible by instead writing:

      - version: 2
        # implemented_in: dim_customers_v2   # expected default 

And adding the corresponding .sql file:

{{ config(materialized="view") }}

select * from {{ ref('dim_customers') }}

If a team decides to have _v2 materialized as a table and define dim_customers as a view, that is now equally possible.

One remaining challenge I can think of is that a developer could introduce a drift between the main model and v2 because there is simply no guarantee of those models being equal. I regard this risk as low because whoever will make use of models, will put some thought behind the versioning model.

Spince commented 1 year ago

this is concerning to me.. the whole purpose of versioning an api is to NOT introduce breaking changes. unless every model is postfixed with _v1 from the beginning the proposed solutions would cause breaking changes. ie- dim_customers was created but is now being modified, the above solution places a view where dim_customers "v1" originally existed

i need to give it more thought but it seems like the pointer object that routes a query to the latest version should instead be postfixed with _latest. i think this needs to be considered in terms of implemention on an existing project that's been around for years rather than a clean slate project that's perfectly configured according to dbt features and standards that weren't even discussed or codified until years after the original dim_customers table was created.

nicohein commented 1 year ago

I think I still would expect dim_customers to behave just like like dim_customers_latest. Nevertheless would the proposal for option 4 handle different ways of implementation:

dim_customers unchanged & latest materialized as table

- models:
  - name: dim_customers
    config:
      materialized: table
      meta:
        key: value
    columns:
       ...
    versions:
      - version: latest  # just like a tag in docker 
        implemented_in: dim_customers_v2
        alias: dim_customers_latest  # just in case default behavior results in dim_customers_vlatest
      ...
      - version: 2
        # implemented_in: dim_customers_v2  
      ...

dim_customers unchanged & latest as view

- models:
  - name: dim_customers
    config:
      materialized: table
      meta:
        key: value
    columns:
       ...
    versions:
      - version: latest
        alias: dim_customers_latest
      ...
      - version: 2
        # implemented_in: dim_customers_v2
      ...

and dim_customers_latest.sql:

{{ config(materialized="view") }}

select * from {{ ref('dim_customers', version="2") }}

dim_customers like latest & v2 + latest as view

- models:
  - name: dim_customers
    config:
      materialized: table
      meta:
        key: value
    columns:
       ...
    versions:
      - version: latest
        alias: dim_customers_latest
      - version: 2
        implemented_in: dim_customers_latest
     ...

and dim_customers_latest.sql:

{{ config(materialized="view") }}

select * from {{ ref('dim_customers') }}

Happy to clarify any questions around this idea and receive some more feedback.

Spince commented 1 year ago

@nicohein I guess I should have been more specific. I don't agree with I want the unsuffixed database relation (<db>.<schema>.<model_name>) to always point to the latest version of that model from the original comment here.. if I was implementing such a feature I would not do it that way.

jtcohen6 commented 1 year ago

From discussion yesterday:

Next step:

joellabes commented 1 year ago

My preference is Option 2.5, where it happens automatically but the logic is implemented in an overridable macro in the vein of generate_alias_name so that you can modify it if you want to. You could even do something like

models:
  - name: some_model
     versions: 
       - v: 1
         ...
       - v: 2
         config: 
           generate_pointer_view: false

if you wanted model-by-model control, right?

alison985 commented 1 year ago

Link to a Slack message with the way I think about this. https://getdbt.slack.com/archives/C2JRRQDTL/p1692185439979629?thread_ts=1692185332.030239&cid=C2JRRQDTL Specifically:

Rename foobar to foobar_v1 Create foobar as view (it should know that foobar in after is not the same as foobar in before) Create foobar_v2

I also have to make sure foobar(view/MV), foobar_v1(table/MV), and foobar(table/MV) are all getting refreshed data. Assume there are some objects that I could be maintaining 3 versions of at the same time for months.

@Spince is :100: correct - you can't have an object name change version. 1 object name == 1 version == 1 model. If I'm reading correctly, I think that means that Option 4 from @nicohein with @Spince's requirements is the closest thinking. I'm not sure I understand @joellabes's latest comment.

Right now, I'm handling this in naming convention as opposed to anything in dbt syntax because I need explicit control and dbt doesn't support ALTERs and changing materialization type of an object. (I haven't thought through incremental models for this yet, but I'm sure there's interesting cases there.) If you have the benefit of starting a code base from scratch everything has a _v1 suffix, _v2 suffix, and so on. Sometimes I put the version on the schema, sometimes on the individual model. When you don't have that luxury, the non-suffixed version is _v1 implicitly.

Think of this like API version control. I have to give downstream(e.g. BI) a chance to update. The downstream customer is pinned to a version, it's just a matter if they know that explicitly(vX suffix) or not(no _vX suffix). The version append suffixes are important for organizational expectations and communications.

If I want to have dim_customers_latest, that can be an optional object creation so people can choose to be on the bleeding edge that might break at anytime(e.g. a developer release).

jtcohen6 commented 11 months ago

@Spince @nicohein @alison985 Thank you for the feedback on this! We've been working on the technical spiking that might make this mechanism possible. I'm coming back around to the questions about the desirable product / end user experience.

After reading through the comments above, this is what I'd propose:


How should users disable dbt's automatic creation of the "latest" view/clone? Ideas:

I will give that last piece a bit more thought - but I'm curious to hear how this lands overall!

alison985 commented 11 months ago

Sounds right to me. I would vote for disabling via configuration as @joellabes suggested, but I don't think it can be specified at the version level(see below).

New item: you're repeatedly saying "view/clone" instead of just view. But I don't see a proposed spec on where to define which it is in YML. I learned about dbt clone today and I don't want to use it, but presumably others will. Also, since it's a choice, I removed "view" from the keyword being proposed. I propose the following, Option A:

models:
  - name: some_model
    config:
      materialized: table

      ########
      generate_pointer: false
      # OR                                      # both can not be specified at the same time
      materialized_pointer: [view | clone]      # default to view if not specified
      ########
    versions: 
       - v: 1
         ...
       - v: 2
         # How would we know what period of time an automatic alias of a version should be disabled?
         #config:   
         #  generate_pointer_view: false
       - v: 3
         #config:
         #  generate_pointer_view: true

I thought about doing (Option B)generate_pointer: [false | view | clone] because they are mutually exclusive, but it didn't feel right to mix a boolean and a string. I guess you could do (Option C)generate_pointer: [no | view | clone] but it would introduce new boolean keywords(yes/no) to dbt YML which also feels wrong.

jtcohen6 commented 11 months ago

@alison985 Thanks for taking another look!

Re: "view/clone," I had been thinking that:

  1. If the "latest version" of this model corresponds to a table in the database (because it's materialized as table or incremental), and this DWH supports table clones, we would create a table clone. This shares all metadata attributes of the cloned table.
  2. If the "latest version" of this model is materialized as a view (or materialized view, or something else), or if this DWH does not support table clones, we would create a pointer view (create view ... as select * from ...).

The benefit of this approach is that metadata queries expecting this object to be a table would still work, if the conditions for (1) are met. But it is more complex as an approach. I get that it would be simpler to just say: This is always a pointer view.

So to answer all the outstanding questions here:

  1. Let's keep things simple, by always creating a view.
  2. The view should be named _latest.
  3. We check the config generate_latest_pointer: true|false. It is true by default, the user can set to false in any of three places: dbt_project.yml for many/all models; in the top-level model config for this versioned model; or in the config for the "latest" version of a versioned model.
  4. For more custom behavior, users could add a v: latest with their own alias config, contents, etc.
SBurwash commented 8 months ago

Hey everyone 👋 very much in love with this feature.

Question - with the pointer, would we be able to pass documentation? The goal here would be to have the pointer have the same documentation as the latest table (through the persist_docs feature)

smitsrr commented 5 months ago

Just had a production bug this morning because the new _v1 model didn't have a database object (unsuffixed) which triggered a full-refresh 😬 . I am emphatically upvoting this issue for development!

AnotherGuitar commented 3 months ago

*upvote

shahinsaleh commented 1 month ago

*upvote

geoHeil commented 2 weeks ago

upvote

rapha-pereira commented 3 days ago

upvote!

frannydelaney commented 2 days ago

upvote (is there a better way to do this?)

elyobo commented 1 day ago

upvote (is there a better way to do this?)

Yes, use 👍 on the issue