dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

Support Materialized Views #1162

Closed: clausherther closed this issue 4 months ago

clausherther commented 5 years ago

Docs: https://docs.snowflake.net/manuals/sql-reference/sql/create-materialized-view.html

Ideally this would also support the cluster by keyword:

create materialized view "schema"."my_mview"
cluster by (id)
as
select 1 as id,
       2 as name
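For comparison, dbt's existing table materialization on Snowflake already exposes clustering via the cluster_by config, so an MV materialization could plausibly do the same. In the sketch below, cluster_by on a table model is real dbt-snowflake syntax; the materialized_view materialization name is hypothetical:

```sql
-- models/my_mview.sql: how this might look as a dbt model (sketch)
{{ config(
    materialized='materialized_view',  -- hypothetical materialization name
    cluster_by=['id']
) }}

select 1 as id,
       2 as name
```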
clausherther commented 5 years ago

Looks like this was added in 0.13.1 (PR #1432 )

drewbanin commented 5 years ago

@clausherther not so! #1432 fixed a problem where dbt couldn't run if a materialized view lived in the dbt schema. dbt still does not support the creation of materialized views on Snowflake, though it is something I've been experimenting with recently.

In your mind, what's the advantage of using a materialized view over a dbt table model that's refreshed with some cadence? I seem to recall some pre-release Snowflake functionality that would auto-update materialized views, but I'm struggling to find that info presently

clausherther commented 5 years ago

Ah, I misread this in the release notes: "Add "MaterializedView" relation type to the Snowflake adapter"

Personally, I don't think I have a good use case for MVs in Snowflake, that need has really been largely replaced by dbt. I think I opened the issue alongside the one for transient tables, to keep track of new Snowflake features you potentially want to support.

drewbanin commented 5 years ago

alright, thanks for the additional context! Happy to leave this open to track the Snowflake feature, but also welcome input from anyone who đź‘Ť'd the issue to chime in with some use cases!

jrandrews commented 5 years ago

I'm trying to think how we would deal with MViews in dbt. You don't really want to drop and recreate them every time, because Snowflake would then spend a lot of extra time and cost bringing them back up to date with the underlying data. They do act like views while "stale" (not completely refreshed), so at least there wouldn't be any problem if they are queried before being completely rebuilt:

> Data accessed through materialized views is always current, regardless of the amount of DML that has been performed on the base table. If a query is run before the materialized view is up-to-date, Snowflake either updates the materialized view or uses the up-to-date portions of the materialized view and retrieves any required newer data from the base table.

https://docs.snowflake.net/manuals/user-guide/views-materialized.html

Perhaps they would be like incremental tables, where they are only re-created when a) they don't exist or b) a full refresh is forced?

There is also the potential for a lot of run-time errors with MViews in Snowflake, because they are limited in the SQL they support. My concern is that it would be pretty difficult for dbt to parse and validate at compile time what is not legitimate for an MView, which in many cases would lead to dbt runs failing when the MView compiles but then fails to create or update at run time. The list of restrictions is too long to write here, but see the link above.

mike-weinberg commented 4 years ago

@drewbanin materialized views are useful for:

  1. accelerating external tables with automated refreshes (better than using pipes)
  2. maintaining a high-water mark on CDC tables, e.g.
    CREATE MATERIALIZED VIEW cdc_highwater
    AS
    (
    SELECT primary_key, column_name, MAX(wal_ts) AS wal_ts
    FROM cdc_table
    GROUP BY primary_key, column_name
    )

    to accelerate consolidation into current-state tables like:

    SELECT primary_key
        , MAX(DECODE(column_name, 'col1', value)) AS col1
        , MAX(DECODE(column_name, 'col2', value)) AS col2
        , ...
        , MAX(DECODE(column_name, 'colN', value)) AS colN
    FROM cdc_table JOIN cdc_highwater USING (primary_key, column_name, wal_ts)
    GROUP BY primary_key

Frankly I think only use case 1 is value-generating, since the performance improvements are (allegedly) nontrivial for end users.

drewbanin commented 4 years ago

Thanks @mike-weinberg! I don't think it would be too challenging to create a materialized_view materialization in dbt. I think the logic would look like:

I still think that the list of caveats is too restrictive for most modeling use cases (no window functions, no unions, limited aggregates, can't query views, etc). Still, I'm happy to support these, as I do think the first use case you've pointed out here @mike-weinberg is good and legitimate. Plus, I imagine that these materialized views will only become more capable over time.

If someone is interested in creating this materialization, it all should be pretty doable in user-space! Would love to discuss the implementation further if anyone picks it up :)
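A user-space materialization along these lines might be sketched as follows. This is a hypothetical outline (the macro name, and the assumption that a simple create-or-leave-alone policy is acceptable, are mine), not the experimental implementation referenced above:

```sql
-- macros/materializations/materialized_view.sql (hypothetical sketch)
{% materialization materialized_view, adapter='snowflake' %}

  {%- set target_relation = this.incorporate(type='view') -%}
  {%- set existing = load_relation(this) -%}

  {{ run_hooks(pre_hooks) }}

  {% call statement('main') -%}
    {% if existing is none or flags.FULL_REFRESH %}
      create or replace materialized view {{ target_relation }}
      as {{ sql }}
    {% else %}
      -- MV already exists and no full refresh requested: leave it alone,
      -- since Snowflake keeps it up to date automatically
      select 1
    {% endif %}
  {%- endcall %}

  {{ run_hooks(post_hooks) }}

  {{ return({'relations': [target_relation]}) }}

{% endmaterialization %}
```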

automagic commented 4 years ago

Redshift is also planning on supporting materialized views (in preview):

https://docs.aws.amazon.com/redshift/latest/dg/mv.html

It appears that many of the same limitations apply as with Snowflake materialized views. https://docs.aws.amazon.com/redshift/latest/dg/mv-usage-notes.html

In addition, Redshift appears to keep the underlying materialized data static until a refresh command is issued at which time it either incrementally updates the data if certain conditions are met, or does a full rebuild. https://docs.aws.amazon.com/redshift/latest/dg/mv-refresh.html#mv_REFRESH_MARTERIALIZED_VIEW_usage
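The Redshift DDL described above looks roughly like this (a sketch; the table and view names are illustrative):

```sql
-- Redshift: create an MV that Redshift refreshes automatically in the background
CREATE MATERIALIZED VIEW daily_sales
AUTO REFRESH YES
AS
SELECT order_date, SUM(amount) AS total_amount
FROM sales
GROUP BY order_date;

-- Or refresh on demand: Redshift incrementally updates the MV when it can,
-- and falls back to a full recompute otherwise
REFRESH MATERIALIZED VIEW daily_sales;
```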

While I agree that there will be limited cases where this makes sense in modeling, surely there are valid and interesting use cases.

drewbanin commented 4 years ago

Thanks @automagic! I used to be not-so-bullish on Materialized views, but the idea of these incremental refreshes is really appealing. One thing I particularly like about them: refreshing a table can happen concurrently with external queries. When dbt replaces tables, it currently blocks if there are long-running select statements hitting the table.

The one challenge here is going to be knowing when dbt should rebuild the MV, vs. just refresh it. We can leverage the --full-refresh flag, but that feels a little heavy handed. It would certainly be pretty cool if dbt could check if the columns in the view have changed, a la #320. Even better would be some sort of logical comparison (eg. check the hash of the SQL that created the model against the hash of the new model).... but that could be something for a separate issue :)
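One way such a logical comparison might be sketched on Snowflake (an assumption on my part; GET_DDL and MD5 are real Snowflake functions, but whether this is robust enough for dbt's purposes is an open question):

```sql
-- Hash the deployed MV's DDL; dbt would compare this against a hash of the
-- newly compiled create statement and rebuild only when they differ
-- (names are illustrative)
SELECT MD5(GET_DDL('view', 'my_schema.my_mview')) AS deployed_hash;
```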

jonah-juul commented 4 years ago

BigQuery has a similar feature https://cloud.google.com/bigquery/docs/materialized-views-intro

Would definitely be useful

bashyroger commented 4 years ago

Even with the limitations native Snowflake MVs have, supporting them in dbt will become more useful once the transparent MV usage functionality has been released (in private preview at the moment of writing this). Basically, this is functionality akin to Oracle's QUERY_REWRITE. See my write-up on this new feature (and others) and the Snowflake new-features video that mentions it: https://www.linkedin.com/pulse/my-pov-new-snowflake-features-announced-june-2020-werschkull-/

jtcohen6 commented 4 years ago

Hey @bashyroger, check out:

mike-weinberg commented 4 years ago

Since this thread is warming up, let's make it spicy 🌶️ !

Snowflake recently announced a clever feature they're calling "transparent materialized views".

In the future, if a plan node of a query is equivalent to or is a subset of the plan node of a materialized view, then the materialized view will be used, even if it isn't referenced, because it is provably equal.

So, if you identify common aggregations and build materialized views for those aggregations, then queries with steps equivalent to those materialized views will be rewritten by the planner to use them. Alternatively, you could maintain multiple copies of the same table with different clustering keys, and Snowflake will automatically pick the best one. This is analogous to Vertica's concept of projections.

Personally, I think this is a big deal, and if Snowflake expands support to include INNER JOIN and UNION ALL clauses, then chains of MVs might become the best way to express many DAGs. Even in the absence of that, the ability to maintain copies of tables with various clustering keys means that MVs could accelerate regular views enough to enable decent-performing low-latency reporting when combined with a CDC ELT vendor.

EDIT: When I wrote this, I either didn't see the prior two comments, or my page needed to be refreshed. Embarrassing! I'll leave this here though =)

nareshkumaryadav commented 3 years ago

What is the latest state on this one? Is this issue still open?

kamleshsinhchauhan commented 3 years ago

I also want to know this, as I am going to put our Snowflake objects under source control, and views need to be source-controlled like tables are.

jtcohen6 commented 3 years ago

Hey @nareshkumaryadav, this issue is definitely still open.

All of which is to say: I'd find it really valuable to hear more about what you're thinking of doing with MVs in your own project.

@kamleshsinhchauhan It sounds like what you're after may be slightly different: we have long supported view and table as materialization types on Snowflake. Check out the docs on materializations. If you are thinking about Snowflake materialized views, then this is the right thread!

fairchild commented 3 years ago

I thought I'd share another use case. We are considering use of materialized views in Snowflake. Our use case is to load Debezium CDC data into a staging table with a variant column that accumulates the stream. Materialized views would then be created from the staging table to build a history table for each corresponding source table. We could create the materialized views out of band from dbt and then treat them as dbt sources, but would prefer to have it all in dbt if possible.

It would be great to hear if folks are having success with the experimental implementation, and/or if that is likely to make its way into a dbt release in the near future.
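A sketch of the kind of history view described above, assuming Snowflake, a variant column named record, and illustrative Debezium field names (whether a given warehouse allows this much semi-structured access inside an MV needs checking against its current limitations):

```sql
-- Extract typed columns from the Debezium payload accumulated in staging
CREATE MATERIALIZED VIEW history.orders AS
SELECT
    record:payload:after:id::number      AS id,
    record:payload:after:status::varchar AS status,
    record:payload:ts_ms::number         AS ts_ms
FROM staging.orders_cdc;
```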

tomharrisonjr commented 3 years ago

> Thanks @automagic! I used to be not-so-bullish on Materialized views, but the idea of these incremental refreshes is really appealing. One thing I particularly like about them: refreshing a table can happen concurrently with external queries. When dbt replaces tables, it currently blocks if there are long-running select statements hitting the table.
>
> The one challenge here is going to be knowing when dbt should rebuild the MV, vs. just refresh it. We can leverage the --full-refresh flag, but that feels a little heavy handed. It would certainly be pretty cool if dbt could check if the columns in the view have changed, a la #320. Even better would be some sort of logical comparison (eg. check the hash of the SQL that created the model against the hash of the new model).... but that could be something for a separate issue :)

Not sure I understand why this is an issue. Redshift does support MVs now, including automated refresh, and internally knows whether they can be incrementally updated or require a full rebuild. The refresh occurs asynchronously (scheduled to avoid conflict with other uses). dbt should simply recognize that when a view is an MV, it needs to leave it alone unless the query itself has changed.

tomharrisonjr commented 3 years ago

Our analytics team is suffering from poor performance of dbt-generated views. We have confirmed that a Redshift-managed materialized view runs in a second or two, compared to the standard view created by dbt, which now takes several minutes to run (so roughly two orders of magnitude faster in this case). They have tried dbt options, but the tradeoffs are poor for their use case. Is there any update or plan to support Redshift (and BigQuery and Snowflake) materialized views soon?

jtcohen6 commented 3 years ago

> including automated refresh

I didn't know this! I just updated the sample code in https://github.com/fishtown-analytics/dbt-labs-experimental-features/pull/11 to add an auto_refresh config on Redshift. That's really compelling; in my anecdotal experience, the manual refresh materialized view statement was quite slow to run.

The question at this point isn't, Can you use materialized views with dbt? You absolutely can, by including some custom macros and materializations in your project. The question is, Do we have a recommended implementation of MVs we feel great about, such that we want to set it as the standard for all new projects going forward?

Materialized views have come a long way in just the past year, but they still have significant limitations. Particularly on Redshift, MVs do not support late binding, and they can't be identified in pg_depend (https://github.com/awslabs/amazon-redshift-utils/issues/499), so it's impossible for dbt to know when they've been cascade-dropped. When an MV needs to be replaced (whether because its definition has updated, or the schema of the underlying table has changed), I don't see a way to do this both reliably and atomically.

LoekL commented 3 years ago

We support a large number of event types which we split out in separate tables (one for each event) for a variety of reasons. Currently we keep all these up-to-date by running hourly incremental updates. When moving this part of our setup to materialized views:

  1. We have far fewer moving parts / less stuff to maintain. I also suspect that, depending on your pricing model, it can be cheaper.
  2. We get side benefits like "smart tuning" on BigQuery (reference).
  3. Our dbt pipeline has to take care of far fewer models, which means it will finish sooner with the models that aren't suitable for MVs.

bytekast commented 3 years ago

+1 Would really love to see this implemented. Auto refreshed materialized views in Redshift are much more performant and efficient than incremental materialization based on my tests.

kristiyan-dimitrov commented 3 years ago

Adding my +1 for materialized views. Would be great to have support for it in dbt, we're using Postgres.

poudrouxj commented 3 years ago

+1 following

maver1ck commented 3 years ago

How is this one working? https://materialize.com/dbt-materialize-jaffle-shop-demo/

amychen1776 commented 3 years ago

@maver1ck

> How is this one working? https://materialize.com/dbt-materialize-jaffle-shop-demo/

I'm aware of 2 people using the Materialize adapter in production. Check out the #db-materialize channel in Slack to keep up to date and engage with others interested in the adapter.

raajpackt commented 2 years ago

+1 for redshift materialized views

rarup1 commented 2 years ago

+1 for redshift materialised views built into dbt.

andresbravog commented 2 years ago

+1 for Bigquery materialised views

calebsmac commented 2 years ago

+1 for Snowflake materialized views

benjaminwootton commented 2 years ago

Materialised views have definite uses in anything latency-sensitive. The warehouse will update the MV on insert into the underlying table rather than waiting for dbt to run.

It would be great to create these from dbt and then create references to them for downstream batch dbt runs.

Not sure what the plan is with ClickHouse, but it has additional features around MVs, such as the SummingMergeTree and ReplacingMergeTree engines, which can also be really useful to get the semantics you need.
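For illustration, a ClickHouse materialized view backed by a SummingMergeTree target might look like this (a sketch; table and column names are illustrative):

```sql
-- Target table whose engine sums numeric columns for rows sharing a key
CREATE TABLE daily_totals
(
    day    Date,
    clicks UInt64
)
ENGINE = SummingMergeTree
ORDER BY day;

-- MV that populates the target on every insert into events
CREATE MATERIALIZED VIEW daily_totals_mv TO daily_totals AS
SELECT toDate(ts) AS day, count() AS clicks
FROM events
GROUP BY day;
```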

Another +1 for this feature, Snowflake primarily.

ckaiser-graniterock commented 2 years ago

+1 MV's Snowflake

grimwm commented 2 years ago

+1 MVs in Postgres

sql-dr-watson commented 2 years ago

+1 MV's in Snowflake

cwelton commented 2 years ago

+1 for MVs in general, but given this has been open since 2018 and still no MVs... not so hopeful.

smitsrr commented 2 years ago

At this point we might need a community contributor to write the custom materialization with the logic @drewbanin outlined above. https://github.com/dbt-labs/dbt-core/issues/1162#issuecomment-530519585

ybressler commented 2 years ago

+1 for MVs

Would help solve this issue with cloning: https://github.com/dbt-labs/dbt-snowflake/issues/118

hsenger-tiger commented 2 years ago

Any update here from the dbt side? This would be very helpful for the performance reasons mentioned many times earlier.

jspreddy commented 2 years ago

+1 really need this for performance.

simonmd commented 2 years ago

+1 MVs in Postgres

kgeis commented 2 years ago

I am using PostgreSQL, and when I started using dbt, I really wanted it to support MVs in PostgreSQL. That was when my thinking was to run dbt on demand to create the structure of my data mart. Since then I think I've come more to the "dbt way" where I run dbt on a schedule in order to populate my data mart.

In PG, there is no standard way to schedule MV refreshes and no support for log-based refresh of MVs. This says to me that an MV in PG is equivalent to materializing a dbt model as a table and then scheduling a dbt run at the OS level to refresh it.
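For reference, the manual workflow in Postgres looks like this (a sketch; names are illustrative, and REFRESH ... CONCURRENTLY requires a unique index on the MV):

```sql
CREATE MATERIALIZED VIEW mart.daily_orders AS
SELECT order_date, count(*) AS n_orders
FROM orders
GROUP BY order_date;

-- A unique index is required for CONCURRENTLY (non-blocking refresh)
CREATE UNIQUE INDEX ON mart.daily_orders (order_date);

-- This must be scheduled externally, e.g. via cron or pg_cron
REFRESH MATERIALIZED VIEW CONCURRENTLY mart.daily_orders;
```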

I see the value of MV support for platforms with either internal scheduling (e.g. Oracle's REFRESH clause) or incremental refresh based on underlying changes (e.g. Oracle's FAST clause), such as Oracle or, I think, Snowflake. I no longer feel like I need this for PG.

mike-luabase commented 2 years ago

+1 for clickhouse MV's!

Bruno1298 commented 2 years ago

+1 for MV BigQuery and Snowflake

MarcinZegar commented 1 year ago

Any updates on this ticket?

willbryant commented 1 year ago

Use https://github.com/dbt-labs/dbt-labs-experimental-features/tree/main/materialized-views and give feedback there.

wmoustafa commented 1 year ago

Sharing some early work on integrating dbt, Coral, and Spark to implement materialized views and incremental view maintenance on various engines, starting with Spark. This dbt extension defines a new materialization mode called incremental_maintenance that automatically evaluates the query incrementally. Source code can be found under coral-dbt and coral-incremental. There are also some slides going through the design/implementation. Please feel free to test it, try it, or extend it.
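Based on that description, a model using the extension might be declared like so (a sketch; the materialization name comes from the comment above, while the model body and ref name are illustrative):

```sql
-- models/user_event_counts.sql: opt a model into incremental maintenance
{{ config(materialized='incremental_maintenance') }}

select user_id, count(*) as n_events
from {{ ref('events') }}
group by user_id
```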

dbeatty10 commented 1 year ago

Thank you for sharing @wmoustafa ! The slides were well-written and explained a couple big ideas (transpilation via Coral and incremental_maintenance as a drop-in replacement for the table materialization).

dbeatty10 commented 4 months ago

Closing in favor of: