dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.29k stars 1.54k forks source link

Additional materialisation: Subquery #1248

Closed jamesrguy closed 4 years ago

jamesrguy commented 5 years ago

Feature

Add "Subquery" materialisation

Feature description

Bit of a spanner here for me, using Greenplum 4.x. There are two query planners in Greenplum, one called the legacy planner which is very clunky, and the super (not so super) “optimizer”. The optimizer reverts to the legacy planner when things get to complicated. Unfortunately neither of these can manage to rewrite CTEs sensibly. Take a query with some base models in CTEs and run them through the optimizer and it will “revert” to the legacy planner and the result is very expensive and doesn’t run. If you copy and paste the CTEs into subqueries, the optimizer doesnt revert and can do some amazing tricks!

So how difficult would it be to add another materialisation strategy “subquery”, which is like ephemeral except writes the dependent models into subqueries instead of CTEs?

(All of the above may or may not be an issue in GP5.x, but I won’t be able to test that until later in the year.)

Who will this benefit?

Users of databases which have query planners that have mixed results using CTEs, eg Greenplum

drewbanin commented 5 years ago

Hey @jamesrguy - thanks for the feature request! As I understand it, Greenplum is based on Postgres, so this limitation makes sense to me. In Postgres, CTEs are "optimization fences" in which the optimizer can't push down filters past the CTE boundary. Sounds like something similar is happening in Greenplum here.

Ephemeral models are pretty tricky to handle in dbt. They're a special case of a materialization, and they're defined and implemented deep inside of the dbt compiler. I don't think it would be a good idea to add a second subquery materialization, but I do think it could be reasonable to make the implementation details of ephemeral models configurable. I'm not exactly sure how this config should be defined, but imagine something like ephemeral_type="subquery".

If we did this, there would be two places where code would need to change:

  1. When a ref is made to an ephemeral model, dbt should return the contents of the model in a subquery (rather than the name of a CTE) https://github.com/fishtown-analytics/dbt/blob/dev/stephen-girard/core/dbt/context/runtime.py#L51-L56

  2. When the ephemeral type is set to subquery, CTEs should not be injected into models https://github.com/fishtown-analytics/dbt/blob/dev/stephen-girard/core/dbt/contracts/graph/compiled.py#L178

One of the big challenges with CTEs is that they need to be applied recursively. I think dbt's compiler will just handle this, but I'm not totally certain. It's definitely worth looking into.

Let me know if you buy this general approach. If you folks (or anyone else visiting this issue!) has the resources to try making a PR, I'd be super happy to help out!

jamesrguy commented 5 years ago

Thanks drew. That approach seems reasonable, but I'll take a look into the code to see what it might take to make it happen!

drewbanin commented 5 years ago

One other consideration here upon thinking about this some more. When dbt injects ephemeral models, it does so once at the top of the file, then refers to that CTE by name throughout the query. If we're using subqueries here, we could run into name collisions if the same ephemeral model is referenced twice in a given select statement.

eg:

select *
from {{ ref('ephemeral_model') }}
left join {{ ref('ephemeral_model') }} using (id)

Since subqueries need to be named on postgres (and i assume greenplum) this would render out to

select *
from (select * from ...) as sbq_ephemeral_model
left join (select * from ...) as sbq_ephemeral_model using (id)

Ideally we'd number these subqueries, but that might be a little tricky depending on where the subquery is constructed.

jamesrguy commented 5 years ago

In the last instance there, we would always have to name the the subqueries for them to be useful in the top part of the select, as they would need to be explicitly referenced...

I added some code to runtime.py and will test it out in the next few days:

        is_ephemeral = (get_materialization(target_model) == 'ephemeral')
        if is_ephemeral:
            is_subquery = (get_ephemeral_type(target_model) == 'subquery')
            if is_subquery:
                return '(\n{}\n)'.format(target_model.compiled_sql)
            else:
                model.set_cte(target_model_id, None)
                return adapter.Relation.create(
                   type=adapter.Relation.CTE,
                   identifier=add_ephemeral_model_prefix(
                        target_model_name)).quote(identifier=False)

and added get_ephemeral_type to utils. It seems to work!

Happy for more input if anyone has style tips, etc, in keeping with how things are done in DBT, and will report back some experience/testing.

drewbanin commented 5 years ago

cool @jamesrguy! Sounds like the next step is to make a Pull Request for your change :)

We can talk about the specifics of the code there!

MarcelvanVliet commented 5 years ago

Great to see this going into the Wilt Chamberlain release!

MarcelvanVliet commented 5 years ago

Just built this from source with Drew's extra references & it works as advertised. I'll not attempt to add to source as I've not quite got the hang of Git as yet.

drewbanin commented 5 years ago

@MarcelvanVliet quick FYI - https://github.com/fishtown-analytics/dbt/pull/1283 is blocked on a couple of minor (but pretty gnarly) issues we need to shake out, so this isn't going to make it into our Wilt Chamberlain (0.14.0) release.

Would love to hear your thoughts in that PR!

drewbanin commented 4 years ago

closing due to lack of interest and implementation challenges