Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License
289 stars 128 forks source link

Primary and Foreign Key Optimization #296

Closed dgitis closed 5 months ago

dgitis commented 5 months ago

Setting primary and foreign keys in BigQuery causes BigQuery to eliminate joins which can greatly reduce the processing required of any query that includes a join.

This is accomplished by running queries similar to this example:

alter table dim_ga4__sessions
add primary key (session_key) not enforced,
add foreign key (client_key) references dim_ga4__client_keys(client_key) not enforced;

This query needs to be re-run when processing new partitions.

From my testing, dbt constraints doesn't appear to do anything. I can't get the primary key and foreign key indicators to appear when I set a constraint and process the table.

To take advantage of this, we need to add constraints as a post-hook. I'm thinking that we set our default values and then let users override the post-hook in the models block of the dbt_project.yml file.

It is possible, that we may have to do this after all of the models have run as the primary key needs to have been set before it can be referenced as a foreign key. Maybe there are some model sequencing tricks that we can use to support this.

willbryant commented 5 months ago

Setting primary and foreign keys in BigQuery causes BigQuery to index joins which can greatly reduce the processing required of any query that includes a join.

Are you able to provide a reference for this? I am not aware of BigQuery doing any indexing.

AFAIK the optimisations from foreign key definitions are things like join elimination (eg. assuming that an intermediate join table's rows exist without actually checking, and skipping the intermediate table if there are no columns retrieved from it): https://cloud.google.com/blog/products/data-analytics/join-optimizations-with-bigquery-primary-and-foreign-keys

dgitis commented 5 months ago

I was being lazy with my use of language. It's join elimination and not indexing that the keys optimize.

I'll edit the issue.

willbryant commented 5 months ago

Ah right. I'm surprised you have any unnecessary joins that could be eliminated?

dgitis commented 5 months ago

I want to add default primary keys to the session and user tables with the capacity to override the default alter table statements. These get joined quite frequently to event fact models, the session model in particular gets used a lot for mixing attribution with event data. Admittedly, they are not useful until you add foreign keys to the event models, but I don't think it hurts to have them there.

willbryant commented 5 months ago

You're pulling columns from them though right? So the join optimization will not eliminate them.

Sorry, not trying to say that you shouldn't do it necessarily, just that I don't think you will get any payoff for the effort you'd go to. My understanding is that if you don't have any unnecessary joins then AFAIK all it can do is make queries return weird results if any of the keys are not as unique as they are meant to be.

dgitis commented 5 months ago

I did test it on a horrible spaghetti join that some stakeholders made me do (and I hate them for it) and it had a huge effect on that particular model.

Still, I needed primary keys set on session and event models.

I haven't tested it on those basic joins. I just sort of presumed that it would help. I'll go test it and share the results here.

willbryant commented 5 months ago

Haha, cool.

I think you can set this up in DBT using the new model contracts feature (with the downside that you have to list out every column in the table, which is super tedious). Is that what you were planning to do?

dgitis commented 5 months ago

You are right, @willbryant, about setting primary and foreign keys on your tables not doing anything on basic joins.

However, dbt Contracts does not alter the tables in the way that's needed for this to do anything in situations where joins can be improved by primary and foreign keys. You need to run the alter table statements in a post-hook.

If it were more broadly useful, then I would push to make this a part of the package. Since what I'm dealing with is more of an edge case than I realized, I'm fine with either adding a model-level post-hook variable that lets us customize post-hooks on a model-by-model bases or overriding our the package models with project models in cases where I need this.

Do you have a preference? I'm good with either. I just want to implement this properly for the client.