Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License

Build from fct table #221

Closed ryan-systematik closed 1 year ago

ryan-systematik commented 1 year ago

While running basically the same model in parallel to smooth the transition between models, I noticed that building the models from a fct_ga4__event_page_view model that I created was 40 times more efficient than building them from stg_ga4__event_page_view.

(Screenshot: build-stg-vs-fct)

We build from stg_ga4__event_page_view in a number of places, all of which are opportunities for a performance improvement. This comparison doesn't take caching into account, but I haven't noticed much of a cache effect from model to model (the cache clearly helps when re-using fields within a single model).

However, I only have anecdotal evidence that there is little cache re-use between models, so I'd like to see evidence either way.

I see us having two options here:

  1. We make fct_ga4__event_page_view a core package model
  2. We conditionally build from fct_ga4__event_page_view if it is present, and otherwise fall back to stg_ga4__event_page_view

The first option seems simpler. It has the advantage that enabling and disabling package models is a pattern that we've used elsewhere, so it's not a big leap to expect users to disable the fct_ga4__event_page_view model. However, I find myself customizing this model a lot for each client, so we'd be creating a model that almost always gets disabled.

The second option requires greater complexity in the package models. I suspect many people will be able to create fct_ga4__event_page_view models that don't error without consulting any documentation, and the error messages should make it fairly clear what is missing from any user-created fct_ga4__event_page_view model.
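
To make the fallback in the second option concrete, here is a rough sketch of the selection logic in plain Python. It is only an illustration of the idea, not dbt or dbt-ga4 code: the function name `choose_page_view_source`, the `available_models` mapping, and the column names in `REQUIRED_COLUMNS` are all hypothetical, and a real implementation inside the package would have to be expressed in dbt's own Jinja and SQL.

```python
# Illustrative sketch only: plain Python, not dbt or dbt-ga4 code.
FCT_MODEL = "fct_ga4__event_page_view"
STG_MODEL = "stg_ga4__event_page_view"

# Hypothetical column names, chosen purely for illustration.
REQUIRED_COLUMNS = frozenset({"event_key", "page_location", "event_timestamp"})


def choose_page_view_source(available_models):
    """Return the name of the model that downstream models should build from.

    available_models maps a model name to the set of column names it exposes.
    """
    # No user-defined fct model: fall back to the staging model.
    if FCT_MODEL not in available_models:
        return STG_MODEL

    # A fct model exists: check that it exposes everything downstream needs,
    # and name the missing columns explicitly if it does not.
    missing = REQUIRED_COLUMNS - set(available_models[FCT_MODEL])
    if missing:
        raise ValueError(
            f"{FCT_MODEL} is missing required columns: "
            f"{', '.join(sorted(missing))}"
        )
    return FCT_MODEL
```

The point of the explicit check is that a user who creates an incomplete fct_ga4__event_page_view gets a message naming the missing columns rather than a failure further downstream.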

Is this something that we should pursue?

Is caching more effective than I give it credit for?

Do you have a preference between these two options?