Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License
290 stars 129 forks source link

Last non-direct session attribution #196

Closed adamribaudo-velir closed 1 year ago

adamribaudo-velir commented 1 year ago

Description & motivation

Closes #110

The following model has been added which provides last non-direct session attribution: stg_ga4__sessions_traffic_sources_last_non_direct_daily

This model is similar to stg_ga4__sessions_traffic_sources_daily but provides the following new fields:

Each field contains the traffic source values from the last session with a non-direct session_source value. The effect can be seen in the graphic below where the '(none)' channel grouping is replaced with 'Organic Search'

image

A unit test has been created which mimics this behavior and validates the results.

Checklist

dgitis commented 1 year ago

I'm thinking the solution here is to build a minimal session table partitioned by day that contains session data possibly with no window functions and certainly with no lookback period. I'll call it int_ga4__sessions_daily.

If we aim for no window functions, we also add a partitioned page_view event table, fct_ga4__event_page_view.

On top of these two tables will build a daily sessions table dim_ga4__sessions or dim_ga4__sesssions_daily that uses a configurable lookback window (30 days is a good default, but this is really a business decision, so it should be in a variable. Since we have the lookback window configured here, we also reconcile cross-partition sessions in this query when hitting int_ga4__sessions_daily and get first_last page_view data from fct_ga4__event_page_view with only a one-day lookback while also resolving last_non_direct.

The goal of this architecture is to have all window functions hitting smaller tables.

Another issue that I've ran into when I set this up for a client on an old fork of the project is sessions with only first_visit and session_start events. Because we filter these events out from our session attribution, we end up failing to include those session keys in our session attribution queries, so when we join attribution data later, there's no data to join against.

We need downstream models to change null to direct or we need to decide to that we want to filter out these sessions. I think changing null to direct to be safer because it's hard to anticipate how people will use GA4. In my client's case, removing those sessions was the right call, but I'm not ready to make that generalization.