Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License
289 stars 128 forks source link

Session default channel grouping results are significantly different from the out-of-the-box values provided by GA4 #267

Closed ShaiDiamant closed 9 months ago

ShaiDiamant commented 9 months ago

Hi, The package is insanely useful, however, when trying to compare the data with the original data available in GA4, we found a significant difference in the channel grouping: From live GA4 connector on Looker Studio, we got the following number of sessions: Organic Search: X Paid Social: Y Direct: Z Let's focus on these three for now.

When using the package to count sessions based on the Session default channel grouping, the results are: Organic Search: X - 454 Paid Social: Y + 6 (none) = Direct: Z + 922

The (none) part seems to be chewing up data from the other parts.

Is that something we can somehow fix/control? Are there any other users of this package that saw a difference like that? We tried investigating through the macros and the data and the only things we can think of are:

  1. A different "attribution" model - the stg model that calculated the sources always uses "FIRST_VALUE" - maybe it should be last?
  2. Missing reference to different data points - if there is no source and no medium - the event will get a direct group. However - what if there's a referral? Shouldn't it be different?
adamribaudo-velir commented 9 months ago

For the metrics pulled from dbt-ga4, can you clarify whether you're using https://github.com/Velir/dbt-ga4/blob/main/models/staging/stg_ga4__sessions_traffic_sources_daily.sql

or

https://github.com/Velir/dbt-ga4/blob/main/models/staging/stg_ga4__sessions_traffic_sources_last_non_direct_daily.sql ?

If you want to include the actual SQL code, that could help as well

ShaiDiamant commented 9 months ago

I am using https://github.com/Velir/dbt-ga4/blob/main/models/staging/stg_ga4__sessions_traffic_sources_daily.sql

The Query I ran was: select session_default_channel_grouping, count(*) from ga4_marts.dim_ga4__sessions_daily where session_partition_date between date(2023,08,06) and date(2023,09,04) group by 1 order by 2 desc

and that was compared to a Looker Studio report that was filtering the same dates.

adamribaudo-velir commented 9 months ago

Google applies 'last non-direct' session-level attribution in their default reports. This means that they will ignore direct traffic if there was a prior non-direct source. The model you're using looks at the actual source of traffic, not the last-non-direct source.

If you'd like the last non-direct source, try using stg_ga4__sessions_traffic_sources_last_non_direct_daily.sql and see if that helps.

ShaiDiamant commented 9 months ago

Okay, now I used last_non_direct_default_channel_grouping - which instead of session_default_channel_grouping and got closer numbers. It's not identical, but that makes sense.

So now we understand that dim_ga4_sessions only has session_default_channel_grouping - which is the actual source, and dim_ga4_sessions_daily has both session_default_channel_grouping and last_non_direct_default_channel_grouping.

Sorry for the misunderstanding and thanks for the quick response :)