This dbt package connects to an exported GA4 dataset and provides useful transformations as well as report-ready dimensional models that can be used to build reports.
Features include:
page_view
, session_start
, and purchase
model | description |
---|---|
stg_ga4__events | Contains cleaned event data that is enhanced with useful event and session keys. |
stg_ga4_event* | 1 model per event (ex: page_view, purchase) which flattens event parameters specific to that event |
stg_ga4__event_items | Contains item data associated with e-commerce events (Purchase, add to cart, etc) |
stg_ga4__event_to_query_string_params | Mapping between each event and any query parameters & values that were contained in the event's page_location field |
stg_ga4__user_properties | Finds the most recent occurance of specified user_properties for each user |
stg_ga4__derived_user_properties | Finds the most recent occurance of specific event_params value and assigns them to a client_key. Derived user properties are specified as variables (see documentation below) |
stg_ga4__derived_session_properties | Finds the most recent occurance of specific event_params or user_properties value and assigns them to a session's session_key. Derived session properties are specified as variables (see documentation below) |
stg_ga4__session_conversions_daily | Produces daily counts of conversions per session. The list of conversion events to include is configurable (see documentation below) |
stg_ga4__sessions_traffic_sources | Finds the first source, medium, campaign, content, paid search term (from UTM tracking), and default channel grouping for each session. |
stg_ga4__sessions_traffic_sources_daily | Same data as stg_ga4__sessions_traffic_sources, but partitioned by day to allow for efficient loading and querying of data. |
stg_ga4__sessions_traffic_sources_last_non_direct_daily | Finds the last non-direct source attributed to each session within a 30-day lookback window. Assumes each session is contained within a day. |
dim_ga4__client_keys | Dimension table for user devices as indicated by client_keys. Contains attributes such as first and last page viewed. |
dim_ga4__sessions | Dimension table for sessions which contains useful attributes such as geography, device information, and acquisition data. Can be expensive to run on large installs (see dim_ga4__sessions_daily ) |
dim_ga4__sessions_daily | Query-optimized session dimension table that is incremental and partitioned on date. Assumes that each partition is contained within a single day |
fct_ga4__pages | Fact table for pages which aggregates common page metrics by date, stream_id and page_location. |
fct_ga4__sessions_daily | Fact table for session metrics, partitioned by date. A single session may span multiple rows given that sessions can span multiple days. |
fct_ga4__sessions | Fact table that aggregates session metrics across days. This table is not partitioned, so be mindful of performance/cost when querying. |
seed file | description |
---|---|
ga4_source_categories.csv | Google's mapping between source and source_category . Downloaded from https://support.google.com/analytics/answer/9756891?hl=en |
Be sure to run dbt seed
before you run dbt run
.
To pull the latest stable release along with minor updates, add the following to your packages.yml
file:
packages:
- package: Velir/ga4
version: [">=6.1.0", "<6.2.0"]
To install the latest code (may be unstable), add the following to your packages.yml
file:
packages:
- git: "https://github.com/Velir/dbt-ga4.git"
packages.yml
to include a reference to this package:packages:
- local: ../dbt-ga4
This package assumes that you have an existing DBT project with a BigQuery profile and a BigQuery GCP instance available with GA4 event data loaded. Source data is defined using the project
and dataset
variables below. The static_incremental_days
variable defines how many days' worth of data to reprocess during incremental runs.
vars:
ga4:
source_project: "my_source_gcp_project" # Project that contains raw GA4 data
property_ids: [11111111] # Array of properties to process
start_date: "YYYYMMDD" # Earliest date to load
static_incremental_days: 3 # Number of days to scan and reprocess on each run
When processing multiple properties at a time, the required variables change slightly. See Multi-Property Support section for details on configuring multiple GA4 properties as a source.
vars:
ga4:
source_project: "my_source_gcp_project" # Project that contains raw GA4 data
combined_dataset: "my_combined_data" # Dataset where multi-property data is cloned
property_ids: [11111111,2222222] # Array of properties to process
start_date: "YYYYMMDD" # Earliest date to load
static_incremental_days: 3 # Number of days to scan and reprocess on each run
Setting query_parameter_exclusions
will remove query string parameters from the page_location
and page_referrer
fields for all downstream processing. Original parameters are captured in the original_page_location
and original_page_referrer
fields. Ex:
vars:
ga4:
query_parameter_exclusions: ["gclid","fbclid","_ga"]
You can remove all query parameters by setting query_parameter_exclusions
to *all*
.
vars:
ga4:
query_parameter_exclusions: ["*all*"]
By default, parameters are removed from URL fragments (elements after the hash # symbol). If you wish to exclude query parameters but keep URL fragments with the same key, you will need to override the remove_query_parameters
macro.
Setting query_parameter_extraction
will extract query string parameters from the page_location
field into new columns. This can be used to extract advertising click IDs into columns that can be joined with advertising data sets. Ex:
vars:
ga4:
query_parameter_extraction: ["gclid","fbclid","keyword"]
Within GA4, you can add custom parameters to any event. These custom parameters will be picked up by this package if they are defined as variables within your dbt_project.yml
file using the following syntax:
[event name]_custom_parameters
- name: "[name of custom parameter]"
value_type: "[string_value|int_value|float_value|double_value]"
For example:
vars:
ga4:
page_view_custom_parameters:
- name: "clean_event"
value_type: "string_value"
- name: "country_code"
value_type: "int_value"
You can optionally rename the output column:
vars:
ga4:
page_view_custom_parameters:
- name: "country_code"
value_type: "int_value"
rename_to: "country"
If there are custom parameters you need on all events, you can define defaults using default_custom_parameters
, for example:
vars:
ga4:
default_custom_parameters:
- name: "country_code"
value_type: "int_value"
User properties are provided by GA4 in the user_properties
repeated field. The most recent user property for each user will be extracted and included in the dim_ga4__users
model by configuring the user_properties
variable in your project as follows:
vars:
ga4:
user_properties:
- user_property_name: "membership_level"
value_type: "int_value"
- user_property_name: "account_status"
value_type: "string_value"
Derived user properties are different from "User Properties" in that they are derived from event parameters. This provides additional flexibility in allowing users to turn any event parameter into a user property.
Derived User Properties are included in the dim_ga4__users
model and contain the latest event parameter value per user.
derived_user_properties:
- event_parameter: "[your event parameter]"
user_property_name: "[a unique name for the derived user property]"
value_type: "[string_value|int_value|float_value|double_value]"
For example:
vars:
ga4:
derived_user_properties:
- event_parameter: "page_location"
user_property_name: "most_recent_page_location"
value_type: "string_value"
- event_parameter: "another_event_param"
user_property_name: "most_recent_param"
value_type: "string_value"
Derived session properties are similar to derived user properties, but on a per-session basis, for properties that change slowly over time. This provides additional flexibility in allowing users to turn any event parameter into a session property.
Derived Session Properties are included in the dim_ga4__sessions
and dim_ga4__sessions_daily
models and contain the latest event parameter or user property value per session.
derived_session_properties:
- event_parameter: "[your event parameter]"
session_property_name: "[a unique name for the derived session property]"
value_type: "[string_value|int_value|float_value|double_value]"
- user_property: "[your user property key]"
session_property_name: "[a unique name for the derived session property]"
value_type: "[string_value|int_value|float_value|double_value]"
For example:
vars:
ga4:
derived_session_properties:
- event_parameter: "page_location"
session_property_name: "most_recent_page_location"
value_type: "string_value"
- event_parameter: "another_event_param"
session_property_name: "most_recent_param"
value_type: "string_value"
- user_property: "first_open_time"
session_property_name: "first_open_time"
value_type: "int_value"
See the README file at /dbt_packages/models/staging/recommended_events for instructions on enabling Google's recommended events.
Specific event names can be specified as conversions by setting the conversion_events
variable in your dbt_project.yml
file. These events will be counted against each session and included in the fct_sessions.sql
dimensional model. Ex:
vars:
ga4:
conversion_events: ['purchase','download']
The stg_ga4__sessions_traffic_sources_last_non_direct_daily
model provides last non-direct session attribution within a configurable lookback window. The default is 30 days, but this can be overridden with the session_attribution_lookback_window_days
variable.
vars:
ga4:
session_attribution_lookback_window_days: 90
Custom events can be generated in your project using the create_custom_event
macro. Simply create a new model in your project and enter the following:
{{ ga4.create_custom_event('my_custom_event') }}
Note, however, that any event-specific custom parameters or default custom parameters must be defined in the global variable space as shown below:
vars:
default_custom_parameters:
- name: "some_parameter"
value_type: "string_value"
my_custom_event_custom_parameters:
- name: "some_other_parameter"
value_type: "string_value"
This package assumes that BigQuery is the source of your GA4 data. Full instructions for connecting DBT to BigQuery are here: https://docs.getdbt.com/reference/warehouse-profiles/bigquery-profile
The easiest option is using OAuth with your Google Account. Summarized instructions are as follows:
gcloud auth application-default login --scopes=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/iam.test
This package uses pytest
as a method of unit testing individual models. More details can be found in the unit_tests/README.md folder.
By default, this package maps traffic sources to channel groupings using the macros/default_channel_grouping.sql
macro. This macro closely adheres to Google's recommended channel groupings documented here: https://support.google.com/analytics/answer/9756891?hl=en .
Package users can override this macro and implement their own channel groupings by following these steps:
default__default_channel_grouping
that accepts the same 3 arguments: source, medium, source_categoryOverriding the package's default channel mapping makes use of dbt's dispatch override capability documented here: https://docs.getdbt.com/reference/dbt-jinja-functions/dispatch#overriding-package-macros
Multiple GA4 properties are supported by listing out the project IDs in the property_ids
variable. In this scenario, the static_incremental_days
variable is required and the combined_dataset
variable will define the dataset (in your profile's target project) where source data will be copied.
vars:
ga4:
property_ids: [11111111, 22222222, 33333333]
static_incremental_days: 3
combined_dataset: "my_combined_dataset"
With these variables set, the combine_property_data
macro will run as a pre-hook to base_ga4_events
and clone shards to the target dataset. The number of days' worth of data to clone during incremental runs will be based on the static_incremental_days
variable.
This package attempts to adhere to the Brooklyn Data style guide found here. This work is in-progress.