Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License
289 stars 128 forks source link

basic xclid concept #247

Closed dgitis closed 10 months ago

dgitis commented 12 months ago

Description & motivation

Resolves #234

The purpose of this PR is to give users the option to configure whatever click IDs that they want to track like gclid, fbclid as well as numerous other attribution and ad platform click IDs.

Todo:

Checklist

dgitis commented 12 months ago

As per @willbryant's request in the linked issue, this attempts to put these click IDs into records under xclid.name and xclid.value similar to the event_params and user_props.

The issue that I'm having is removing the null values from this code.

        {% if var('xclid', none) != "none" %}
            array_agg(
                {%- for clid in var('xclid') -%}
                    if( substr_contains( 'page_location' , {{clid + "="}} )
                        struct(   
                            '{{clid}}' as name,
                            {{ extract_query_parameter_value( 'page_location' , clid ) }} as value
                        )){% if not loop.last %},{% endif %}
                    {% endif %}
                {%- endfor -%}
            ) as xclid,
        {% endif %}

The problem here is that I need to mix Jinja and SQL conditions to add or exclude the trailing comma in {% if not loop.last %},{% endif %}.

I believe this code will break when you have trailing xclid array variable values that aren't used because the code will pass the first, Jinja condition {%- for clid in var('xclid') -%} but fail the second, SQL condition if( substr_contains( 'page_location' , {{clid + "="}} ).

The final Jinja condition, {% if not loop.last %},{% endif %} will add multiple trailing slashes when it fails the SQL condition.

I'm not sure how to handle this. Can I maybe nest this in a select statement with a where name is not null?

adamribaudo-velir commented 12 months ago

My suggestion would be to generate a table of event keys and xclid param/values and then pivoting it (this would require new models). So that

event_key param_name param_value
1 gclid foo
1 tclid bar

becomes

event_key gclid tclid
1 foo bar

You can later join the pivoted param data w/ the event data

Also, can you mark this as a draft PR if it's not yet ready for code review?

willbryant commented 12 months ago

Separate tables are much less efficient to query though, what's the advantage?

adamribaudo-velir commented 12 months ago

If I understand the goal of the PR correctly, we're trying to pivot X number of URL parameters from rows to columns and the value of X can change from record to record. That sounds like a pivot operation, yes? Whether it's a CTE or separate model I don't have a strong opinion, I'm just recommending looking into using a pivot rather than using Jinja to parse the number of parameters included in each record.

You can always incrementally join this data back with the original event records so that the practical effect would be 1 table with the necessary columns available.

Also, don't forget that we already have stg_ga4__event_to_query_string_params which generates the first table in my comment.

dgitis commented 12 months ago

I've been thinking on these requirements and I think maybe this particular solution isn't quite right.

I have an issue with creating a record because, unless you are doing data viz on Looker Studio, users will need to unnest that record in their marts. The place where I would expect this unnesting to be done is in stg_ga4__events where we are currently planning on creating this issue.

Instead, I think we should adopt two variables: mutually_exclusive_click_ids and non_mutually_exclusive_click_ids.

For most users, the first variable is all that they need. It's for gclid, fbclid, and all of the click IDs that come from their own specific ad inventory and can't be mixed.

However, for the attribution platforms, that can be plugged in to ad platforms adding a second click ID to the ad platform, users would need to use the non_mutually_exclusive_click_ids variable.

The click IDs in the mutually exclusive variable would populate two dedicated text fields in stg_ga4__events: xclid_name and xclid_value.

The click IDs in non-mutually exclusive variable would create a new column for each with the variable as the name that would populate with that ID.

I know that Adam prefers to avoid variables when possible, but using these two variables helps minimize the database size while keeping it flat to maximize compatibility.

dgitis commented 12 months ago

If we push this in the separate table direction, then we'd want to do it on the session_key rather than the event_key and we'd only populate where the click IDs are not null.

This might end up being more efficient as we wouldn't be having null values in rows from (often the majority of) traffic that doesn't have click IDs.

willbryant commented 12 months ago

OK. Given the feedback here, and that I have not found any cases where we see non-mutually-exclusive tracking IDs on a single page request, maybe we need to wind it back a bit towards what you started with @dgitis and just do a single name & a single value column on page requests, and only handle the multiple case at the session level?

dgitis commented 10 months ago

RESOLVED BY #252