Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License
324 stars 135 forks source link

GA4 raw data format can break google / organic gclid fix #302

Closed dgitis closed 8 months ago

dgitis commented 8 months ago

It seems that Google has started sometimes setting source and medium values to google / organic when a visitor clicks an ad.

This breaks the code that we use to fix google / cpc source/medium values that are mis-attributed to google / organic in the stg_ga4__events file.

detect_gclid as (
    select
        * except (event_source, event_medium, event_campaign),
        case
            when (page_location like '%gclid%' and event_source is null) then "google"
            else event_source
        end as event_source,
        case
            when (page_location like '%gclid%' and event_medium is null) then "cpc"
            else event_medium
        end as event_medium,
        case
            when (page_location like '%gclid%' and event_campaign is null) then "(cpc)"
            else event_campaign
        end as event_campaign
    from include_event_key

Here's a graph of raw event counts by day with a page_location containing 'gclid' by medium for one site:

image

This image shows source and medium appearing in the event_params.key fields. The values are truncated to protect private data, but the values for source / medium are 'google / organic'.

image

I've confirmed that this is an issue on older GA4 properties.

I know we don't like adding variables whenever we can avoid it, but I think the fix has to be that we add a variable like use_google_ads_manual_utm_for_attribution and if that's true then we use event_source, event_medium etc... in stg_ga4__events and otherwise default to automatically setting source / medium when a gclid is detected.

We also need to verify that manually tagged UTMs from Google Ads set the source / medium to the tagged values rather than google / organic.

What are your thoughts? Are any of you working on a site that manually tags UTMs from Google Ads?

@adamribaudo @willbryant

dgitis commented 8 months ago

It occurs to me now that I think about it that we don't need a variable.

We can just do something like this:

        case
            when (page_location like '%gclid%' and event_medium is null) then "cpc"
            when (page_location like '%gclid%' and event_medium = 'organic') then "cpc"
            else event_medium
        end as event_medium,
cfrye2 commented 8 months ago

@dgitis in my affected GA4 dataset, I see the campaign value is also set to "organic". Are you seeing similar?

If so, would the resolution then also update the campaign case statement so campaign gets the same conversion to "(cpc)? Something like this?...

detect_gclid as (
    select
        * except (event_source, event_medium, event_campaign),
        case
            when (page_location like '%gclid%' and event_source is null) then "google"
            else event_source
        end as event_source,
        case
            when (page_location like '%gclid%' and event_medium is null) then "cpc"
            when (page_location like '%gclid%' and event_medium = 'organic') then "cpc"
            else event_medium
        end as event_medium,
        case
            when (page_location like '%gclid%' and event_campaign is null) then "(cpc)"
            when (page_location like '%gclid%' and event_campaign = 'organic') then "(cpc)"
            else event_campaign
        end as event_campaign
    from include_event_key
dgitis commented 8 months ago

Yes, my answer was incomplete. Your code should resolve this issue completely.

Feel free to clone the dbt-GA4 repository, create a branch, make your changes, publish the branch and then submit a PR. You'll get credit that way for your contribution. It's a good first contribution.

adamribaudo-velir commented 8 months ago

The fix seems straight-forward, thanks. I'm just a little hung up on whether it's appropriate to override the medium to cpc when the URL has utm_medium=organic. Does anyone know why this behavior changed? Is there any documentation or are there related release notes from Google?

dgitis commented 8 months ago

The issue is with medium=organic and not utm_medium=organic. Before, Google used to have medium=null and then would set the medium to organic based on its server-side rule set.

Now they seem to be moving the attribution logic client-side, so a visit from Google Ads with no UTM codes is sometimes set to null and sometimes set to organic where our CPC fix currently only fixes the when the medium is null. I believe this happened when Google started setting source and medium fields on the session_start and first_visit events.

I do understand your hesitation with overriding UTM codes. This fix would override utm_medium=organic because that will then set medium=organic, but it shouldn't be an issue in practice as no one can add UTM codes to legitimate organic traffic and this only overrides the medium when a gclid is present.

If you want, I can add logic to extract the UTMs in the case statement and use those first, but I think it won't change anything in practice as it seems unlikely that anyone would want to set the UTM to organic on Google Ads traffic.

It occurs to me that the test in the PR needs to be changed to test against original_page_location rather than page location. I'll fix that now.