Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support only.
MIT License
312 stars 134 forks

Handling null ga_session_id in event_key creation #340

Open EvavW opened 1 month ago

EvavW commented 1 month ago

When ga_session_id is null, so is event_key. This is problematic when using event_key to count unique events.

Null event_keys could be avoided by using array_to_string to join keys rather than concat, for example:

include_session_key as (
    select 
        *,
        to_base64(md5(array_to_string([client_key, cast(session_id as string)], ""))) as session_key
    from include_client_key
),

This should generate the same keys as the current approach when ga_session_id is not null. This same approach could also handle null user_pseudo_ids. Are there any concerns with this approach?
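To illustrate why this works, here is a small Python sketch mirroring BigQuery's null semantics (CONCAT returns NULL if any argument is NULL, while ARRAY_TO_STRING skips NULL elements); the function names are illustrative, not part of the package:

```python
def bq_concat(*args):
    """Mirror BigQuery CONCAT: returns NULL (None) if any argument is NULL."""
    if any(a is None for a in args):
        return None
    return "".join(args)

def bq_array_to_string(values, delimiter=""):
    """Mirror BigQuery ARRAY_TO_STRING: NULL elements are skipped."""
    return delimiter.join(v for v in values if v is not None)

# concat-based key: a null ga_session_id nulls the whole key
print(bq_concat("client123", None))             # None
# array_to_string-based key: still produces a usable value
print(bq_array_to_string(["client123", None]))  # client123
# when nothing is null, both approaches agree
print(bq_concat("client123", "555") == bq_array_to_string(["client123", "555"]))  # True
```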

dgitis commented 1 month ago

At first glance, I'm on board with this change.

We would also need to filter the formerly null session keys out of the downstream session tables so that metrics don't shift suddenly, or add a configuration that defaults to filtering them out.

For consistency's sake, we probably want to filter out events with the new null event keys as well, and add a default configuration here too.

Longer term, we should think about adding modeling options or instructions to the package similar to GA4's data modeling that would use these "null" key events and sessions.

I would imagine that the fct_ga4__pages model would work well as a sample of how to model missing data, summing metrics from stg_ga4__event_page_view and stg_ga4__event_page_view_modeled by page_location.

I don't actually know how to do the modeling, though. Is there a statistical function in BQ that does this already?

We don't actually need to ship modeling in the package, but I do think the package should give people the option to model their data, and that means we need to handle this null issue with that in mind.

@adamribaudo-velir do you have any thoughts on handling null ga_session_id?

dgitis commented 1 month ago

Actually, it occurs to me that the modeled metrics would be a simple calculation: figure out the metric value per consented event or session, multiply that value by the number of non-consented events or sessions, and add the result to the actual metric: `(sum(consented_metric_value) / count(consented_rows)) * count(non_consented_rows) + sum(consented_metric_value)`.

This makes sense to do as a macro.
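As a sketch of what that macro would compute (the name and signature here are hypothetical, not package code), the formula reduces to scaling the consented per-row average onto the non-consented rows and adding the consented total:

```python
def model_counting_metric(consented_sum, consented_rows, non_consented_rows):
    """(sum(consented_metric_value) / count(consented_rows))
       * count(non_consented_rows) + sum(consented_metric_value)"""
    per_row_value = consented_sum / consented_rows
    return per_row_value * non_consented_rows + consented_sum

# 80 consented rows summing to 400 (5 per row) plus 20 non-consented rows:
print(model_counting_metric(400, 80, 20))  # 500.0
```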

Also, instead of having two staging models, like in my last comment, we include the consented and non-consented rows in the same model and then filter out the non-consented in our marts unless someone configures the filter variable to include non-consented data.

I see this as being a two stage change.

In stage 1, we implement @EvavW's suggestion and add a process_non_consented_data variable that defaults to false and then conditionally include or filter these non-consented values at the end of stg_ga4__events.

In stage 2, we create the model_counting_metrics macro and update the fact tables, sessions and pages, to use the macro where appropriate if process_non_consented_data is true.

I want to separate out the second stage because we have a lot of logic in our session staging tables that filters out null session_key. I think with the stage 1 modification, these null session_key rows will end up getting processed, and I think that is what we want, but I just want to be sure. For example, we may want to add conditional logic to our session models that sets event, session, and last non-direct source, medium, etc. to null for non-consented sessions instead of running those through any window functions.

EvavW commented 3 weeks ago

Thanks for your response! I like the approach of configuring whether to include previously null sessions. On the question of naming, not every instance of a null session_key is caused by non-consent. I think that a null ga_session_id is a separate issue; for instance, user_pseudo_id is not always null when ga_session_id is null.

Could you explain your thinking a bit more on why we would need a special macro to calculate metric value when including previously-null sessions? For a given metric there should be values populated for previously null sessions, so why would we need to derive metric value for previously null sessions from non-null session metric values?

dgitis commented 3 weeks ago

Do you know what causes ga_session_id and user_pseudo_id to be null? And which combinations of null values indicates non-consent and other causes?

I was thinking to mimic the behavioral modeling feature in GA4. That is beyond the scope of the issue that you've presented here.

I'm thinking maybe this solution:

-- Add key that captures a combination of stream_id and user_pseudo_id to uniquely identify a 'client' (aka. a device) within a single stream
include_client_key as (
    select *
    , case
          when user_pseudo_id is null then 'null_user_pseudo_id'
          else  to_base64(md5(concat(user_pseudo_id, stream_id))) 
      end as client_key
    from base_events
),
-- Add key for sessions. session_key will be null if client_key is null due to consent being denied. ga_session_id may be null during audience trigger events. 
include_session_key as (
    select 
        *,
        case
            when session_id is null and client_key = 'null_user_pseudo_id' then 'null_session_id_and_user_pseudo_id'
            when session_id is null then 'null_session_id'
            when client_key = 'null_user_pseudo_id' then 'null_user_pseudo_id'
            else to_base64(md5(concat(client_key, cast(session_id as string))))
        end as session_key
    from include_client_key
),

This way, data users can clearly identify which IDs were null from the client and session key values.

Then, we add a variable that controls whether to include these values.

Do you want to code this up and get your name added to the contributor list @EvavW?

Any modeling stuff we will keep separate.

EvavW commented 3 weeks ago

Sure I'd be happy to open a PR for this.

I'm not sure why ga_session_id is sometimes null. This is something we're investigating, I'll reply back here if we come up with anything.

I'm rethinking this after looking at your solution. Populating session_key with those strings would result in duplicated session_partition_key across sessions which might be even more confusing than leaving them null. I realize now that we would have the same problem (just even more obfuscated) if we generated session_key or client_key with array_to_string where ga_session_id or user_pseudo_id are null. We can't really reliably generate session-level metrics for sessions with a null user_pseudo_id or a null ga_session_id if we can't tell them apart from each other. But, we can at least fix this problem at the event level.
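A quick sketch of the collision concern: any two unidentifiable sessions that share a placeholder key and a date collapse to the same session_partition_key (mirroring the package's `CONCAT(session_key, event_date_dt)` logic in Python):

```python
def session_partition_key(session_key, event_date):
    # mirrors CONCAT(session_key, CAST(event_date_dt as STRING))
    return session_key + event_date

# Two distinct non-consented sessions on the same day become indistinguishable:
key_a = session_partition_key("null_session_id", "2024-01-01")
key_b = session_partition_key("null_session_id", "2024-01-01")
print(key_a == key_b)  # True
```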

We could leave session_key and client_key as they are and just start using array_to_string to derive event_key. This would prevent duplicated session and client keys and also populate additional unique event keys that were previously null:

-- Add key that captures a combination of stream_id and user_pseudo_id to uniquely identify a 'client' (aka. a device) within a single stream
include_client_key as (
    select *
    , to_base64(md5(concat(user_pseudo_id, stream_id))) as client_key
    from base_events
),
-- Add key for sessions. session_key will be null if client_key is null due to consent being denied. ga_session_id may be null during audience trigger events. 
include_session_key as (
    select 
        *,
        to_base64(md5(CONCAT(client_key, CAST(session_id as STRING)))) as session_key
    from include_client_key
),
-- Add a key that combines session key and date. Useful when working with session table within date-partitioned tables
include_session_partition_key as (
    select 
        *,
        CONCAT(session_key, CAST(event_date_dt as STRING)) as session_partition_key
    from include_session_key
),
-- Add unique key for events
include_event_key as (
    select 
        *,
        to_base64(md5(array_to_string([session_key, event_name, CAST(event_timestamp as STRING), to_json_string(event_params)], ""))) as event_key -- Surrogate key for unique events.  
    from include_session_partition_key
),
dgitis commented 3 weeks ago

Populating session_key with those strings would result in duplicated session_partition_key across sessions which might be even more confusing than leaving them null.

The various session models include `where session_key is not null` or similar logic. We would need to update that to exclude these new null replacements.

My solution follows a best practice of replacing null values with explanatory values; however, the one that you showed does not require any meddling with the session models, and storing null values is free in BQ.

I'm fine with either solution. There are good-enough reasons to choose either.

We need two types of logic when modeling the missing data. Event-scope logic and session-scope logic.

Writing the two out below, it seems like for event-scope logic we want to have the rows with null-replaced keys, but for sessions, we don't need rows with null-replaced keys.

Events

This model is built on top of the existing fact and dim tables or possibly as a replacement for fct_ga4__pages.

with pv as (
   select * from {{ref('fct_ga4__event_page_view')}}
)
, consented_values as (
  select
    page_location
    , event_date_dt
    , count(*) as page_views
    , sum(page_value_in_usd) as page_value
    -- insert more metrics here as appropriate
  from pv
  where session_key not in ('null_session_id_and_user_pseudo_id', 'null_session_id', 'null_user_pseudo_id') 
  group by 1,2
)
, non_consented_values as (
  select
    page_location
    , event_date_dt
    , count(*) as non_consented_page_views
    -- for modeling events we only need the count of non-consented events to calculate metrics
  from pv
  where session_key in ('null_session_id_and_user_pseudo_id', 'null_session_id', 'null_user_pseudo_id') 
  group by 1,2
)
, modeled_values as (
  select
    cv.page_location
    , cv.event_date_dt
    -- below should be a macro with this calc:
    -- model_counting_metrics(consented_metric_value, consented_metric_row_count, non_consented_rows)
    -- (sum(consented_metric_value)/count(consented_rows)) * count(non_consented_rows) + sum(consented_metric_value)
    , model_counting_metrics(cv.page_value, cv.page_views, ncv.non_consented_page_views) as modeled_page_value
  from consented_values as cv
  left join non_consented_values as ncv using(page_location, event_date_dt)
)
select * from modeled_values

Sessions

For sessions, we just need to count consented sessions and count consented events to get an events_per_session figure, and then use the inverse of that to calculate a modeled session total based on events_per_session and the combined total of consented and non-consented events.
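That inverse calculation can be sketched in Python (illustrative names, not package code): derive events_per_session from consented data only, then divide the total event count by it to estimate total sessions:

```python
def modeled_total_sessions(consented_sessions, consented_events, total_events):
    """Estimate total sessions from total events via consented events_per_session."""
    events_per_session = consented_events / consented_sessions
    return total_events / events_per_session

# 100 consented sessions with 500 consented events (5 events/session);
# 600 total events implies roughly 120 sessions overall:
print(modeled_total_sessions(100, 500, 600))  # 120.0
```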

This model is built on top of the existing fact and dim tables and requires a new daily total metrics table for the above calculations.

-- assumes a `ses` CTE selecting from the existing sessions fact table (omitted here)
with consented_values as (
  select
    session_start_date
    , session_source
    , session_medium
    , session_default_channel_grouping
    , count(*) as sessions
    , sum(purchase_count) as purchases
  from ses
  -- may not be necessary depending on what we do in stg_ga4__events
  where session_key not in ('null_session_id_and_user_pseudo_id', 'null_session_id', 'null_user_pseudo_id') 
  group by 1,2,3,4
)
, daily as (
  select
    -- we need to estimate the number of sessions based on the number of events
    -- because this calculation happens here, we can exclude the null session keys from upstream session models
    -- they are currently being excluded with a `where session_key is not null` clause
    -- we will need to update that
    -- the agg_ga4__daily model does not exist in the package; I usually create it for clients
     total_daily_consented_sessions
     , total_daily_consented_sessions * (total_daily_page_views / total_daily_consented_page_views) as modeled_total_daily_sessions
  from {{ ref('agg_ga4__daily') }}
)
select 
    cv.session_start_date
    , cv.session_source
    , cv.session_medium
    , cv.session_default_channel_grouping
    , cv.sessions * (d.modeled_total_daily_sessions / d.total_daily_consented_sessions) as modeled_sessions
    , cv.purchases * (d.modeled_total_daily_sessions / d.total_daily_consented_sessions) as modeled_purchases
from consented_values as cv
cross join daily as d -- daily is a single-row aggregate in this rough sketch

The above code is rough and only meant to demonstrate how you would use the 'null' values in downstream models.

EvavW commented 3 weeks ago

My solution follows a best practice to replace null values with explanatory values

Thanks for that explanation, I'll look out for other explanatory values in the package.

On the investigative side, we're still looking into potential causes of missing ga_session_id but we believe tentatively that it is actually related to consent as well. Another note is that at least in our case, we only have missing ga_session_ids in the Firebase stream. We are looking into the possibility that there is a better parameter specific to Firebase that might show up more consistently than ga_session_id.

While we still try to figure all that out, I'll open a PR for the solution that only fills in nulls in event_key. When we have some answers I could check back in about a potential second PR to put in some explanatory values for client_key and session_key that we feel confident about.

EvavW commented 3 weeks ago

@dgitis Also, could you grant me permission to contribute to the repo?

dgitis commented 3 weeks ago

Contributions are blocked so that changes can be reviewed properly.

Just create the PR, run the various test suites, and then request for me and/or @adamribaudo-velir to review it.

Once the PR has been approved, Adam will merge it into main.

dgitis commented 3 weeks ago

And I should point out that replacing null values with explanatory values is a best practice that we don't currently follow, which is one of the reasons I'm fine with you ignoring it, but it is also why I was asking which circumstances produce different combinations of null values.

Ideally, we would have values like consent_mode_consent_denied or event_fired_before_config_tag. But if we're not certain of the cause or causes, then the more generic values that I gave are better than potentially misleading, specific values.

EvavW commented 3 weeks ago

I'm just trying to push to a new branch, not to main

(venv) Best-Boy:dbt-ga4 vonwele$ git push --set-upstream origin issue_340_fill_null_event_key
remote: Permission to Velir/dbt-ga4.git denied to EvavW.
fatal: unable to access 'https://github.com/Velir/dbt-ga4.git/': The requested URL returned error: 403

I haven't contributed to a public repo before so I might be doing something wrong here

EvavW commented 3 weeks ago

@dgitis Any idea of what I'm missing to be able to open a PR?

adamribaudo-velir commented 3 weeks ago

@EvavW you'll need to fork the repo into your account first, create a branch, and then when you create the PR you should have the option to submit it against this repo

Some notes here: https://stackoverflow.com/questions/77996542/submitting-a-pull-request-for-original-repository

I've had it on my TODO to weigh in on this thread. Sorry for the delay.

EvavW commented 3 weeks ago

@adamribaudo-velir thank you!

EvavW commented 3 weeks ago

Opened https://github.com/Velir/dbt-ga4/pull/341

adamribaudo-velir commented 3 weeks ago

PR looks good.

I think modeling should be outside of the scope of this project, @dgitis. It's too opinionated. There's no "right" way to model missing data and it's very context-dependent. I'm open to adding anything to the 'analysis' folder (which is never actually run by dbt) as examples, though. BigQuery has some nice methods of generating ML models from SQL.