EvavW opened this issue 1 month ago
At first glance, I'm on board with this change.
We would also need to filter the formerly null session keys out of the downstream session tables so that metrics don't shift suddenly, or add a configuration that defaults to filtering these out.
For consistency's sake, we probably want to filter out events with the new null event keys as well, and add a default configuration here too.
Longer term, we should think about adding modeling options or instructions to the package similar to GA4's data modeling that would use these "null" key events and sessions.
I would imagine that the `fct_ga4__pages` model would work well as a sample of how to model missing data, summing metrics from `stg_ga4__event_page_view` and `stg_ga4__event_page_view_modeled` by `page_location`.
I don't actually know how to do the modeling, though. Is there a statistical function in BigQuery that does this already?
We don't actually need to show modeling in the package, but I do think the package should give people the option to model their data, and that means we need to handle this null issue with that in mind.
@adamribaudo-velir, do you have any thoughts on handling null `ga_session_id`?
Actually, it occurs to me that the modeled metrics would be a simple calculation: figure out the metric value per event or per session, multiply that value by the number of non-consented events or sessions, and then add that to the actual metric:

`(sum(consented_metric_value) / count(consented_rows)) * count(non_consented_rows) + sum(consented_metric_value)`
This makes sense to do as a macro.
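As a rough sketch, a macro wrapping that calculation could look like the following. The name `model_counting_metrics` is the one used later in this thread; the argument order and the `nullif`/`coalesce` guards are my assumptions, not a settled interface.

```sql
-- macros/model_counting_metrics.sql (hypothetical)
-- Estimates a total metric by assuming non-consented rows behave like consented ones:
-- (consented_value / consented_row_count) * non_consented_row_count + consented_value
{% macro model_counting_metrics(consented_value, consented_rows, non_consented_rows) %}
    ({{ consented_value }} / nullif({{ consented_rows }}, 0))
        * coalesce({{ non_consented_rows }}, 0)
        + {{ consented_value }}
{% endmacro %}
```

The `nullif` guard avoids a divide-by-zero when there are no consented rows, and `coalesce` handles groups with no non-consented rows (e.g. after a left join).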
Also, instead of having two staging models as in my last comment, we could include the consented and non-consented rows in the same model and then filter out the non-consented rows in our marts, unless someone configures the filter variable to include non-consented data.
I see this as being a two-stage change.
In stage 1, we implement @EvavW's suggestion, add a `process_non_consented_data` variable that defaults to false, and then conditionally include or filter these non-consented values at the end of `stg_ga4__events`.
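A minimal sketch of what that stage 1 change could look like at the end of `stg_ga4__events` (the variable name comes from this proposal; the `include_event_key` CTE name mirrors the key-derivation CTEs discussed below, and the exact filter is an assumption):

```sql
-- hypothetical tail of stg_ga4__events
select * from include_event_key
{% if not var('process_non_consented_data', false) %}
-- default behavior: keep filtering out rows whose session could not be keyed
where session_key is not null
{% endif %}
```

With the default of `false`, existing downstream models see the same rows they do today; setting the variable to `true` lets the null-keyed rows flow through.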
In stage 2, we create the `model_counting_metrics` macro and update the fact tables, sessions and pages, to use the macro where appropriate if `process_non_consented_data` is true.
I want to separate out the second stage because we have a lot of logic in our session staging tables that filters out null `session_key` rows. I think with the stage 1 modification, these null `session_key` rows will end up getting processed, and I think that is what we want, but I just want to be sure. For example, we may want to add conditional logic to our session models to set event, session, and last non-direct source/medium, etc., for non-consented sessions to null instead of running those through any window functions.
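One way that conditional logic might look (a rough illustration only; the `events_with_keys` CTE name is hypothetical, and `collected_traffic_source.manual_source` is one of several possible source columns from the GA4 export):

```sql
-- hypothetical: skip attribution for non-consented / un-keyed sessions
select
    *,
    case
        when session_key is null then cast(null as string)  -- no attribution for un-keyed rows
        else first_value(collected_traffic_source.manual_source) over (
            partition by session_key
            order by event_timestamp
        )
    end as session_source
from events_with_keys
```

This keeps the window functions from lumping every null-keyed row into one giant pseudo-session partition.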
Thanks for your response! I like the approach of configuring whether to include previously null sessions. On the question of naming, not every instance of a null `session_key` is caused by non-consent. I think that a null `ga_session_id` is a separate issue; for instance, `user_pseudo_id` is not always null when `ga_session_id` is null.
Could you explain your thinking a bit more on why we would need a special macro to calculate metric value when including previously-null sessions? For a given metric there should be values populated for previously null sessions, so why would we need to derive metric value for previously null sessions from non-null session metric values?
Do you know what causes `ga_session_id` and `user_pseudo_id` to be null? And which combinations of null values indicate non-consent versus other causes?
I was thinking to mimic the behavioral modeling feature in GA4, but that is surplus to the scope of the issue that you've presented here.
I'm thinking maybe this solution. Instead of building `session_key` with `array_to_string`, like this:

```sql
include_session_key as (
    select
        *,
        to_base64(md5(array_to_string([client_key, cast(session_id as string)], ""))) as session_key
    from include_client_key
),
```

we could replace the nulls with explanatory values:

```sql
-- Add key that captures a combination of stream_id and user_pseudo_id to uniquely identify a 'client' (aka a device) within a single stream
include_client_key as (
    select
        *,
        case
            when user_pseudo_id is null then 'null_user_pseudo_id'
            else to_base64(md5(concat(user_pseudo_id, stream_id)))
        end as client_key
    from base_events
),
-- Add key for sessions, replacing formerly null keys with explanatory values. user_pseudo_id may be null due to consent being denied; ga_session_id may be null during audience trigger events.
include_session_key as (
    select
        *,
        case
            when session_id is null and client_key = 'null_user_pseudo_id' then 'null_session_id_and_user_pseudo_id'
            when session_id is null then 'null_session_id'
            when client_key = 'null_user_pseudo_id' then 'null_user_pseudo_id'
            else to_base64(md5(concat(client_key, cast(session_id as string))))
        end as session_key
    from include_client_key
),
```
This way data users can clearly identify the IDs that are null by the client and session key values.
Then, we add a variable that controls whether to include these values.
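For instance (a sketch only; the `process_non_consented_data` variable name comes from the staged plan above, and the filter values match the explanatory strings in the CTEs):

```sql
-- hypothetical final filter in stg_ga4__events
select * from include_event_key
{% if not var('process_non_consented_data', false) %}
where session_key not in (
    'null_session_id_and_user_pseudo_id',
    'null_session_id',
    'null_user_pseudo_id'
)
{% endif %}
```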
Do you want to code this up and get your name added to the contributor list @EvavW?
Any modeling stuff we will keep separate.
Sure I'd be happy to open a PR for this.
I'm not sure why `ga_session_id` is sometimes null. This is something we're investigating; I'll reply back here if we come up with anything.
I'm rethinking this after looking at your solution. Populating `session_key` with those strings would result in duplicated `session_partition_key` values across sessions, which might be even more confusing than leaving them null. I realize now that we would have the same problem (just even more obfuscated) if we generated `session_key` or `client_key` with `array_to_string` where `ga_session_id` or `user_pseudo_id` are null.

We can't really reliably generate session-level metrics for sessions with a null `user_pseudo_id` or a null `ga_session_id` if we can't tell them apart from each other. But we can at least fix this problem at the event level.

We could leave `session_key` and `client_key` as they are and just start using `array_to_string` to derive `event_key`. This would prevent duplicated session and client keys and also populate additional unique event keys that were previously null:
```sql
-- Add key that captures a combination of stream_id and user_pseudo_id to uniquely identify a 'client' (aka a device) within a single stream
include_client_key as (
    select
        *,
        to_base64(md5(concat(user_pseudo_id, stream_id))) as client_key
    from base_events
),
-- Add key for sessions. session_key will be null if client_key is null due to consent being denied. ga_session_id may be null during audience trigger events.
include_session_key as (
    select
        *,
        to_base64(md5(concat(client_key, cast(session_id as string)))) as session_key
    from include_client_key
),
-- Add a key that combines session key and date. Useful when working with session data within date-partitioned tables
include_session_partition_key as (
    select
        *,
        concat(session_key, cast(event_date_dt as string)) as session_partition_key
    from include_session_key
),
-- Add unique key for events. array_to_string omits null elements, so event_key stays populated even when session_key is null.
include_event_key as (
    select
        *,
        to_base64(md5(array_to_string([session_key, event_name, cast(event_timestamp as string), to_json_string(event_params)], ""))) as event_key -- surrogate key for unique events
    from include_session_partition_key
),
```
> Populating `session_key` with those strings would result in duplicated `session_partition_key` across sessions which might be even more confusing than leaving them null.
The various session models include `where session_key is not null` or similar logic. We would need to update that logic to exclude these new null replacements.
My solution follows the best practice of replacing null values with explanatory values; however, the one that you showed does not require any meddling with the session models, and storing null values is free in BigQuery.
I'm fine with either solution. There are good-enough reasons to choose either.
We need two types of logic when modeling the missing data: event-scope logic and session-scope logic.
Writing the two out below, it seems that for event-scope logic we want the rows with null-replaced keys, but for sessions we don't need them.
This model is built on top of the existing fact and dim tables, possibly as a replacement for `fct_ga4__pages`:
```sql
with pv as (
    select * from {{ ref('fct_ga4__event_page_view') }}
)
, consented_values as (
    select
        page_location
        , event_date_dt
        , count(*) as page_views
        , sum(page_value_in_usd) as page_value
        -- insert more metrics here as appropriate
    from pv
    where session_key not in ('null_session_id_and_user_pseudo_id', 'null_session_id', 'null_user_pseudo_id')
    group by 1, 2
)
, non_consented_values as (
    select
        page_location
        , event_date_dt
        , count(*) as non_consented_page_views
        -- for modeling events we only need the count of non-consented events to calculate metrics
    from pv
    where session_key in ('null_session_id_and_user_pseudo_id', 'null_session_id', 'null_user_pseudo_id')
    group by 1, 2
)
, modeled_values as (
    select
        cv.page_location
        , cv.event_date_dt
        -- model_counting_metrics(consented_metric_value, consented_row_count, non_consented_row_count)
        -- expands to: (consented_metric_value / consented_row_count) * non_consented_row_count + consented_metric_value
        , {{ model_counting_metrics('cv.page_value', 'cv.page_views', 'ncv.non_consented_page_views') }} as modeled_page_value
    from consented_values as cv
    left join non_consented_values as ncv using (page_location, event_date_dt)
)
select * from modeled_values
```
For sessions, we just need to count consented sessions and count consented events to get an events_per_session value, and then use the inverse of that to calculate a modeled session total from events_per_session and the total of consented plus non-consented events.
This model is built on top of the existing fact and dim tables and requires a new daily total metrics table for the above calculations:
```sql
-- ses is assumed to select from the sessions model (omitted in this rough sketch)
with consented_values as (
    select
        session_start_date
        , session_source
        , session_medium
        , session_default_channel_grouping
        , count(*) as sessions
        , sum(purchase_count) as purchases
    from ses
    -- may not be necessary depending on what we do in stg_ga4__events
    where session_key not in ('null_session_id_and_user_pseudo_id', 'null_session_id', 'null_user_pseudo_id')
    group by 1, 2, 3, 4
)
, daily as (
    select
        -- we need to estimate the number of sessions based on the number of events
        -- because this calculation happens here, we can exclude the null session keys from upstream session models
        -- they are currently being excluded with a `where session_key is not null` clause; we will need to update that
        -- the agg_ga4__daily model does not exist in the package; I usually create it for clients
        total_daily_consented_sessions
        , total_daily_consented_sessions * (total_daily_page_views / total_daily_consented_page_views) as modeled_total_daily_sessions
    from {{ ref('agg_ga4__daily') }}
)
select
    cv.session_start_date
    , cv.session_source
    , cv.session_medium
    , cv.session_default_channel_grouping
    , cv.sessions * (d.modeled_total_daily_sessions / d.total_daily_consented_sessions) as modeled_sessions
    , cv.purchases * (d.modeled_total_daily_sessions / d.total_daily_consented_sessions) as modeled_purchases
from consented_values as cv
cross join daily as d  -- daily is a single row here; a date join would be needed for multi-day output
```
The above code is rough and only meant to demonstrate how you would use the 'null' values in downstream models.
> My solution follows a best practice to replace null values with explanatory values
Thanks for that explanation, I'll look out for other explanatory values in the package.
On the investigative side, we're still looking into potential causes of missing `ga_session_id`, but we tentatively believe that it is actually related to consent as well. Another note: at least in our case, we only have missing `ga_session_id`s in the Firebase stream. We are looking into the possibility that there is a better parameter specific to Firebase that might show up more consistently than `ga_session_id`.
While we try to figure all that out, I'll open a PR for the solution that only fills in nulls in `event_key`. When we have some answers, I could check back in about a potential second PR to put in explanatory values for `client_key` and `session_key` that we feel confident about.
@dgitis Also, could you grant me permission to contribute to the repo?
Contributions are blocked so that changes can be reviewed properly.
Just create the PR, run the various test suites, and then request a review from me and/or @adamribaudo-velir.
Once the PR has been approved, Adam will merge it into main.
And I should point out that replacing null values with explanatory values is a best practice that we don't currently follow, which is one of the reasons I'm fine with you ignoring it, but it is also why I was asking which circumstances produce different combinations of null values.
Ideally, we would have values like `consent_mode_consent_denied` or `event_fired_before_config_tag`. But if we're not certain of the cause or causes, then the more generic values that I gave are better than potentially misleading, specific values.
I'm just trying to push to a new branch, not to `main`:

```
(venv) Best-Boy:dbt-ga4 vonwele$ git push --set-upstream origin issue_340_fill_null_event_key
remote: Permission to Velir/dbt-ga4.git denied to EvavW.
fatal: unable to access 'https://github.com/Velir/dbt-ga4.git/': The requested URL returned error: 403
```

I haven't contributed to a public repo before, so I might be doing something wrong here.
@dgitis Any idea of what I'm missing to be able to open a PR?
@EvavW you'll need to fork the repo into your account first, create a branch, and then when you create the PR you should have the option to submit it against this repo
Some notes here: https://stackoverflow.com/questions/77996542/submitting-a-pull-request-for-original-repository
I've had it on my TODO to weigh in on this thread. Sorry for the delay.
@adamribaudo-velir thank you!
PR looks good.
I think modeling should be outside of the scope of this project, @dgitis. It's too opinionated. There's no "right" way to model missing data and it's very context-dependent. I'm open to adding anything to the 'analysis' folder (which is never actually run by dbt) as examples, though. BigQuery has some nice methods of generating ML models from SQL.
When `ga_session_id` is null, so is `event_key`. This is problematic when using `event_key` to count unique events. Null `event_key`s could be avoided by using `array_to_string` to join keys rather than `concat`, such as:

This should generate the same keys as the current approach when `ga_session_id` is not null. This same approach could also handle null `user_pseudo_id`s. Are there any concerns with this approach?
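A minimal sketch of that `array_to_string` approach, mirroring the existing `event_key` derivation (assuming BigQuery, where `array_to_string` omits null elements, so `event_key` stays populated when `session_key` is null):

```sql
-- event_key derived with array_to_string instead of concat
include_event_key as (
    select
        *,
        to_base64(md5(array_to_string(
            [session_key, event_name, cast(event_timestamp as string), to_json_string(event_params)],
            ""
        ))) as event_key
    from include_session_partition_key
),
```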