Open erikverheij opened 6 months ago
Seems to me like the issue is with the test - at the very least it should include stream_id
. @dgitis ?
I'd like to hear more about why having a page_location set to /
is a problem, @erikverheij . Can you expand on that, ignoring the fct_ga4__pages model for a minute?
Hi @adamribaudo-velir, thanks for your quick reply.
Usually, the page_location
contains a full domain name. I'm not sure how these entries entered our GA4 account.
@erikverheij oh ok. For some reason I thought page_location
didn't include the hostname. It sounds like there's an issue with your data collection then, yes? Maybe that field is being overwritten with page path
in GTM or something?
Regardless, the faulty data has already been collected so I think your best bet is to set the severity of the error you noticed to warn
so that it doesn't block your dbt build https://docs.getdbt.com/reference/resource-configs/severity
We had an issue with multi-site related to this that was recently fixed by @yamotech.
That test should be column_name: "(event_date_dt || stream_id || page_location)"
. I think that was before the last release so updating your packages.yml
file to pull from Git will work until a new release comes out.
packages:
- git: "https://github.com/Velir/dbt-ga4.git"
However, that would not cause the page_location
to be a /
.
This looks to me to be a data collection problem like maybe someone decided to override the page_location
with the page path thus stripping the domain.
Hi @dgitis, thx for the suggestion.
This looks to me to be a data collection problem like maybe someone decided to override the page_location with the page path thus stripping the domain.
Looks something like that indeed.. Is there a way to hook into the flow somewhere at the start or even before the start to remove these entries before transforming with this package?
The failure is at fct_ga4__pages
.
The package is designed for transformation that happen to all events get put in stg_ga4__events
and event-specific transformations happen in the corresponding stg_ga4__event_*
model.
What you should do is override the stg_ga4__events
model and slip in a step that adds the hostname back in to page_location
so that it doesn't error downstream.
To do this, you'll need to create a stg_ga4__events
model in your project models folder (anywhere is fine, but I recommend using folders for organization), copy over the contents of the package stg_ga4__events
file and modify the code to fix page_location
before that field gets processed in other ways.
Then you'll need to disable the package version in your dbt_project.yml
file's models section like this:
models:
ga4:
staging:
stg_ga4__events:
+enabled: false
That code was done from memory, so it may not be exactly correct.
Just an idea @erikverheij , but you could override the 'base select' macro to insert your own logic for parsing page_location
rather than overriding the stg_ga4__events
model https://github.com/Velir/dbt-ga4/blob/main/macros/base_select.sql
Only slightly cleaner, but thought I'd throw it out there.
Thanks guys for providing some ways to workaround it! That will do the trick for me for now.
It would be very nice if there's a way to do some filtering before everything starts, without the need to override logic from the package. I'm not experienced in DBT, but perhaps something like this would be possible?:
Adding a placeholder macro with a WHERE
clause in base_ga4__events that contains no logic in itself but can be overwritten to filter out some specific events.
Note, that this is only a suggestion for the nice-to-have list as the workaround works for me. Such a feature would make it easier to stay aligned with the logic from this repo.
I've definitely thought about adding a custom SQL variable that, if set, adds a with custom_sql as ( {{ var('model_name_custom_sql') }})
block or, more likely some sort of {{ if macro_exists('model_name_custom_sql') }}
logic, but I feel like both of those options are more complex than overriding the existing macros/models using dbt-native mechanisms.
I've encountered an issue with some events in the
base_ga4__events
model where page_location is set to '/' instead of containing a full URL. This issue appears across multiple GA4 properties and leads to failures in the tests defined forfct_ga4__pages
, specifically the uniqueness test on the combination ofpage_location
andevent_date_dt
.Here's the test that's failing:
The failure message is:
To resolve this issue, I'm considering a workaround where records with
page_location = '/'
are removed at an early stage in the transformation process. However, I'm unsure of the best approach to implement this workaround without diverging significantly from the package's intended usage and maintaining upgradeability.Could you provide guidance on how to best address malformed page_location values within the framework of the package? Is there a recommended approach for filtering out or correcting these records before they impact downstream models and tests?
Thank you for your assistance.