Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License
312 stars 134 forks source link

Workaround for Malformed Events with page_location = '/' in fct_ga4__pages #311

Open erikverheij opened 6 months ago

erikverheij commented 6 months ago

I've encountered an issue with some events in the base_ga4__events model where page_location is set to '/' instead of containing a full URL. This issue appears across multiple GA4 properties and leads to failures in the tests defined for fct_ga4__pages, specifically the uniqueness test on the combination of page_location and event_date_dt.

Here's the test that's failing:

# models/marts/core/fct_ga4__pages.yml
# ... 
tests:
  - unique:
      column_name: "(page_location || event_date_dt)"

The failure message is:

Failure in test unique_fct_ga4__pages__page_location_event_date_dt_ (models/marts/core/fct_ga4__pages.yml)

To resolve this issue, I'm considering a workaround where records with page_location = '/' are removed at an early stage in the transformation process. However, I'm unsure of the best approach to implement this workaround without diverging significantly from the package's intended usage and maintaining upgradeability.

Could you provide guidance on how to best address malformed page_location values within the framework of the package? Is there a recommended approach for filtering out or correcting these records before they impact downstream models and tests?

Thank you for your assistance.

adamribaudo-velir commented 6 months ago

Seems to me like the issue is with the test - at the very least it should include stream_id. @dgitis ?

I'd like to hear more about why having a page_location set to / is a problem, @erikverheij . Can you expand on that, ignoring the fct_ga4__pages model for a minute?

erikverheij commented 6 months ago

Hi @adamribaudo-velir, thanks for your quick reply.

Usually, the page_location contains a full domain name. I'm not sure how these entries entered our GA4 account.

adamribaudo-velir commented 6 months ago

@erikverheij oh ok. For some reason I thought page_location didn't include the hostname. It sounds like there's an issue with your data collection then, yes? Maybe that field is being overwritten with page path in GTM or something?

Regardless, the faulty data has already been collected so I think your best bet is to set the severity of the error you noticed to warn so that it doesn't block your dbt build https://docs.getdbt.com/reference/resource-configs/severity

dgitis commented 6 months ago

We had an issue with multi-site related to this that was recently fixed by @yamotech.

That test should be column_name: "(event_date_dt || stream_id || page_location)". I think that was before the last release so updating your packages.yml file to pull from Git will work until a new release comes out.

packages:
  - git: "https://github.com/Velir/dbt-ga4.git"

However, that would not cause the page_location to be a /.

This looks to me to be a data collection problem like maybe someone decided to override the page_location with the page path thus stripping the domain.

erikverheij commented 6 months ago

Hi @dgitis, thx for the suggestion.

This looks to me to be a data collection problem like maybe someone decided to override the page_location with the page path thus stripping the domain.

Looks something like that indeed.. Is there a way to hook into the flow somewhere at the start or even before the start to remove these entries before transforming with this package?

dgitis commented 6 months ago

The failure is at fct_ga4__pages.

The package is designed for transformation that happen to all events get put in stg_ga4__events and event-specific transformations happen in the corresponding stg_ga4__event_* model.

What you should do is override the stg_ga4__events model and slip in a step that adds the hostname back in to page_location so that it doesn't error downstream.

To do this, you'll need to create a stg_ga4__events model in your project models folder (anywhere is fine, but I recommend using folders for organization), copy over the contents of the package stg_ga4__events file and modify the code to fix page_location before that field gets processed in other ways.

Then you'll need to disable the package version in your dbt_project.yml file's models section like this:

models:
  ga4:
    staging:
      stg_ga4__events:
        +enabled: false

That code was done from memory, so it may not be exactly correct.

adamribaudo-velir commented 6 months ago

Just an idea @erikverheij , but you could override the 'base select' macro to insert your own logic for parsing page_location rather than overriding the stg_ga4__events model https://github.com/Velir/dbt-ga4/blob/main/macros/base_select.sql

Only slightly cleaner, but thought I'd throw it out there.

erikverheij commented 5 months ago

Thanks guys for providing some ways to workaround it! That will do the trick for me for now.

It would be very nice if there's a way to do some filtering before everything starts, without the need to override logic from the package. I'm not experienced in DBT, but perhaps something like this would be possible?:

Adding a placeholder macro with a WHERE clause in base_ga4__events that contains no logic in itself but can be overwritten to filter out some specific events.

Note, that this is only a suggestion for the nice-to-have list as the workaround works for me. Such a feature would make it easier to stay aligned with the logic from this repo.

dgitis commented 5 months ago

I've definitely thought about adding a custom SQL variable that, if set, adds a with custom_sql as ( {{ var('model_name_custom_sql') }}) block or, more likely some sort of {{ if macro_exists('model_name_custom_sql') }} logic, but I feel like both of those options are more complex than overriding the existing macros/models using dbt-native mechanisms.