Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License

Different row count between raw data from ga4 and base_ga4__events #297

Closed (tegardp closed this issue 5 months ago)

tegardp commented 5 months ago

Background: I'm currently using this package to generate session-based data from raw GA4 event data, in order to migrate our company's old data marts to GA4 data.

Problem Statement: As the title says, the row count differs between the raw GA4 data and base_ga4__events.

Steps to Reproduce: Run this package on your own raw data, then compare the row count of the raw table with that of base_ga4__events.

Expected: Both tables have the same row count.

Actual Behavior: Around 200k rows, or 0.4% of data points, are missing.

Is there any configuration I'm missing?

I only added the config below to my dbt project:

vars:
  ga4:
    source_project: 
    property_ids: 
    start_date: "20240118" # Earliest date to load
    static_incremental_days: 3 # Number of days to scan and reprocess on each run
    query_parameter_exclusions: ["gclid","fbclid","_ga"] 
    query_parameter_extraction: ["gclid","fbclid","keyword"] 
    user_properties:
      - user_property_name: "some_property_name"
        value_type: "string_value"

adamribaudo-velir commented 5 months ago

Hi @tegardp, there is no expectation that the row counts should match. We dedupe events based on the parameters listed here:

https://github.com/Velir/dbt-ga4/blob/032471b85dae9659f38196fd0217868d378fcdd7/models/staging/base/base_ga4__events.sql#L36

So it is normal to have fewer events in the base model than in the raw data.
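To illustrate why deduplication lowers the row count, here is a minimal Python sketch that keeps the first row for each key. The key fields (`user_pseudo_id`, `event_name`, `event_timestamp`) and the sample rows are assumptions for illustration only; the actual key is whatever the linked line of `base_ga4__events.sql` partitions by.

```python
# Hypothetical dedupe key; the package's real key is defined in the linked SQL.
KEY_FIELDS = ("user_pseudo_id", "event_name", "event_timestamp")


def dedupe(events, key_fields):
    """Keep only the first event seen for each combination of key fields."""
    seen = set()
    kept = []
    for event in events:
        key = tuple(event[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept


# Made-up sample rows standing in for the raw export table.
raw_events = [
    {"user_pseudo_id": "u1", "event_name": "page_view", "event_timestamp": 1000},
    {"user_pseudo_id": "u1", "event_name": "page_view", "event_timestamp": 1000},  # same key as the row above
    {"user_pseudo_id": "u1", "event_name": "scroll", "event_timestamp": 1000},
    {"user_pseudo_id": "u2", "event_name": "page_view", "event_timestamp": 1000},
]

base_events = dedupe(raw_events, KEY_FIELDS)
print(len(raw_events), len(base_events))
```

Any two raw rows that share every key field collapse into one, so the deduplicated set can never have more rows than the raw data.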

tegardp commented 5 months ago

Yeah, I'm aware of that line of code.

But if I remove that line, the count matches the one in Google Analytics. If I keep it, there are bound to be some discrepancies :/
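One way to investigate a discrepancy like this is to list the key groups that contain more than one raw row, then check whether those rows are true duplicates or distinct events that happen to share a key. The sketch below assumes a hypothetical key of `user_pseudo_id`, `event_name`, and `event_timestamp` and uses made-up rows; it is not the package's own logic.

```python
from collections import Counter

# Hypothetical dedupe key, chosen only for illustration.
KEY_FIELDS = ("user_pseudo_id", "event_name", "event_timestamp")


def duplicate_groups(events, key_fields):
    """Return {key: row_count} for every key that occurs more than once."""
    counts = Counter(tuple(event[field] for field in key_fields) for event in events)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up sample rows standing in for the raw export table.
raw_events = [
    {"user_pseudo_id": "u1", "event_name": "page_view", "event_timestamp": 1000},
    {"user_pseudo_id": "u1", "event_name": "page_view", "event_timestamp": 1000},
    {"user_pseudo_id": "u1", "event_name": "page_view", "event_timestamp": 1000},
    {"user_pseudo_id": "u2", "event_name": "page_view", "event_timestamp": 2000},
]

dupes = duplicate_groups(raw_events, KEY_FIELDS)
# Each group of n rows loses n - 1 rows when deduplicated.
dropped = sum(n - 1 for n in dupes.values())
print(dupes, dropped)
```

If `dropped` equals the gap between the raw table and the deduplicated model, the dedupe step accounts for the whole difference.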