Dedup transactions - Githubissues

Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.

MIT License

289 stars 128 forks source link

Dedup transactions #277

Closed dgitis closed 8 months ago

dgitis commented 8 months ago

Description & motivation

Resolves #193.

This PR deduplicates transactions on transaction_id within the current processing window as determined by the static_incremental_days variable.

There are a number of possible issues could be fixed here but many of them are in conflict with one another so I chose to keep this as simple as possible.

Checklist

[y ] I have verified that these changes work locally
[ y] I have updated the README.md (if applicable)
[n/a ] I have added tests & descriptions to my models (and macros if applicable)
[ y] I have run dbt test and python -m pytest . to validate existing tests

dgitis commented 8 months ago

Just for reference, the performance additions in 52868e5 process 136.46 MB on a client test site that's been running for over a year compared with 8.8 GB from before the optimization.

The changes reduce the effectiveness when deduplicating, but I prefer to default to the performant option and let people undo that as needed.

adamribaudo-velir commented 8 months ago

The business logic seems sound, but we could use the QUALIFY statement to cut the amount of SQL, remove the 'distinct', and remove a CTE. See here for example: https://github.com/Velir/dbt-ga4/blob/f2f03a2475b3e567a452510e04e01a3464ca59c2/models/staging/base/base_ga4__events.sql#L36C1-L36C26

Something like this I think:

qualify row_number() over(
 partition by transaction_id 
        order by 
            event_timestamp asc rows between unbounded preceding
            and unbounded following
) = 1

dgitis commented 8 months ago

Review the new version when you get an chance. It seems to work fine on my test site.

The processing cost on both versions is equal.