Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License
290 stars 129 forks source link

Deduplicating transactions #193

Closed dgitis closed 8 months ago

dgitis commented 1 year ago

GA4 by default deduplicates purchases with the same transaction_id. We should probably do the same.

I'm currently chasing down a bunch of duplicate transactions and I'm noticing a mix of implementation issues, users reloading the confirmation page, and bots that seem to want to crawl these pages several months after the fact.

I think, for performance reasons, we should just commit to deduplicating transactions within the period being processed and not look back beyond that.

We could do this in stg_ga4__event_purchase, but we would need to unnest the transaction_id from the items array to do it there. Another option is to create a sample fct_ga4__event_purchase table that people can enable and fix it there.

We may also want to consider removing the duplicates from the items table.

Additionally, purchase events where transaction_id is null should not be tracked as these are invalid according to Google, but they come through to the BigQuery export and currently get processed by our package.

Regardless of what we choose, I think we want to keep the purchase with the oldest timestamp.

I think I do the most ecommerce work, but I'd like @adamribaudo and @willbryant to weigh in here. Adam, in particular, has preferred to keep our event staging models consistent and one of the options here involves quite a few changes to one of those models.

adamribaudo-velir commented 1 year ago

@dgitis quick note: the transaction_id is part of the normal event params, not the items array. We already unnest it in stg_ga4__event_purchase.sql.

Seems like it would be straight-forward enough to create a stg_ga4__event_purchase_deduped model on top that uses a window function over the incremental load to dedupe. That sound good?

dgitis commented 1 year ago

I presume that we would put this in the recommended-events folder and set it to disabled by default and then adjust the recommended events documentation accordingly.

I think we should do this for the stg_ga4__event_items model as well possibly using a check for whether the deduped table exists and use the deduped table for purchase items if it does while leaving all other ecommerce event items unchanged.

dgitis commented 1 year ago

I'm going to edit this issue and add one removing transaction_id is null. Those are invalid purchase events according to Google, but they seem to come through the raw data export just fine.

dgitis commented 8 months ago

I'll be working on this, finally, this week.