Closed — dgitis closed this pull request 8 months ago
Just for reference: with the performance additions in 52868e5, a run processes 136.46 MB on a client test site that has been running for over a year, compared with 8.8 GB before the optimization.
The changes reduce deduplication effectiveness, but I prefer to default to the performant option and let people undo that as needed.
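That reduction presumably comes from limiting the incremental scan to the processing window instead of reading the whole events table. A minimal sketch of that kind of partition filter, assuming a date-partitioned source with an `event_date_dt` column (the column name and default value are illustrative, not necessarily what 52868e5 contains):

```sql
-- Hypothetical incremental filter: on incremental runs, only scan the
-- last static_incremental_days partitions rather than the full table.
{% if is_incremental() %}
where event_date_dt >= date_sub(current_date(), interval {{ var('static_incremental_days', 3) }} day)
{% endif %}
```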
The business logic seems sound, but we could use the QUALIFY clause to cut the amount of SQL, remove the `distinct`, and remove a CTE. See here for an example: https://github.com/Velir/dbt-ga4/blob/f2f03a2475b3e567a452510e04e01a3464ca59c2/models/staging/base/base_ga4__events.sql#L36C1-L36C26
Something like this, I think:

```sql
-- keep the earliest event per transaction_id
-- (note: BigQuery disallows a window frame clause with numbering
-- functions such as row_number, so none is specified here)
qualify row_number() over (
    partition by transaction_id
    order by event_timestamp asc
) = 1
```
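For comparison, the pattern QUALIFY would replace looks roughly like this: a numbering CTE plus a filter on the row number. This is a sketch with illustrative names (`source_events` and `transaction_row` are placeholders, not the PR's actual identifiers):

```sql
-- Equivalent dedup without QUALIFY: an extra CTE, an extra column,
-- and a wrapping select just to filter on the row number.
with numbered as (
    select
        *,
        row_number() over (
            partition by transaction_id
            order by event_timestamp asc
        ) as transaction_row
    from source_events  -- placeholder for the upstream model or CTE
)

select * from numbered
where transaction_row = 1
```

QUALIFY collapses all of this into a single clause on the original select, which is why it removes both the CTE and the need for `distinct`.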
Review the new version when you get a chance. It seems to work fine on my test site.
The processing cost of both versions is equal.
Description & motivation
Resolves #193.
This PR deduplicates transactions on `transaction_id` within the current processing window as determined by the `static_incremental_days` variable. There are a number of possible issues that could be fixed here, but many of them conflict with one another, so I chose to keep this as simple as possible.
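As a rough sketch of how those pieces fit together in an incremental model (the model and column names here are assumptions for illustration, not the PR's actual code):

```sql
-- Hypothetical dbt model: dedupe purchase events on transaction_id,
-- but only within the static_incremental_days processing window.
{{ config(materialized='incremental') }}

select *
from {{ ref('base_ga4__events') }}
where event_name = 'purchase'
{% if is_incremental() %}
    and event_date_dt >= date_sub(current_date(), interval {{ var('static_incremental_days', 3) }} day)
{% endif %}
-- keep the earliest event per transaction_id
qualify row_number() over (
    partition by transaction_id
    order by event_timestamp asc
) = 1
```

Scoping the dedup to the window means a duplicate landing outside the window won't be caught, which is the simplicity trade-off described above.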
Checklist
- Ran `dbt test` and `python -m pytest .` to validate existing tests