yamotech closed this pull request 7 months ago.
@yamotech, thank you for this.
I've done a quick review of the code and it looks fine, but I want to test it on a project first. I have a client with six combined GA4 properties that should work for testing.
Pardon my ignorance, but how do I check whether parallel cloning is running in parallel?
Ideally, I'd like to write a Python test to verify this behavior, but at a minimum I want to confirm that the code works.
Also, why do you nest the `combine_specified_property_data` macro in the `combine_property_data` macro? The outer macro basically does nothing other than a `for ... in` statement. Is this structure necessary for the parallelism to work?
Minimally, I'll want you to update the macro to use `adapter.dispatch` with a `default__combine_specified_property_data`. This allows people to override the macro, which I think is pretty important in this case.
> I've done a quick review of the code and it looks fine, but I want to test it on a project first. I have a client with six combined GA4 properties that should work for testing.
>
> Pardon my ignorance, but how do I check whether parallel cloning is running in parallel?
Thanks for the review, @dgitis! I apologize for the lack of explanation, but this pull request does not address parallel cloning.
I worked on parallel cloning in #303, but closed it because the implementation was inadequate.
You can confirm parallel cloning behavior by running `$ dbt run -s +base_ga4__events` on the branch from #303.
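If it helps, here is one rough way to see whether the clones actually overlap (a sketch, not a proper test: it assumes dbt's default text log at `logs/dbt.log`, and relies on #303 materializing one `int_ga4__combine_property_{{ INDEX }}` model per property, so each clone shows up as an ordinary model in the run):

```shell
# Run the upstream models with several threads, then inspect the log:
# if a model's START line appears before an earlier model's "OK created"
# line, their execution intervals overlapped, i.e. they ran in parallel.
dbt run -s +base_ga4__events --threads 8
grep -E 'START|OK created' logs/dbt.log | head -40
```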
Reasons for the poor implementation:

- A `models/staging/base/intermediate/combine_property/int_ga4__combine_property_{{ INDEX }}.sql` file has to be added for each configurable property_id.
- Each `int_ga4__combine_property_{{ INDEX }}` model then has to be combined within `base_ga4__events`.

This pull request only addresses *Bugfix* and *Fixed timeout in clone operation*.
I have removed the following from *Description & motivation*. The deleted sections are not addressed in this pull request.
If I come up with a better approach than #303, I'll create a new pull request.
### TODO
This pull request does not address these items, but they seem worthwhile for us to tackle.
#### Merge only the data for the specified property_ids into `base_ga4__events`
For example, merge only the data for property_ids `000000041` and `000000042` into `base_ga4__events`:

```shell
$ dbt run-operation insert_overwrite_properties_data --args '{ property_ids: [000000041, 000000042], start_date: 20210101, end_date: 20240324 }'
```

The idea is to execute `{{ ga4.combine_property_data(property_ids, start_date, end_date) }}` within `insert_overwrite_properties_data`, as sketched below.
We don't need to re-load data for property_ids that are already included in `base_ga4__events`, so I want to address this.
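A rough sketch of what that run-operation could look like (hypothetical: `insert_overwrite_properties_data` does not exist yet, and it assumes `combine_property_data` is extended to accept these arguments instead of reading them from vars):

```sql
{% macro insert_overwrite_properties_data(property_ids, start_date, end_date) %}
    {# Hypothetical wrapper: combine only the requested properties
       over the given date range. #}
    {% do ga4.combine_property_data(property_ids, start_date, end_date) %}
{% endmacro %}
```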
#### Execute `combine_property_data` in parallel to speed up cloning
This is the content I wanted to address in #303.
My dbt project has more than 40 property_ids.
I want to combine these property_ids in parallel to reduce latency.
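For illustration, one way to fan the work out from the shell (hypothetical: it reuses the since-removed `combine_specified_property_data` run-operation, and assumes the concurrent dbt invocations don't contend for the same tables or artifacts):

```shell
# Run up to 8 combine operations concurrently, one property_id per invocation.
printf '%s\n' 000000041 000000042 000000043 |
  xargs -P 8 -I {} dbt run-operation combine_specified_property_data \
    --args '{ property_id: {}, start_date: 20210101, end_date: 20240324 }'
```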
I'll create a new pull request if I come up with a better approach.
> Also, why do you nest the `combine_specified_property_data` macro in the `combine_property_data` macro?
> The outer macro basically does nothing other than a `for ... in` statement. Is this structure necessary for the parallelism to work?
I was thinking of cloning only the specified property_id, for example:

```shell
$ dbt run-operation combine_specified_property_data --args '{ property_id: 000000100, start_date: 20240101, end_date: 20240325 }'
$ dbt run-operation combine_specified_property_data --args '{ property_id: 000000101, start_date: 20240301, end_date: 20240325 }'
```
Parallelism does not require `combine_specified_property_data`, so in commit 5d829f1437ab4bede1bf7536e7032d94f0899452 I moved the processing into `combine_property_data`.
> Minimally, I'll want you to update the macro to use `adapter.dispatch` with a `default__combine_specified_property_data`. This allows people to override the macro, which I think is pretty important in this case.
I removed `combine_specified_property_data`, so I didn't address that aspect.
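For reference, if the macro had been kept, the requested pattern would look roughly like dbt's standard dispatch idiom (a sketch, not code from this PR):

```sql
{% macro combine_specified_property_data(property_id, start_date, end_date) %}
    {{ return(adapter.dispatch('combine_specified_property_data', 'ga4')(property_id, start_date, end_date)) }}
{% endmacro %}

{% macro default__combine_specified_property_data(property_id, start_date, end_date) %}
    {# Default implementation lives here; users can override it by defining
       their own version and adding their project to the dispatch search order. #}
{% endmacro %}
```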
My apologies. I understand.
I think I can probably reproduce this by copying the same source table over and over again and then test this fix. I've tested some multi-site issues that way before, and I know our code doesn't check for the same ID listed more than once in `property_ids`.
### Description & motivation
#### Bugfix
Fix #269.
This change greatly reduces the likelihood of an error when specifying a large number of property_ids in `ga4.combine_property_data()`. Merging this pull request will let a `dbt_project.yml` like the one below execute successfully.
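A minimal sketch of such a configuration (illustrative: the shape follows the package's multi-property setup, and the project and dataset names are placeholders):

```yaml
vars:
  ga4:
    source_project: "my-gcp-project"      # project containing the GA4 exports
    property_ids: [000000041, 000000042]  # the failing case lists many more ids
    start_date: "20210101"
    combined_dataset: "ga4_combined"
```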
#### Fixed timeout in clone operation
The following error will almost never occur, because I changed the macro to clone data separately for each property_id:
https://github.com/Velir/dbt-ga4/blame/6.0.1/README.md#L323-L332
### Checklist

- Ran `dbt test` and `python -m pytest .` to validate existing tests