Velir / dbt-ga4

dbt package for modeling raw data exported by Google Analytics 4. BigQuery support only.
MIT License

Fix error when setting a large number of properties #312

Closed · yamotech closed this 5 months ago

yamotech commented 6 months ago

Description & motivation

Bugfix

Fix #269.

This change greatly reduces the likelihood of hitting BigQuery's query-length limit when a large number of property_ids is specified for ga4.combine_property_data().
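Conceptually, the change issues the clone statements property by property, as separate queries, rather than rendering every property's statements into one enormous query. The sketch below is only an illustration of that idea, not the exact macro code merged in this PR; it reuses the variable names from the configuration shown next, and it only clones the most recent static_incremental_days shards to stay short.

{# Rough sketch only: run the clone statements for each property as a separate
   query so that no single query approaches BigQuery's 1024K character limit.
   The package's real macro covers more cases (e.g. full-refresh date ranges). #}
{% macro default__combine_property_data() %}
    {% do run_query("create schema if not exists `" ~ target.project ~ "." ~ var('combined_dataset') ~ "`") %}
    {% for property_id in var('property_ids') %}
        {% set clone_sql %}
            {% for i in range(var('static_incremental_days', 3)) %}
                {% set shard = (modules.datetime.date.today() - modules.datetime.timedelta(days=i)).strftime("%Y%m%d") %}
                create or replace table
                    `{{ target.project }}.{{ var('combined_dataset') }}.events_{{ shard }}{{ property_id }}`
                clone `{{ var('source_project') }}.analytics_{{ property_id }}.events_{{ shard }}`;
            {% endfor %}
        {% endset %}
        {# one query per property instead of one concatenated script for all properties #}
        {% do run_query(clone_sql) %}
    {% endfor %}
{% endmacro %}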

dbt_project.yml

vars:
    ga4:
        source_project: source-project-id
        property_ids: [
          000000001
          , 000000002
          , ...
          , 000000040
        ]
        start_date: 20210101
        static_incremental_days: 3
        combined_dataset: combined_dataset_name
$ dbt run -s base_ga4__events --full-refresh
06:51:19  Running with dbt=1.5.0
06:52:05  Found 999 models, 999 tests, 999 snapshots, 999 analyses, 999 macros, 999 operations, 999 seed files, 999 sources, 999 exposures, 999 metrics, 999 groups
06:52:06
06:52:14  Concurrency: 4 threads (target='dev')
06:52:14
06:52:14  1 of 1 START sql view model dataset_name.base_ga4__events ......... [RUN]
06:56:17  BigQuery adapter: https://console.cloud.google.com/bigquery?project=project-id&j=bq:asia-northeast1:????????-????-????-????-????????????&page=queryresults
06:56:17  1 of 1 ERROR creating sql view model dataset_name.base_ga4__events  [ERROR in 243.80s]
06:56:18
06:56:18  Finished running 1 view model in 0 hours 4 minutes and 11.62 seconds (251.62s).
06:56:22
06:56:22  Completed with 1 error and 0 warnings:
06:56:22
06:56:23  Database Error in model base_ga4__events (models/staging/base/base_ga4__events.sql)
06:56:23    The query is too large. The maximum standard SQL query length is 1024.00K characters, including comments and white space characters.
06:56:23
06:56:23  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

Merging this pull request allows the same command to run successfully:

$ dbt run -s base_ga4__events --full-refresh
HH:mm:ss  Running with dbt=1.5.0
HH:mm:ss  Found 999 models, 999 tests, 999 snapshots, 999 analyses, 999 macros, 999 operations, 999 seed files, 999 sources, 999 exposures, 999 metrics, 999 groups
HH:mm:ss
HH:mm:ss  Concurrency: 4 threads (target='dev')
HH:mm:ss
HH:mm:ss  1 of 1 START sql incremental model dataset_name.base_ga4__events ... [RUN]
HH:mm:ss  Cloned from `source-project-id.analytics_000000001.events_*` to `project-id.combined_dataset_name.events_YYYYMMDD000000001`.
HH:mm:ss  Cloned from `source-project-id.analytics_000000002.events_*` to `project-id.combined_dataset_name.events_YYYYMMDD000000002`.
....
HH:mm:ss  Cloned from `source-project-id.analytics_000000040.events_*` to `project-id.combined_dataset_name.events_YYYYMMDD000000040`.
HH:mm:ss  1 of 1 OK created sql incremental model dataset_name.base_ga4__events  [CREATE TABLE (? rows, ? processed) in ?]
HH:mm:ss
HH:mm:ss  Finished running 1 incremental model in ? (?).
HH:mm:ss
HH:mm:ss  Completed successfully
HH:mm:ss
HH:mm:ss  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Fixed timeout in clone operation

The error described in the README section linked below should now almost never occur, because the clone operations are issued separately for each property_id.

https://github.com/Velir/dbt-ga4/blame/6.0.1/README.md#L323-L332

Jobs that run a large number of clone operations are prone to timing out. As a result, it is recommended that you increase the query timeout if you need to backfill or full-refresh the table, when first setting up or when the base model gets modified. Otherwise, it is best to prevent the base model from rebuilding on full refreshes unless needed to minimize timeouts.

models:
  ga4:
    staging:
      base:
        base_ga4__events:
          +full_refresh: false


dgitis commented 6 months ago

@yamotech, thank you for this.

I've done a quick review of the code and it looks fine, but I want to test it on a project first. I have a client with six combined GA4 properties that should work for testing.

Pardon my ignorance, but how do I check that the cloning is actually running in parallel?

Ideally, I'd like to write a Python test to verify this behavior, but at a minimum I want to confirm that the code works.

Also, why do you nest the combine_specified_property_data macro in the combine_property_data macro?

The outer macro basically does nothing other than a for ... in statement. Is this structure necessary for the parallelism to work?

At a minimum, I'd like you to update the macro to use adapter.dispatch with a default__combine_specified_property_data. That lets people override the macro, which I think is pretty important in this case.
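For reference, the adapter.dispatch pattern being asked for generally looks like the following in dbt; this is a sketch with the body omitted, and the signature is only illustrative.

{# Thin wrapper that dispatches to an overridable implementation #}
{% macro combine_specified_property_data(property_id, start_date, end_date) %}
    {{ return(adapter.dispatch('combine_specified_property_data', 'ga4')(property_id, start_date, end_date)) }}
{% endmacro %}

{# Default implementation; the clone logic for a single property would live here #}
{% macro default__combine_specified_property_data(property_id, start_date, end_date) %}
    {# ... #}
{% endmacro %}

A consuming project can then override the behavior by defining its own default__combine_specified_property_data and listing itself ahead of ga4 in a dispatch config in dbt_project.yml.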

yamotech commented 6 months ago

I've done a quick review of the code and it looks fine, but I want to test it on a project first. I have a client with six combined GA4 properties that should work for testing.

Pardon my ignorance, but how do I check that the cloning is actually running in parallel?

Thanks for the review, @dgitis! I apologize for the lack of explanation, but this pull request does not address parallel cloning.

I worked on parallel cloning in #303, but closed it because the implementation was inadequate. You can confirm the parallel-cloning behavior by running $ dbt run -s +base_ga4__events with the changes from #303.

Reasons for poor implementation:

This pull request only covers the Bugfix and Fixed timeout in clone operation sections above.

I have removed the following from the Description & motivation section; the deleted items are not addressed in this pull request. If I come up with a better approach than #303, I'll open a new pull request.

### TODO

This pull request does not address these items, but they seem worthwhile to tackle.

#### Merge only the data for specified property_ids into base_ga4__events

For example: "merge only the data for property_ids `000000041` and `000000042` into `base_ga4__events`".

$ dbt run-operation insert_overwrite_properties_data --args '{ property_ids: [000000041, 000000042], start_date: 20210101, end_date: 20240324 }'

The idea is to execute `{{ ga4.combine_property_data(property_ids, start_date, end_date) }}` within `insert_overwrite_properties_data`. Data for property_ids that are already included in base_ga4__events would not need to be re-processed, which is why I want to address this.
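A minimal, hypothetical sketch of that wiring (nothing here exists in the package; it assumes combine_property_data were extended to accept these arguments and to run its own clone queries):

{# Hypothetical run-operation macro, not part of this PR #}
{% macro insert_overwrite_properties_data(property_ids, start_date, end_date) %}
    {% do ga4.combine_property_data(property_ids, start_date, end_date) %}
{% endmacro %}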

#### Execute combine_property_data in parallel to speed up cloning.

This is what I wanted to address in #303.

My dbt project has more than 40 property_ids.
I want to combine these property_ids in parallel to reduce latency.
I'll create a new pull request if I come up with a better approach.

Also, why do you nest the combine_specified_property_data macro in the combine_property_data macro?

The outer macro basically does nothing other than a for ... in statement. Is this structure necessary for the parallelism to work?

I was thinking of making it possible to clone only a specified property_id, for example:

$ dbt run-operation combine_specified_property_data --args '{ property_id: 000000100, start_date: 20240101, end_date: 20240325 }'
$ dbt run-operation combine_specified_property_data --args '{ property_id: 000000101, start_date: 20240301, end_date: 20240325 }'

combine_specified_property_data is not required for the parallelism to work, so in commit 5d829f1437ab4bede1bf7536e7032d94f0899452 I moved the processing into combine_property_data.

Minimally, I'll want you to update the macro to use adapter.dispatch with a default__combine_specified_property_data. This allows people to override the macro which I think is pretty important in this case.

Since I removed combine_specified_property_data, I didn't address that point.

dgitis commented 6 months ago

My apologies. I understand.

I think I can probably reproduce this by listing the same source table over and over again and testing the fix that way. I've tested some multi-site issues that way before, and I know our code doesn't check for the same ID being listed more than once in property_ids.