Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License
289 stars 128 forks source link

Fix error when setting a large number of properties #303

Closed yamotech closed 3 months ago

yamotech commented 4 months ago

Description & motivation

Motivation

Bugfix

Fix #269.

This change greatly reduces the likelihood of an error when specifying a large number of property_ids in ga4.combine_property_data().

Database Error in model base_ga4__events (models/staging/base/base_ga4__events.sql)
  The query is too large. The maximum standard SQL query length is 1024.00K characters, including comments and white space characters.

combine_property_data for parallel execution and speedup

My dbt project has more than 40 property_ids. I want to combine these property_ids in parallel to reduce latency.

Description

I have changed to create a model for each property_id and run a query to execute them in parallel. This may be a very specific way to implement this, so please let me consult with you to see if there are any problems with this method. Also, if you have a better way to do this, I would appreciate your advice.

I don't know of any other way to create the model other than creating a large number of int_ga4__combine_property_{{ INDEX }}. If you have any other good ideas, I would like to improve it.

Execution example

dbt_project.yml

vars:
    ga4:
        source_project: "source_project"
        property_ids: [000000000, 111111111, 222222222]
        start_date: "20240302"
        static_incremental_days: 3
        combined_dataset: "analytics_all"
$ dbt run -s +base_ga4__events
11:38:30  Running with dbt=1.5.0
11:39:27  Found 999 models, 9999 tests, 0 snapshots, 0 analyses, 999 macros, 0 operations, 999 seed files, 9999 sources, 99 exposures, 99 metrics, 0 groups
11:39:29
11:39:38  Concurrency: 4 threads (target='dev')
11:39:38
11:39:38  2 of 4 START sql execution model dataset_name.int_ga4__combine_property_000000000  [RUN]
11:39:38  1 of 4 START sql execution model dataset_name.int_ga4__combine_property_111111111  [RUN]
11:39:38  3 of 4 START sql execution model dataset_name.int_ga4__combine_property_222222222  [RUN]
11:39:47  2 of 4 OK created sql execution model dataset_name.int_ga4__combine_property_000000000  [SCRIPT (0 processed) in 9.30s]
11:39:48  3 of 4 OK created sql execution model dataset_name.int_ga4__combine_property_222222222  [SCRIPT (0 processed) in 9.82s]
11:39:48  1 of 4 OK created sql execution model dataset_name.int_ga4__combine_property_111111111  [SCRIPT (0 processed) in 9.83s]
11:39:48  4 of 4 START sql incremental model dataset_name.base_ga4__events ... [RUN]
11:40:19  4 of 4 OK created sql incremental model dataset_name.base_ga4__events  [MERGE (1.7m rows, 1.4 GiB processed) in 30.97s]
11:40:19
11:40:19  Finished running 3 execution models, 1 incremental model in 0 hours 0 minutes and 50.51 seconds (50.51s).
11:40:24
11:40:24  Completed successfully
11:40:24
11:40:24  Done. PASS=4 WARN=0 ERROR=0 SKIP=0 TOTAL=4

Checklist

yamotech commented 3 months ago

I've addressed everything except for the parallel execution of combine_property_data in #312. I'll create a new pull request if I come up with a better approach.