Velir / dbt-ga4

dbt package for modeling raw data exported by Google Analytics 4. BigQuery support only.
MIT License

Multi-Property Support cannot be used (looking forward to an answer) #286

Closed echolun closed 7 months ago

echolun commented 7 months ago

Hi, I have a problem with Multi-Property Support. As a novice dbt user, I found that it does not take effect when I enable GA4 Multi-Property Support. I have consulted the documentation and issues in this repository, but could not find an answer. I want to produce statistics on product registrations, but the GA data for the web side and the app side are scattered across two datasets: the app-side GA dataset is analytics_7866, and the web-side dataset is analytics_9513. Here is my profiles configuration:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: notta-data-analytics
      dataset: analytics_9513
      threads: 4
      keyfile: test-data-analytics-bef750568504.json

and this is my dbt_project configuration:

name: "jaffle_shop"

config-version: 2
version: "0.1"

profile: "jaffle_shop"

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
analysis-paths: ["analysis"]
macro-paths: ["macros"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_modules"
  - "logs"

require-dbt-version: [">=1.0.0", "<2.0.0"]

models:
  jaffle_shop:
    materialized: table
    staging:
      materialized: view
  ga4:
    staging:
      recommended_events:
        stg_ga4__event_sign_up:
          +enabled: true
      base:
        base_ga4__events:
          +full_refresh: false

vars:
  ga4:
    project: "notta-data-analytics"
    property_ids: [9513, 7866]
    dataset: "analytics_9513"
    start_date: "20221010"
    static_incremental_days: 2

I'm sure dbt Core, dbt-bigquery, and the GA4 package are installed, and I have access to both datasets, but when I run dbt there are no errors in the log. I found that the models defined by the GA4 package were created under analytics_9513, but when I searched base_ga4__events in analytics_9513 for an event name specific to analytics_7866, I could not find any records; at this point base_ga4__events only contained the events from analytics_9513. My understanding of Multi-Property Support is as follows:

image

I am not sure if my understanding is wrong. I am hoping Multi-Property Support can aggregate the data from multiple datasets into a single base_ga4__events, which would help me with overall product registration statistics. Unfortunately, I can't get that to work right now. I look forward to your reply, thanks. Also, this is the log from dbt run; you can see there really are no errors:

dbt run
11:44:20  Running with dbt=1.6.6
11:44:21  Registered adapter: bigquery=1.6.7
11:44:21  Unable to do partial parsing because a project config has changed
11:44:23  Found 32 models, 1 seed, 21 tests, 2 sources, 0 exposures, 0 metrics, 524 macros, 0 groups, 0 semantic models
11:44:23  
11:44:24  Concurrency: 4 threads (target='dev')
11:44:24  
11:44:24  1 of 32 START sql incremental model analytics_9513.base_ga4__events ............ [RUN]
11:44:32  1 of 32 OK created sql incremental model analytics_9513.base_ga4__events ....... [MERGE (0.0 rows, 0 processed) in 8.02s]
11:44:32  2 of 32 START sql view model analytics_9513.stg_ga4__events .................... [RUN]
11:44:33  2 of 32 OK created sql view model analytics_9513.stg_ga4__events ............... [CREATE VIEW (0 processed) in 1.13s]
11:44:33  3 of 32 START sql incremental model analytics_9513.fct_ga4__sessions_daily ..... [RUN]
11:44:33  4 of 32 START sql table model analytics_9513.stg_ga4__client_key_first_last_events  [RUN]
11:44:33  5 of 32 START sql view model analytics_9513.stg_ga4__event_click ............... [RUN]
11:44:33  6 of 32 START sql view model analytics_9513.stg_ga4__event_file_download ....... [RUN]
11:44:34  5 of 32 OK created sql view model analytics_9513.stg_ga4__event_click .......... [CREATE VIEW (0 processed) in 1.26s]
11:44:34  7 of 32 START sql view model analytics_9513.stg_ga4__event_first_visit ......... [RUN]
11:44:34  6 of 32 OK created sql view model analytics_9513.stg_ga4__event_file_download .. [CREATE VIEW (0 processed) in 1.46s]
11:44:34  8 of 32 START sql view model analytics_9513.stg_ga4__event_items ............... [RUN]
11:44:35  7 of 32 OK created sql view model analytics_9513.stg_ga4__event_first_visit .... [CREATE VIEW (0 processed) in 1.18s]
11:44:35  9 of 32 START sql view model analytics_9513.stg_ga4__event_page_view ........... [RUN]
11:44:36  8 of 32 OK created sql view model analytics_9513.stg_ga4__event_items .......... [CREATE VIEW (0 processed) in 1.14s]
11:44:36  10 of 32 START sql view model analytics_9513.stg_ga4__event_scroll ............. [RUN]
11:44:36  9 of 32 OK created sql view model analytics_9513.stg_ga4__event_page_view ...... [CREATE VIEW (0 processed) in 1.11s]
11:44:36  11 of 32 START sql view model analytics_9513.stg_ga4__event_session_start ...... [RUN]
11:44:37  10 of 32 OK created sql view model analytics_9513.stg_ga4__event_scroll ........ [CREATE VIEW (0 processed) in 1.19s]
11:44:37  12 of 32 START sql view model analytics_9513.stg_ga4__event_sign_up ............ [RUN]
11:44:37  3 of 32 OK created sql incremental model analytics_9513.fct_ga4__sessions_daily  [MERGE (0.0 rows, 0 processed) in 3.86s]
11:44:37  13 of 32 START sql view model analytics_9513.stg_ga4__event_to_query_string_params  [RUN]
11:44:37  4 of 32 OK created sql table model analytics_9513.stg_ga4__client_key_first_last_events  [CREATE TABLE (27.5k rows, 18.8 MiB processed) in 4.54s]
11:44:37  14 of 32 START sql view model analytics_9513.stg_ga4__event_user_engagement .... [RUN]
11:44:38  11 of 32 OK created sql view model analytics_9513.stg_ga4__event_session_start . [CREATE VIEW (0 processed) in 1.18s]
11:44:38  15 of 32 START sql view model analytics_9513.stg_ga4__event_video_complete ..... [RUN]
11:44:38  13 of 32 OK created sql view model analytics_9513.stg_ga4__event_to_query_string_params  [CREATE VIEW (0 processed) in 1.15s]
11:44:38  16 of 32 START sql view model analytics_9513.stg_ga4__event_video_start ........ [RUN]
11:44:38  12 of 32 OK created sql view model analytics_9513.stg_ga4__event_sign_up ....... [CREATE VIEW (0 processed) in 1.22s]
11:44:38  17 of 32 START sql view model analytics_9513.stg_ga4__event_view_search_results  [RUN]
11:44:39  14 of 32 OK created sql view model analytics_9513.stg_ga4__event_user_engagement  [CREATE VIEW (0 processed) in 1.13s]
11:44:39  18 of 32 START sql view model analytics_9513.stg_ga4__page_engaged_time ........ [RUN]
11:44:39  15 of 32 OK created sql view model analytics_9513.stg_ga4__event_video_complete  [CREATE VIEW (0 processed) in 1.14s]
11:44:39  19 of 32 START sql view model analytics_9513.stg_ga4__sessions_first_last_pageviews  [RUN]
11:44:39  16 of 32 OK created sql view model analytics_9513.stg_ga4__event_video_start ... [CREATE VIEW (0 processed) in 1.16s]
11:44:39  20 of 32 START sql view model analytics_9513.stg_ga4__sessions_traffic_sources . [RUN]
11:44:39  17 of 32 OK created sql view model analytics_9513.stg_ga4__event_view_search_results  [CREATE VIEW (0 processed) in 1.20s]
11:44:39  21 of 32 START sql incremental model analytics_9513.stg_ga4__sessions_traffic_sources_daily  [RUN]
11:44:40  18 of 32 OK created sql view model analytics_9513.stg_ga4__page_engaged_time ... [CREATE VIEW (0 processed) in 1.14s]
11:44:40  22 of 32 START sql view model analytics_9513.stg_ga4__user_id_mapping .......... [RUN]
11:44:40  19 of 32 OK created sql view model analytics_9513.stg_ga4__sessions_first_last_pageviews  [CREATE VIEW (0 processed) in 1.04s]
11:44:40  23 of 32 START sql view model analytics_9513.stg_ga4_app_sign_app .............. [RUN]
11:44:40  20 of 32 OK created sql view model analytics_9513.stg_ga4__sessions_traffic_sources  [CREATE VIEW (0 processed) in 1.11s]
11:44:40  24 of 32 START sql table model analytics_9513.stg_ga4__client_key_first_last_pageviews  [RUN]
11:44:41  22 of 32 OK created sql view model analytics_9513.stg_ga4__user_id_mapping ..... [CREATE VIEW (0 processed) in 1.13s]
11:44:41  25 of 32 START sql table model analytics_9513.fct_ga4__sessions ................ [RUN]
11:44:42  23 of 32 OK created sql view model analytics_9513.stg_ga4_app_sign_app ......... [CREATE VIEW (0 processed) in 1.15s]
11:44:42  26 of 32 START sql incremental model analytics_9513.fct_ga4__pages ............. [RUN]
11:44:43  21 of 32 OK created sql incremental model analytics_9513.stg_ga4__sessions_traffic_sources_daily  [MERGE (0.0 rows, 29.9 KiB processed) in 3.83s]
11:44:43  27 of 32 START sql table model analytics_9513.dim_ga4__sessions ................ [RUN]
11:44:44  25 of 32 OK created sql table model analytics_9513.fct_ga4__sessions ........... [CREATE TABLE (29.2k rows, 3.0 MiB processed) in 3.31s]
11:44:44  28 of 32 START sql incremental model analytics_9513.stg_ga4__sessions_traffic_sources_last_non_direct_daily  [RUN]
11:44:45  24 of 32 OK created sql table model analytics_9513.stg_ga4__client_key_first_last_pageviews  [CREATE TABLE (7.8k rows, 16.2 MiB processed) in 4.35s]
11:44:45  29 of 32 START sql table model analytics_9513.fct_ga4__client_keys ............. [RUN]
11:44:46  26 of 32 OK created sql incremental model analytics_9513.fct_ga4__pages ........ [MERGE (0.0 rows, 8.0 MiB processed) in 4.20s]
11:44:46  30 of 32 START sql table model analytics_9513.dim_ga4__client_keys ............. [RUN]
11:44:48  27 of 32 OK created sql table model analytics_9513.dim_ga4__sessions ........... [CREATE TABLE (23.2k rows, 10.6 MiB processed) in 4.59s]
11:44:48  28 of 32 OK created sql incremental model analytics_9513.stg_ga4__sessions_traffic_sources_last_non_direct_daily  [MERGE (0.0 rows, 3.4 MiB processed) in 3.39s]
11:44:48  31 of 32 START sql incremental model analytics_9513.dim_ga4__sessions_daily .... [RUN]
11:44:48  29 of 32 OK created sql table model analytics_9513.fct_ga4__client_keys ........ [CREATE TABLE (27.5k rows, 2.8 MiB processed) in 3.64s]
11:44:48  32 of 32 START sql table model analytics_9513.fct_ga4__user_ids ................ [RUN]
11:44:50  30 of 32 OK created sql table model analytics_9513.dim_ga4__client_keys ........ [CREATE TABLE (27.5k rows, 14.0 MiB processed) in 3.69s]
11:44:51  31 of 32 OK created sql incremental model analytics_9513.dim_ga4__sessions_daily  [MERGE (0.0 rows, 0 processed) in 3.76s]
11:44:52  32 of 32 OK created sql table model analytics_9513.fct_ga4__user_ids ........... [CREATE TABLE (27.5k rows, 3.4 MiB processed) in 3.89s]
11:44:52  
11:44:52  Finished running 6 incremental models, 19 view models, 7 table models in 0 hours 0 minutes and 29.50 seconds (29.50s).
11:44:52  
11:44:52  Completed successfully
11:44:52  
11:44:52  Done. PASS=32 WARN=0 ERROR=0 SKIP=0 TOTAL=32
adamribaudo-velir commented 7 months ago

Thanks for the thorough issue. I'm not certain what is going on, but I have a few recommendations:

I'm not 100% sure those changes will solve this, but it's worth trying them first.

echolun commented 7 months ago

Thank you very much for your reply. First of all, I understand that the dataset where GA4 creates its default models should not be the same dataset as the source data, so I will create a separate dataset, such as dev_ga4_echolun, and modify my profiles configuration to:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: notta-data-analytics
      dataset: dev_ga4_echolun
      threads: 4
      keyfile: test-data-analytics-bef750568504.json

On your second point, should I create another dataset for Multi-Property Support to hold the merged data, such as analytics_combined? If so, I will modify my dbt_project configuration to:

vars:
  ga4:
    project: "notta-data-analytics"
    property_ids: [9513, 7866]
    dataset: "analytics_combined"
    start_date: "20221010"
    static_incremental_days: 2

I actually have another question. I found that both profiles and dbt_project have a dataset in their GA4-related configuration, and so far I can't tell the difference between them. I found that the profiles dataset determines where GA4's default models are created, but I don't understand the role of the dataset under the ga4 vars, and supposing I create the two datasets above, I still can't see how dev_ga4_echolun and analytics_combined will work together. Please understand that I may have too many questions (I tried to use GPT to solve them, but failed). Of course, I will try your suggestion first. If you have time to read this, I look forward to your reply.

dgitis commented 7 months ago

On your second point, should I create another dataset for Multi-Property Support to hold the merged data? such as analytics_combined?

This is correct.

I actually have another question. I found that both profile and dbt_project have a dataset in their GA4 configuration.

The dbt_project.yml file contains project configuration while the profiles.yml file contains warehouse connection information.

In a single-property installation, the dataset in the dbt_project.yml file is the dataset into which GA4 exports your data. In a multi-property installation, it is the dataset that we clone all of your properties' tables into. The base_ga4__events model has a pre-hook configured which calls a macro that clones all of the source properties into that single dataset, which lets us query all properties without needing to know how many exist.
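As a rough sketch (simplified from the package source, so treat the exact Jinja as an approximation rather than the literal code), the wiring inside base_ga4__events looks something like this:

```sql
-- models/staging/base/base_ga4__events.sql (simplified sketch)
{% if var('property_ids', false) %}
    {# multi-property: clone each property's events_* tables into var('dataset') before building #}
    {{ config(pre_hook="{{ ga4.combine_property_data() }}") }}
{% endif %}

-- the model then reads from the events_* wildcard inside var('dataset')
select * from `{{ var('project') }}.{{ var('dataset') }}.events_*`
```

So in a multi-property setup, var('dataset') must point at the combined dataset, and the pre-hook is what populates it.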

Clone operations are free but they take a lot of time.

Your problem could be because you may be trying to process more than one year's worth of data at a time (start_date: "20221010"). I would expect a timeout error rather than no error, so I think Adam's suggestion will fix the problem you are looking at, but you'll hit a timeout once you fix that error.

You may need to do a manual clone in BigQuery to get 1-year of data from two sites.

I think the clone takes about 5 minutes per table, so you may be able to get through it by increasing your job_execution_timeout_seconds setting in your profiles.yml file, but that's 5 minutes x 60 seconds per minute x 390 days (approx.) x 2 sites, which gives a timeout of 234,000 seconds, or 65 hours.

That's why we recommend setting +full_refresh: false on your base_ga4__events model in multi-property installations.

I think you'll need to manually clone. If you do end up going that route, please report here the steps that you took. It would be nice to document the commands needed for that so that it's easier for others to follow.

echolun commented 7 months ago

Hi, I took your advice and have now created two new datasets in BigQuery: one named dev_ga4_test for profiles, and one named analytics_combined for dbt_project. To account for the timeout problem you mentioned, I changed the start date of the data I need to aggregate and added the job_execution_timeout_seconds configuration. My profiles are now:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: notta-data-analytics
      dataset: dev_ga4_test
      threads: 4
      keyfile: test-data-analytics-bef750568504.json
      job_execution_timeout_seconds: 3600

And modify my dbt_project configuration to:

models:
  jaffle_shop:
    materialized: table
    staging:
      materialized: view
  ga4:
    staging:
      recommended_events:
        stg_ga4__event_sign_up:
          +enabled: true
      base:
        base_ga4__events:
          +full_refresh: false

vars:
  ga4:
    project: "notta-data-analytics"
    property_ids: [9513, 7866]
    dataset: "analytics_combined"
    start_date: "20231102"
    static_incremental_days: 2

After I executed dbt seed, I can see there is an extra mapping table under dev_ga4_test (see screenshot). Now when I execute dbt run, I get the following error:

dbt run
04:22:52  Running with dbt=1.6.6
04:22:53  Registered adapter: bigquery=1.6.7
04:22:53  Found 32 models, 1 seed, 21 tests, 2 sources, 0 exposures, 0 metrics, 524 macros, 0 groups, 0 semantic models
04:22:53  
04:22:54  Concurrency: 4 threads (target='dev')
04:22:54  
04:22:54  1 of 32 START sql incremental model dev_ga4_test.base_ga4__events .............. [RUN]
04:23:00  BigQuery adapter: https://console.cloud.google.com/bigquery?project=notta-data-analytics&j=bq:US:3a8-7f73-4bf8-9301-3b3c480&page=queryresults
04:23:00  1 of 32 ERROR creating sql incremental model dev_ga4_test.base_ga4__events ..... [ERROR in 5.94s]
04:23:00  2 of 32 SKIP relation dev_ga4_test.stg_ga4__events ............................. [SKIP]
04:23:00  3 of 32 SKIP relation dev_ga4_test.fct_ga4__sessions_daily ..................... [SKIP]
04:23:00  4 of 32 SKIP relation dev_ga4_test.stg_ga4__client_key_first_last_events ....... [SKIP]
04:23:00  5 of 32 SKIP relation dev_ga4_test.stg_ga4__event_click ........................ [SKIP]
04:23:00  7 of 32 SKIP relation dev_ga4_test.stg_ga4__event_first_visit .................. [SKIP]
04:23:00  6 of 32 SKIP relation dev_ga4_test.stg_ga4__event_file_download ................ [SKIP]
04:23:00  8 of 32 SKIP relation dev_ga4_test.stg_ga4__event_items ........................ [SKIP]
04:23:00  9 of 32 SKIP relation dev_ga4_test.stg_ga4__event_page_view .................... [SKIP]
04:23:00  10 of 32 SKIP relation dev_ga4_test.stg_ga4__event_scroll ...................... [SKIP]
04:23:00  11 of 32 SKIP relation dev_ga4_test.stg_ga4__event_session_start ............... [SKIP]
04:23:00  12 of 32 SKIP relation dev_ga4_test.stg_ga4__event_sign_up ..................... [SKIP]
04:23:00  13 of 32 SKIP relation dev_ga4_test.stg_ga4__event_to_query_string_params ...... [SKIP]
04:23:00  14 of 32 SKIP relation dev_ga4_test.stg_ga4__event_user_engagement ............. [SKIP]
04:23:00  15 of 32 SKIP relation dev_ga4_test.stg_ga4__event_video_complete .............. [SKIP]
04:23:00  16 of 32 SKIP relation dev_ga4_test.stg_ga4__event_video_start ................. [SKIP]
04:23:00  17 of 32 SKIP relation dev_ga4_test.stg_ga4__event_view_search_results ......... [SKIP]
04:23:00  18 of 32 SKIP relation dev_ga4_test.stg_ga4__page_engaged_time ................. [SKIP]
04:23:00  19 of 32 SKIP relation dev_ga4_test.stg_ga4__sessions_first_last_pageviews ..... [SKIP]
04:23:00  20 of 32 SKIP relation dev_ga4_test.stg_ga4__sessions_traffic_sources .......... [SKIP]
04:23:00  21 of 32 SKIP relation dev_ga4_test.stg_ga4__sessions_traffic_sources_daily .... [SKIP]
04:23:00  22 of 32 SKIP relation dev_ga4_test.stg_ga4__user_id_mapping ................... [SKIP]
04:23:00  23 of 32 SKIP relation dev_ga4_test.stg_ga4_app_sign_app ....................... [SKIP]
04:23:00  24 of 32 SKIP relation dev_ga4_test.fct_ga4__sessions .......................... [SKIP]
04:23:00  25 of 32 SKIP relation dev_ga4_test.stg_ga4__client_key_first_last_pageviews ... [SKIP]
04:23:00  26 of 32 SKIP relation dev_ga4_test.fct_ga4__pages ............................. [SKIP]
04:23:00  27 of 32 SKIP relation dev_ga4_test.dim_ga4__sessions .......................... [SKIP]
04:23:00  28 of 32 SKIP relation dev_ga4_test.stg_ga4__sessions_traffic_sources_last_non_direct_daily  [SKIP]
04:23:00  29 of 32 SKIP relation dev_ga4_test.fct_ga4__client_keys ....................... [SKIP]
04:23:00  30 of 32 SKIP relation dev_ga4_test.dim_ga4__client_keys ....................... [SKIP]
04:23:00  31 of 32 SKIP relation dev_ga4_test.dim_ga4__sessions_daily .................... [SKIP]
04:23:00  32 of 32 SKIP relation dev_ga4_test.fct_ga4__user_ids .......................... [SKIP]
04:23:00  
04:23:00  Finished running 6 incremental models, 19 view models, 7 table models in 0 hours 0 minutes and 7.02 seconds (7.02s).
04:23:00  
04:23:00  Completed with 1 error and 0 warnings:
04:23:00  
04:23:00    Database Error in model base_ga4__events (models/staging/base/base_ga4__events.sql)
  notta-data-analytics:analytics_combined.events_* does not match any table.
  compiled Code at target/run/ga4/models/staging/base/base_ga4__events.sql
04:23:00  
04:23:00  Done. PASS=0 WARN=0 ERROR=1 SKIP=31 TOTAL=32

The error shows notta-data-analytics:analytics_combined.events_* does not match any table. analytics_combined is my newly created dataset, and there really are no tables under it. Won't the GA4 package aggregate the data from analytics_7866 and analytics_9513 and copy it into analytics_combined for me? Or should I have manually merged and copied the data from analytics_7866 and analytics_9513 into analytics_combined in the first place? My previous understanding was that the GA4 package would do this aggregation for me, which is also the main role of Multi-Property Support. Looking forward to your reply, thanks.

echolun commented 7 months ago

In addition, I did a second test. Considering the possible timeout caused by too much data, I went back to my original configuration, that is, I set the dataset in the ga4 vars back to my GA source dataset (although this is not recommended), and then adjusted start_date. The current dbt_project configuration is:

models:
  jaffle_shop:
    materialized: table
    staging:
      materialized: view
  ga4:
    staging:
      recommended_events:
        stg_ga4__event_sign_up:
          +enabled: true
      base:
        base_ga4__events:
          +full_refresh: false

vars:
  ga4:
    project: "notta-data-analytics"
    property_ids: [9513, 7866]
    dataset: "analytics_9513"
    start_date: "20231102"
    static_incremental_days: 2

and my profiles configuration is:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: notta-data-analytics
      dataset: analytics_9513
      threads: 4
      keyfile: notta-data-analytics-bef750568504.json
      job_execution_timeout_seconds: 3600

As you can see, I increased the timeout configuration and moved start_date up to November of this year. When I executed dbt run, the result was the same as before, without any errors, and base_ga4__events under analytics_9513 still contains no tracking events from the app. This is the test result I have obtained so far from following the advice.

adamribaudo-velir commented 7 months ago

Thanks. I believe you would see an error regarding timeouts if this were a timeout issue.

Your source data is not being copied into the analytics_combined schema which indicates there's an issue running the combine_property_data macro here: https://github.com/Velir/dbt-ga4/blob/main/macros/combine_property_data.sql

The main component of that macro is this:

CREATE OR REPLACE TABLE `{{var('project')}}.{{var('dataset')}}.events_{{relation_suffix}}{{property_id}}` CLONE `{{var('project')}}.analytics_{{property_id}}.events_{{relation_suffix}}`;

It attempts to create new tables under analytics_combined by cloning tables from each source dataset (as specified by the property_ids variable in dbt).
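For concreteness, with the vars from your current config (property_ids: [9513, 7866], dataset: "analytics_combined"), one iteration of that loop should render roughly as follows — the destination table name is inferred from the template above, so treat it as illustrative:

```sql
-- rendered CLONE for property 9513, daily export table 20231102 (illustrative)
CREATE OR REPLACE TABLE `notta-data-analytics.analytics_combined.events_202311029513`
CLONE `notta-data-analytics.analytics_9513.events_20231102`;
```

The property ID is appended to the table suffix so that the daily tables from different properties don't collide inside the combined dataset.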

I'm not sure what's going on there. One thing that is suspicious to me is that your property IDs are so short. Every property ID I've come across is 9 digits, but yours are 4. Can you share a screenshot of the datasets and tables you're copying from? I just want to see if I spot anything odd in the formatting of the dataset and table names.

echolun commented 7 months ago

Of course. In fact, the metadata is in the red box. Since I am not yet very familiar with dbt, I took the last four digits of the original dataset IDs to define two datasets specifically for test learning.

image

As for the timeout, I still haven't seen a similar error so far. After I took your suggestion to create the two datasets, I encountered another error, which I described in my answer above. Regarding the ID length issue, I am not sure if there is a length limit in the underlying GA4 package implementation?

adamribaudo-velir commented 7 months ago

I'm not following why you created 2 new data sets but can you point the dbt-ga4 package to the original analytics_XXXXXXXXX datasets and try again?

echolun commented 7 months ago

Hello, I changed the dataset IDs to 9 digits as you suggested. This is my dbt_project config:

vars:
  ga4:
    project: "notta-data-analytics"
    property_ids: [344009513, 235597866]
    dataset: "analytics_combined"
    start_date: "20231102"
    static_incremental_days: 2

and this is my profiles config:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: notta-data-analytics
      dataset: dev_ga4_test
      threads: 4
      keyfile: test-data-analytics-bef750568504.json
      job_execution_timeout_seconds: 3600

I ran a test and hit the same error I asked about before. The error message is as follows:

 Database Error in model base_ga4__events (models/staging/base/base_ga4__events.sql)
  notta-data-analytics:analytics_combined.events_* does not match any table.
  compiled Code at target/run/ga4/models/staging/base/base_ga4__events.sql

I don't think the ID length is the problem, because the dataset name in combine_property_data is just a simple string concatenation:

{% for property_id in var('property_ids') %}
        {%- set schema_name = "analytics_" + property_id|string -%}

And I can confirm that these datasets are real and that I have access to them.

I gradually came to understand the error I raised before. I don't think combine_property_data is performing as expected. In the original question I set the dataset to analytics_9513. But analytics_9513 was itself the data source, and it continued to feed the subsequent GA4 models, so there were no errors; since no dataset merge had been done, I simply couldn't find any events from outside that dataset.

But now I set the dataset to analytics_combined, and since this is a new, clean dataset, the combine_property_data execution failed, leaving no tables under analytics_combined. This also causes all subsequent GA4 models to fail, resulting in notta-data-analytics:analytics_combined.events_* does not match any table. I don't know if my guess is correct, but I think combine_property_data should do the data aggregation before base_ga4__events is executed. I can't explain why combine_property_data is not executed at the moment; can I execute it manually?

Regarding the error mentioned above, I found that another issue hit a similar error, but I am not clear on how it was solved.

Looking forward to your reply, thanks.

echolun commented 7 months ago

I've roughly analyzed the logs. This is my verification approach: I added log printing in get_relations_by_pattern, with the following code:

    {%- if table_list and table_list['table'] -%}
        {%- set tbl_relations = [] -%}
        {%- for row in table_list['table'] -%}
            {%- set tbl_relation = api.Relation.create(
                database=database,
                schema=row.table_schema,
                identifier=row.table_name,
                type=row.table_type
            ) -%}
            {%- do tbl_relations.append(tbl_relation) -%}
        {%- endfor -%}
        {{- log("Table relations: " ~ tbl_relations) -}}
        {{ return(tbl_relations) }}
    {%- else -%}
        {{ return([]) }}
    {%- endif -%}

I log tbl_relations at the end of the loop; that is the first step. Then, this is my current dbt_project configuration:

vars:
  ga4:
    project: "demo-data-analytics"
    dataset: "analytics_combined"
    property_ids: [355009513, 244597866]
    static_incremental_days: 2
    start_date: "20231112"
    frequency: "daily"

Notice that my start_date is set to 20231112, and it's only 20231114 today. I then ran dbt run --debug; here are screenshots of the parts of the log I think are important:

image

In the figure above, you can see the data collection start time I set and the tbl_relations log I added, which is a huge list that basically contains several years of tables. Moreover, the full year of the other dataset is also output in the log below. I am confused: when my start_date is set to 20231112, why does it still find all the tables in both datasets?

image

Then the log reaches combine_property_data, and we can see that it is going to create new tables and merge them. The merge here really only handles events_20231112 and events_20231113, which can't be a huge amount of data. But the creation must have failed, because at the end of the run I couldn't find any new tables under analytics_combined.

image

image

Finally, GA4 would create its default tables, such as base_ga4__events, from the merged dataset, but naturally that execution failed, because the merge failed and analytics_combined was an empty dataset.

This is my log analysis of the entire dbt run. I don't see any other useful errors in the log; I only know that cloning into the dataset fails, even though it's not a very large amount of data.

I really need your help. At the moment I have no way of knowing why multiple datasets fail. I hope this log information helps you analyze the problem; if you need more information, let me know and I will provide it.

adamribaudo-velir commented 7 months ago

I'm really sorry this isn't working for you, but it's hard to pinpoint the issue. All of the code handling the partition copy into the "combined" dataset is contained here: https://github.com/Velir/dbt-ga4/blob/main/macros/combine_property_data.sql

It's only 38 lines of code, so I recommend trying to run each step manually and determining which step is failing.
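To isolate the failure, each step can be run by hand in the BigQuery console — for example, one clone followed by a wildcard check (names taken from earlier in this thread, so adjust the IDs and dates to your project):

```sql
-- step 1: run one clone by hand for a single property / single daily export
CREATE OR REPLACE TABLE `notta-data-analytics.analytics_combined.events_20231112344009513`
CLONE `notta-data-analytics.analytics_344009513.events_20231112`;

-- step 2: confirm the events_* wildcard in the combined dataset now matches something
SELECT COUNT(*) AS row_count
FROM `notta-data-analytics.analytics_combined.events_*`;
```

If the clone works in the console but not via dbt, that points at how dbt is invoking the macro (or at the credentials it runs under) rather than at BigQuery permissions.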

adamribaudo-velir commented 7 months ago

What happens if you just run this command in the BigQuery SQL editor?

image

echolun commented 7 months ago

Hi, after reading the combine_property_data.sql merge process, I found that the GA4 package only does a simple table clone; I had always assumed it would also do part of the pre-processing during the merge, which is why I wanted to do the merge through the package. But now I have decided to merge the tables manually (iteration time is tight), because even if I solve the merge problem, I may still be unable to get around the timeout caused by the volume of data. I am no longer troubled by this problem; thank you very much for your answers.

But now I have another problem. I find that the target dataset a GA4 model is written to is the +schema attribute in dbt_project.yml combined with the dataset from profiles. For example:

This is my profiles:

bigquery_profile:
  target: prod
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: notta-data-analytics
      dataset: dev
      ....
    prod:
      type: bigquery
      method: service-account
      project: notta-data-analytics
      dataset: dbt_models
      ....

This is how my dbt_project.yml is configured for different directories:

  ga4:
    staging: 
      +schema: "{{ 'mc_data_statistics' if target.name == 'dev' else 'ga4' }}"
      base:
        base_ga4__events:
          +full_refresh: false
      recommended_events:
        stg_ga4__event_sign_up:
          +enabled: true

When I run dbt run --target prod, the models are written to dbt_models combined with ga4, i.e. the dbt_models_ga4 dataset.

When I run dbt run --target dev, the target dataset the models are written to is dev_mc_data_statistics.
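That naming comes from dbt's default generate_schema_name macro, which (roughly, paraphrased from dbt's documentation) concatenates the target schema from profiles with the custom +schema from dbt_project.yml:

```sql
-- dbt's built-in generate_schema_name (paraphrased sketch)
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}                              {# e.g. dev #}
    {%- else -%}
        {{ target.schema }}_{{ custom_schema_name | trim }}  {# e.g. dev_mc_data_statistics #}
    {%- endif -%}
{%- endmacro %}
```

So any resource without a +schema config falls back to the plain target schema, which explains the behavior below.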

But I found that two models in the GA4 package, stg_ga4__sessions_traffic_sources and stg_ga4__sessions_traffic_sources_daily, rely on ga4_source_categories. However, the dataset they look in is not concatenated as above; they look directly in the dev or dbt_models dataset set in profiles. The schema is not dynamically concatenated, so the seed table is not found and an error results.

For now I have to modify the GA4 source code to manually specify the ga4_source_categories location. I am not sure if there is a better way to solve this; at present I can only import GA4 as a local package and modify the source, which I think is not good practice and will also cause me trouble when upgrading the package later.

Before modification:

image

After modification:

image image
adamribaudo-velir commented 7 months ago

ga4_source_categories is a seed file, so I'm pretty sure you just need to set the target schema for it in dbt_project.yml, just like you did for the staging models.

you have:

ga4:
    staging: 
      +schema: "{{ 'mc_data_statistics' if target.name == 'dev' else 'ga4' }}"

But the path to the seed file is:

ga4:
   seeds:

(I think :) )

echolun commented 7 months ago

Oh my God, you're right. I was so focused on the models that I neglected that the seed should also be configured for dev or prod. I tried it, and the correct configuration is that ga4 should be under seeds; this is my current configuration, and it really solved my problem, thank you very much!

image
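For anyone landing here later, the working configuration in the screenshot is along these lines (the schema names come from my project, so adjust them to yours):

```yaml
seeds:
  ga4:
    +schema: "{{ 'mc_data_statistics' if target.name == 'dev' else 'ga4' }}"
```

This mirrors the +schema logic used for the staging models above, so the seed lands in the same dataset the models read from.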

I may have one last question about GA4 😂. At the moment all I really need is the GA event for login/registration, but I find that the package presets a lot of models, and these are executed every time I run dbt run.

Since my data volume is very large, and BigQuery bills by the amount of data queried, I would like the event models that are executed by default to no longer run; I really don't need them. Can I disable events through the GA4 package configuration? I don't see anything like that in the documentation.

adamribaudo-velir commented 7 months ago

The default materialization is view, which means that creating those event models (e.g. stg_ga4__event_click) as part of your dbt job won't cause any billing. You'll only be billed if you query the model.

That said, you can disable any models you like in your dbt_project.yml

https://docs.getdbt.com/reference/resource-configs/enabled

Also see here: https://github.com/Velir/dbt-ga4/tree/main/models/staging/recommended_events
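For example, a recommended-event model you never use can be switched off the same way stg_ga4__event_sign_up was switched on earlier in this thread (the model name here is just an example):

```yaml
models:
  ga4:
    staging:
      recommended_events:
        stg_ga4__event_scroll:
          +enabled: false
```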

echolun commented 7 months ago

Thanks for the reply, I now know how to disable some models. But what I don't understand is that default models like stg_ga4__event_click rely on stg_ga4__events, and stg_ga4__events depends on base_ga4__events. With the models refined step by step through queries, won't there be a cost?

dgitis commented 7 months ago

My apologies. I've been away on holidays. Thank you Adam for helping out here.

Those models are views. A view is basically an alias for a query: it is presented as if the table it would produce actually existed, but no table is materialized.

They are best used for models that are steps in building production models but aren't actually queried except in the build process.

You also see them used for joining production models, so that users don't have to perform the join in the data visualization tool, which can often be clunky.

Those models don't cost anything extra; they just give us a place to perform transformations on all event data (stg_ga4__events) or on specific event data (stg_ga4__event_click). Because they use select *, they pick up upstream transformations.

For events, the expectation is that you will add a fact table, e.g. fct_ga4__event_click, on top of those models, materialized as a table or incremental table, where you select only the data that you actually want users to have access to.
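A minimal sketch of such a fact table (the column names are assumptions — check them against stg_ga4__event_click in your own project):

```sql
-- models/marts/fct_ga4__event_click.sql (sketch): materialize only what users need
{{ config(materialized='table') }}

select
    event_date_dt,
    page_location,
    count(*) as click_count
from {{ ref('stg_ga4__event_click') }}
group by 1, 2
```

Because this is a table, downstream dashboard queries scan the small aggregate instead of re-running the view chain over the raw events.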

echolun commented 7 months ago

Thank you very much for your reply. Until now I did not have a deep understanding of tables versus views; now I understand that a view only stores the query for an intermediate step, so when I run dbt run --models stg_ga4__event_click it only saves the query statement and does not scan the data immediately. I had been misunderstanding this. Thanks to Damon and Adam for helping me answer so many questions. I have no more questions, so I will close this issue. Have a nice day!