Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License
289 stars, 128 forks

multi-property option to analyze properties separately #318

Open tessa-beijloos opened 2 months ago

tessa-beijloos commented 2 months ago

We can select multiple properties like this:

vars:
  ga4:
    property_ids: [11111111, 22222222, 33333333]
    static_incremental_days: 3
    combined_dataset: "my_combined_dataset"

The package then combines the datasets into one (with combined_property_data()). I need a way to analyze the properties separately in the same project. Is there a workaround for this? Thanks in advance!

adamribaudo-velir commented 2 months ago

Yes, the events from each of those properties will have a unique stream_id, so you can filter on it to analyze a single property.

tessa-beijloos commented 2 months ago

Thanks! But does that mean we still combine the datasets from all properties first? I am using the BigQuery connection, and if I create the base table, the query costs will be a lot higher if I first combine the tables and only afterwards filter on stream_id to separate them.

adamribaudo-velir commented 2 months ago

It's up to you whether you'd like 3 separate projects for 3 properties or 1 larger project with 3 properties. You can process and analyze them either independently or jointly. In either case, all data is partitioned on date, so you can limit costs by including date filters.
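
For readers weighing the "separate projects" option, a minimal sketch of what one per-property project's vars might look like. It is derived from the config at the top of this thread by trimming the list to a single ID. Whether the package still expects combined_dataset with a single property is an assumption to check against the package README, and the dataset name below is hypothetical.

```yaml
# dbt_project.yml for a project that processes only one property.
# Sketch only: derived from the multi-property config in this issue.
vars:
  ga4:
    property_ids: [11111111]          # a single property instead of three
    static_incremental_days: 3
    # Assumption: may or may not be needed for one property; name is hypothetical.
    combined_dataset: "my_dataset_11111111"
```

Repeating this in a second and third project (one per property ID) keeps each property's base table separate, at the cost of maintaining three projects.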

Hopefully that helps.

DVDH-000 commented 2 months ago

I suspect it would also help save cost if the base table created in dbt_packages/ga4/models/staging/base/base_ga4__events.sql were clustered on stream_id. What do you think, @adamribaudo-velir?

adamribaudo-velir commented 2 months ago

@DVDH-000 It's currently clustered on event_name. Whether clustering on event_name or stream_id is more performant likely depends on whether the user plans to analyze across streams or within a single stream, which we can't predict.

I'm pretty certain that config setting can be overridden in the project YAML. And if anyone wants to run some empirical tests to demonstrate which is more performant under various scenarios, by all means :)
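
A rough sketch of what such an override might look like in the root project's dbt_project.yml, assuming the model path mentioned above (models/staging/base/base_ga4__events.sql inside the ga4 package). cluster_by is a standard dbt-bigquery model config, but this has not been tested against the package: if the model sets cluster_by in its own config() block, dbt may give that in-file value precedence over the project file, so check the compiled DDL before relying on it.

```yaml
# dbt_project.yml (root project, not the package's own file)
# Sketch only: attempts to change clustering of the ga4 package's base model.
models:
  ga4:
    staging:
      base:
        base_ga4__events:
          # stream_id first to favor per-property queries;
          # event_name kept as a secondary clustering column.
          +cluster_by: ["stream_id", "event_name"]
```

Changing the clustering of an already-built incremental table will likely require a full refresh (dbt run --full-refresh) of that model, which rescans the source data once.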