Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License
312 stars 134 forks source link

Support New User ID Tables in GA4 BigQuery Export #205

Open dgitis opened 1 year ago

dgitis commented 1 year ago

Google is adding new fields and tables to the BigQuery export. Here's the announcement from Google IO.

The session-level attribution levels only remove the requirement to unnest data that we're already unnesting. However, they will be adding a new table for user_id data that will contain data for all users whose data changed in the export period (presumably this will be a partition).

For now, there's nothing that we can do without seeing the export in the wild, but I suggest we put a freeze on major changes to user data.

@willbryant @adamribaudo-velir

dgitis commented 1 year ago

Google has changed their documentation and have added the new session-level attribution columns to the BigQuery export.

The collected_traffic_source record is new and holds all of the fields highlighted in that Google IO announcement. I can't find an example where those fields are being populated.

My big question is whether these fields are being set for all events or for just the first batch. I presume it will be all events otherwise how does it differ from the existing source, medium, and gclid event parameters? However, this being Google, I don't want to count on it being how I expect.

adamribaudo-velir commented 1 year ago

Thanks for the heads up. Definitely something to monitor. We'll also have to decide whether to switch to this method or support both the legacy and new method to account for data collected before this change 🤔

dgitis commented 1 year ago

The collected_traffic_source fields are now populating. They seem to populate all events in except session_start and first_visit.

Here's a session with multiple timestamps that I found.

image

dgitis commented 1 year ago

I'd been hearing that the reality of the collected_traffic_sources is different from what I thought after my previous comment and I just verified that here is how these fields get populated.

The collected_traffic_sources fields get populated on all events that happen on the first page of a session except for first_visit and session_start. It is not the first batch events, as I disproved above.

Events on subsequent pages do not propagate the collected_traffic_sources fields.

Basically, these session fields only save us from unnesting the data.

Where session data is concerned, I consider this closed.

Google still plans on adding User-Id export to a completely separate table. I'll be updating the title of this to specify User-Id tables.

adamribaudo-velir commented 6 months ago

User ID & Pseudo User ID tables are documented here: https://support.google.com/analytics/answer/12769371

Just enabled the export on my company's GA4 instance so I hope to return to this soon.

dgitis commented 6 months ago

I checked this a couple of weeks ago and the daily user counts on these tables (user_is and user_pseudo_id) closely match the user counts on our equivalent tables.

These tables are substantially different from our user tables so I think the best path is to copy these into partitioned dim tables and have two separate sets of user tables with very different fields.

The other option is to enhance these new tables with the fields that we currently have in our old tables.

I'd rather be able to disable the old tables on high-volume projects because they aren't partitioned and many of the fields, like the first_* and last_* don't get used all that often.