Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License

Partition user and client tables #285

Open dgitis opened 7 months ago

dgitis commented 7 months ago

It would be good to have daily-partitioned versions of dim_ga4__client_keys, fct_ga4__client_keys, and fct_ga4__user_ids, similar to what we have for sessions, so that larger sites can disable the non-partitioned models without needing to customize.

The new GA4 user export tables are day-partitioned.

I believe this is related to #251. We would add an optional cutoff date for when to start using Google's user export (because even once enabled, the export didn't immediately start receiving all of the data), merge the two sources of data in the daily tables, and then build the non-partitioned tables from the merged daily tables.
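To make the cutoff idea concrete, here is a minimal Python sketch of the selection logic only, not a dbt model. Everything in it is hypothetical: the function name `merge_daily_rows`, the row keys `day` and `client_key`, and the assumption that the cutoff day itself is taken from Google's user export. It also assumes "merge" means picking one source per day rather than combining both for the same day.

```python
from datetime import date


def merge_daily_rows(event_rows, export_rows, cutoff):
    """Combine two daily sources around a cutoff date (illustrative only).

    event_rows:  rows derived from the events export (hypothetical shape)
    export_rows: rows from Google's user export (hypothetical shape)
    cutoff:      first day on which the user export is trusted (assumed inclusive)
    """
    # Before the cutoff, keep only the rows we derive ourselves from events.
    before = [r for r in event_rows if r["day"] < cutoff]
    # From the cutoff onward, keep only the rows from the user export.
    after = [r for r in export_rows if r["day"] >= cutoff]
    # Sort by day so the result resembles a day-partitioned table.
    return sorted(before + after, key=lambda r: (r["day"], r["client_key"]))
```

In a dbt model the same rule would presumably be a date filter on each source followed by a union, with the cutoff supplied as an optional variable.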

Comparing our client_key fields with the equivalent pseudonymous_users table in the new export, I think it is best to set up our daily tables to contain basically the same data as the new export, renamed and unnested to our usual standard. We would then try to build as much as possible of the data from before the cutoff into that table.

For the non-partitioned tables, do we try to maintain compatibility with our existing fields? For example, the first_device_* and first_geo_* fields don't have equivalents in the GA4 export.

While it would be nice to maintain compatibility, I personally don't use most of those fields.

If others use them, then I'm happy to rebuild that downstream of the daily models.

I am resistant to rebuilding that data on the daily models. If you're trying to reduce costs by using just the daily models, despite the less accurate data, then you probably won't want to do the look-ups required to enhance the daily models either, particularly if you don't use the fields all that often.

Thoughts, @adamribaudo @willbryant?

adamribaudo-velir commented 7 months ago

Waiting for access to a dataset that actually holds this data before weighing in. Should be soon.