Velir / dbt-ga4

dbt package for modeling raw data exported by Google Analytics 4. BigQuery support only.
MIT License

Multi-site incompatible with streaming frequency #186

Closed 3v-dgudaitis closed 1 year ago

3v-dgudaitis commented 1 year ago

The combine_property_data macro doesn't work with frequency: "streaming" or frequency: "daily+streaming".

In the first case, it errors on base_ga4__events because the macro isn't coded to handle streaming.

In the second case, it errors on base_ga4__events_intraday after successfully processing the base_ga4__events table.
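For context, a minimal sketch of the kind of multi-site configuration that triggers this (project, dataset, and property IDs are placeholders, and the exact var layout may differ between package versions — treat this as illustrative, not canonical):

```yaml
vars:
  ga4:
    project: "my-gcp-project"
    dataset: "analytics_combined"
    property_ids: [1111111, 2222222]   # multi-site: triggers combine_property_data
    frequency: "daily+streaming"       # fails on base_ga4__events_intraday
```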

Because of how frequently Google delays processing for more than a day with daily updates, using static_incremental_days: 1 is very fragile. For large sites, there is a risk of missing data because batch data may not have been processed for some sites in a multi-site installation, yet the run will proceed without error.

As a result, it seems that streaming frequency should be the preferred method for multi-site installations, as this guarantees that the expected data is present and minimizes reprocessing.

The combine_property_data macro should support streaming.
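Concretely, supporting streaming would mean the combined dataset also receives the intraday shards, mirroring what the macro already does for the daily events_YYYYMMDD shards. A hedged BigQuery SQL sketch of one such per-shard combine (project, dataset, and property names are hypothetical, and the actual macro generates its statements via Jinja, so the real implementation may copy rather than union):

```sql
-- Hypothetical: replicate one intraday shard from each source property
-- into the combined dataset, alongside the daily events_YYYYMMDD shards.
CREATE OR REPLACE TABLE `my-gcp-project.analytics_combined.events_intraday_20230101` AS
SELECT * FROM `my-gcp-project.analytics_1111111.events_intraday_20230101`
UNION ALL
SELECT * FROM `my-gcp-project.analytics_2222222.events_intraday_20230101`;
```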

Additionally, we should consider whether the macro should either error or warn when a property ID configured in property_ids does not match any tables, so that users with daily updates know to reprocess the data. Alternatively, we should make it clear that they will want to reprocess several days' worth of data to minimize data loss.
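The suggested check is simple to express. A sketch in Python (in the package this logic would live in Jinja/SQL, and the function and variable names here are made up for illustration):

```python
def missing_property_ids(property_ids, dataset_tables):
    """Return configured property IDs that have no discovered source tables.

    property_ids: IDs listed in the `property_ids` var.
    dataset_tables: mapping of property ID -> list of table names
    (e.g. events_YYYYMMDD shards) found in that property's dataset.
    """
    return [pid for pid in property_ids if not dataset_tables.get(pid)]

# A run could then warn (or error) instead of silently proceeding:
missing = missing_property_ids(
    [1111111, 2222222],
    {1111111: ["events_20230101"], 2222222: []},  # 2222222 not yet processed
)
```
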

adamribaudo-velir commented 1 year ago

I do think adding streaming support is valuable, but I'm not following the logic of: "using static_incremental_days: 1 is very fragile" therefore "it seems that streaming frequency should be the preferred method for multi-site installations"

The typical setting for static_incremental_days is 3, to reprocess data and account for updates that Google may introduce within 72 hours. I'd never recommend setting it to 1.

The main benefit of streaming, imo, is having access to same-day data, but I don't see that listed in your issue. Just thought it would be good to align on the goal of this feature.

dgitis commented 1 year ago

Doing the majority of my work with a couple of high-volume sites is coloring my thinking here. With high volume, you don't care about Google fiddling with data after the fact because you have to use just streaming. You only need to account for late-arriving hits from mobile, which aren't an issue on web-only installations. You also don't want to reprocess partitions unless you have to. So, for frequency: "streaming" on a web-only project, you will want static_incremental_days set to 1.
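Under that reasoning, a streaming-only, web-only project would pair the two settings like this (a sketch only; the var layout may differ between package versions):

```yaml
vars:
  ga4:
    frequency: "streaming"
    # Web-only: no late-arriving mobile hits, so partitions don't need
    # reprocessing beyond the most recent day.
    static_incremental_days: 1
```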

But for batch, you are definitely correct: batch with static_incremental_days set to 1 causes further problems when Google's processing is delayed.

willbryant commented 1 year ago

One thing to bear in mind is that even with streaming, Google does adjust the data after the fact, specifically for conversion info, as they get the data from their other services behind the scenes asynchronously.

dgitis commented 1 year ago

Are you certain about that @willbryant? I presume that you mean attribution info, since there's no conversion info in the export. Even so, I haven't noticed the streaming export change.

My understanding of mobile, which is weak, is that the app can hold the events until a device connects to wi-fi. I know that you, Will, work with mobile, so I wonder if the adjustments that you see are Firebase holding events until they can send them without consuming wireless data or Google doing so.

In a streaming context, as far as I understand, you only get late-arriving hits from mobile. You lose attribution data from Google Ads and you lose any machine learning numbers that Google decides to share with us. You possibly lose any server-side bot removal, but it seems like the most impactful bot removal of late comes client-side.

Am I missing anything?

willbryant commented 1 year ago

Sorry I meant attribution, yeah - but for web, not for mobile.

I've definitely seen traffic_source values not be filled in for the "intraday" and then populated for the regular tables later. I think I remember a 3rd-party post saying that they will fill it in intraday for a new attribution, but only in the daily process for attributions back to previous attributions. I haven't tested that but it fit with my observations.

dgitis commented 1 year ago

I'm only aware of attribution changing for Google Ads, which is something that we've mitigated, but I'd prefer to know if I'm wrong.

adamribaudo commented 1 year ago

I think @willbryant is talking about how user-level attribution is missing from the streaming export (see https://support.google.com/analytics/answer/9358801?hl=en) but then appears in the daily tables. @dgitis is talking about the case when GA4 is exporting to streaming ONLY (because their event volume exceeds the 1M events/day limit on daily export).

So I think you're both correct for those situations. User-level attribution updates in the first case, but not in the second case (though I don't have access to any projects that are streaming-only to verify).

dgitis commented 1 year ago

Resolved by #187