Closed 3v-dgudaitis closed 1 year ago
I do think adding streaming support is valuable, but I'm not following the logic of: "using `static_incremental_days: 1` is very fragile" therefore "it seems that streaming frequency should be the preferred method for multi-site installations".
The typical setting for `static_incremental_days` is 3, to reprocess data and account for updates that Google may introduce within 72 hours. I'd never recommend setting it to 1.
The main benefit of streaming, imo, is having access to same-day data, but I don't see that listed in your issue. Just thought it would be good to align on the goal of this feature.
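For reference, `static_incremental_days` is set under the package vars in `dbt_project.yml`. A minimal sketch of the typical setting (var names follow the dbt-ga4 README but may differ between package versions, and the project/property values are placeholders):

```yaml
# dbt_project.yml (sketch; values are placeholders)
vars:
  ga4:
    source_project: "my-gcp-project"   # hypothetical GCP project holding the GA4 export
    property_ids: [123456789]          # hypothetical GA4 property
    static_incremental_days: 3         # reprocess the last 3 days to pick up Google's late updates
```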
Doing the majority of my work with a couple of high-volume sites is coloring my thinking here. With high-volume, you don't care about Google fiddling with data after the fact because you have to use streaming alone. You only need to account for late-arriving hits from mobile, which aren't an issue on web-only installations. You also don't want to reprocess partitions unless you have to. So, with `frequency: "streaming"` on a web-only project, you will want `static_incremental_days` set to 1.
But for batch, you are definitely correct: batch with `static_incremental_days` set to 1 causes further problems when Google's processing is delayed.
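Concretely, the streaming-only, web-only setup described above might look like this in `dbt_project.yml` (a sketch under the same caveat that var names may differ between package versions):

```yaml
vars:
  ga4:
    frequency: "streaming"        # high-volume / web-only: rely on the streaming export alone
    static_incremental_days: 1    # safe here, since there are no batch tables to lag behind
```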
One thing to bear in mind is that even with streaming, Google does adjust the data after the fact, specifically for conversion info, as they get the data from their other services behind the scenes asynchronously.
Are you certain about that @willbryant? I presume that you mean attribution info, since there's no conversion info in the export. Even still, I haven't noticed the streaming export change.
My understanding of mobile, which is weak, is that the app can hold events until the device connects to Wi-Fi. I know that you, Will, work with mobile, so I wonder whether the adjustments that you see are Firebase holding events until they can send them without consuming mobile data, or Google doing so.
In a streaming context, as far as I understand, you only get late arriving hits from mobile. You lose attribution data from Google Ads and you lose any machine learning numbers that Google decides to share with us. You possibly lose any server-side bot removal, but it seems like the most impactful bot removal of late comes client-side.
Am I missing anything?
Sorry I meant attribution, yeah - but for web, not for mobile.
I've definitely seen traffic_source values not be filled in for the "intraday" and then populated for the regular tables later. I think I remember a 3rd-party post saying that they will fill it in intraday for a new attribution, but only in the daily process for attributions back to previous attributions. I haven't tested that but it fit with my observations.
I'm only aware of attribution changing for Google Ads, which is something that we've mitigated, but I'd prefer to know if I'm wrong.
I think @willbryant is talking about how user-level attribution is missing from the streaming export (see https://support.google.com/analytics/answer/9358801?hl=en) but then appears in the daily tables. @dgitis is talking about the case when GA4 is exporting to streaming ONLY (because their event volume exceeds the 1M events/day limit on daily export).
So I think you're both correct for those situations. User-level attribution updates in the first case, but not in the second case (though I don't have access to any projects that are streaming-only to verify).
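One way to test this would be to compare `traffic_source` for the same day's events in the intraday versus daily tables. A hedged BigQuery sketch (dataset, project, and date suffix are placeholders; note the intraday table is normally deleted once the daily table lands, so this assumes you snapshotted a copy of it first):

```sql
-- Compare traffic_source between a snapshot of the intraday table and the daily table.
-- `my-project`, `analytics_123456789`, and the snapshot table name are placeholders.
SELECT
  d.user_pseudo_id,
  d.event_timestamp,
  i.traffic_source.source AS intraday_source,
  d.traffic_source.source AS daily_source
FROM `my-project.analytics_123456789.events_20240101` AS d
JOIN `my-project.analytics_123456789.events_intraday_snapshot_20240101` AS i
  USING (user_pseudo_id, event_timestamp, event_name)
WHERE i.traffic_source.source IS DISTINCT FROM d.traffic_source.source
LIMIT 100;
```

Rows returned by this query would be events whose attribution was filled in or revised between the intraday and daily exports.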
Resolved by #187
The `combine_property_data` macro doesn't work with `frequency: "streaming"` or `frequency: "daily+streaming"`. In the first case, it errors on `base_ga4__events` because the macro isn't coded to handle streaming. In the second case, it errors on `base_ga4__events_intraday` after successfully processing the `base_ga4__events` table.

Because Google frequently delays daily processing by more than a day, using `static_incremental_days: 1` is very fragile. For large sites, there is a risk of missing data: batch data may not have been processed for some sites in a multi-site installation, but the run will proceed without error. As a result, it seems that streaming frequency should be the preferred method for multi-site installations, as it guarantees that the expected data is present, thus minimizing reprocessing.
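For context, the failing multi-property configuration looks roughly like this (a sketch; the property IDs are placeholders and var names may differ between package versions):

```yaml
vars:
  ga4:
    property_ids: [123456789, 987654321]  # properties combined by combine_property_data
    frequency: "daily+streaming"          # errors on base_ga4__events_intraday
    static_incremental_days: 3
```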
The `combine_property_data` macro should support streaming.

Additionally, we should consider whether the macro should error or warn when a property ID configured in `property_ids` does not match any tables, so that users with daily updates know to reprocess the data. Alternatively, we should make it clear that they will want to reprocess several days' worth of data to minimize data loss.
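One possible shape for the suggested check, sketched as a dbt macro. The `INFORMATION_SCHEMA` query, the `on_missing_property` var, and the macro name are assumptions for illustration, not part of the package:

```sql
{% macro check_property_tables(property_ids) %}
  {# Sketch: warn (or error) when a configured property has no exported events tables. #}
  {% if execute %}
    {% for property_id in property_ids %}
      {% set count_sql %}
        select count(*) as n
        from `{{ var('source_project') }}.analytics_{{ property_id }}`.INFORMATION_SCHEMA.TABLES
        where table_name like 'events%'
      {% endset %}
      {% set n = run_query(count_sql).columns[0].values()[0] %}
      {% if n == 0 %}
        {% if var('on_missing_property', 'warn') == 'error' %}
          {{ exceptions.raise_compiler_error("No events tables found for property " ~ property_id) }}
        {% else %}
          {{ exceptions.warn("No events tables found for property " ~ property_id ~ "; the daily export may be delayed - consider reprocessing several days.") }}
        {% endif %}
      {% endif %}
    {% endfor %}
  {% endif %}
{% endmacro %}
```

`exceptions.warn` and `exceptions.raise_compiler_error` are dbt's built-in Jinja functions for exactly this warn-vs-error choice, which keeps the behavior configurable per project.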