Velir / dbt-ga4

dbt Package for modeling raw data exported by Google Analytics 4. BigQuery support, only.
MIT License
290 stars 129 forks source link

Prevent sessions from splitting when spanning partitions #225

Closed ryan-systematik closed 1 year ago

ryan-systematik commented 1 year ago

This is harder to do than it seems. Here's a possible way that we can fix this.

We add a session_timeout parameter, defaulting to 30 minutes, and then check whether a session started within the session_timeout window and only reprocess sessions within that window.

-- change a single session value
case 
 when session_start_timestamp is null then <new value>
 when timestamp_diff( max_partition_timestamp, max(event_timestamp)) < session_timeout then <new value>
 else <current value>
end 

-- only update sessions within the timeout window
select * from x
where session_start_timestamp is null or  timestamp_diff( max_partition_timestamp, max(event_timestamp)) < session_timeout

The idea being that you process a full day (or days) worth of data plus a portion of today's data. We check the max partition timestamp against session timeout and reprocess the sessions within that window (which would have been partially processed in yesterday's run) but not touch sessions from yesterday's run that fall outside the timeout period because they should be complete.

The main consequence of this approach, is that users running daily dbt runs with just the "batch" option (versus "batch+streaming" or just "streaming") will break their sessions because they will only ever have one day worth of data in each overnight run and need to reprocess, weekly for example, to fix sessions.