This is harder to do than it seems. Here's a possible way that we can fix this.
We add a session_timeout parameter, defaulting to 30 minutes, and then check whether a session started within the session_timeout window and only reprocess sessions within that window.
-- change a single session value
case
when session_start_timestamp is null then <new value>
when timestamp_diff( max_partition_timestamp, max(event_timestamp)) < session_timeout then <new value>
else <current value>
end
-- only update sessions within the timeout window
select * from x
where session_start_timestamp is null or timestamp_diff( max_partition_timestamp, max(event_timestamp)) < session_timeout
The idea being that you process a full day (or days) worth of data plus a portion of today's data. We check the max partition timestamp against session timeout and reprocess the sessions within that window (which would have been partially processed in yesterday's run) but not touch sessions from yesterday's run that fall outside the timeout period because they should be complete.
The main consequence of this approach, is that users running daily dbt runs with just the "batch" option (versus "batch+streaming" or just "streaming") will break their sessions because they will only ever have one day worth of data in each overnight run and need to reprocess, weekly for example, to fix sessions.
This is harder to do than it seems. Here's a possible way that we can fix this.
We add a session_timeout parameter, defaulting to 30 minutes, and then check whether a session started within the session_timeout window and only reprocess sessions within that window.
The idea being that you process a full day (or days) worth of data plus a portion of today's data. We check the max partition timestamp against session timeout and reprocess the sessions within that window (which would have been partially processed in yesterday's run) but not touch sessions from yesterday's run that fall outside the timeout period because they should be complete.
The main consequence of this approach, is that users running daily dbt runs with just the "batch" option (versus "batch+streaming" or just "streaming") will break their sessions because they will only ever have one day worth of data in each overnight run and need to reprocess, weekly for example, to fix sessions.