This repository helps teach people how to correctly define and create cumulative tables!

Cumulative Table Design

Cumulative table design is an extremely powerful data engineering tool that all data engineers should know.

This design produces tables that support efficient analyses over arbitrarily large time frames (up to thousands of days).

Here's a diagram of the high level pipeline design for this pattern:

Cumulative Table diagram

We initially build our daily metrics table that is at the grain of whatever our entity is. This data is derived from whatever event sources we have upstream.

After we have our daily metrics, we FULL OUTER JOIN yesterday's cumulative table with today's daily data and build our metric arrays for each user. This allows us to bring in the new day of history without having to scan all of the old history (a big performance boost).
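
To make that join step concrete, here is a minimal sketch in Presto/Trino syntax. The table and column names (cumulated_user_activity, daily_user_activity, num_likes, like_array) are placeholders for illustration, not the actual tables built later in this README:

    -- Hypothetical sketch: combine yesterday's cumulative snapshot with today's daily metrics
    WITH yesterday AS (
        SELECT * FROM cumulated_user_activity WHERE snapshot_date = DATE '2021-12-31'
    ),
    today AS (
        SELECT * FROM daily_user_activity WHERE activity_date = DATE '2022-01-01'
    )
    SELECT
        COALESCE(t.user_id, y.user_id) AS user_id,
        -- prepend today's count so element 1 of the array is always the most recent day
        ARRAY[COALESCE(t.num_likes, 0)]
            || COALESCE(y.like_array, CAST(ARRAY[] AS ARRAY(INTEGER))) AS like_array,
        DATE '2022-01-01' AS snapshot_date
    FROM today t
    FULL OUTER JOIN yesterday y ON t.user_id = y.user_id

In practice you would also cap the array length (for example with SLICE) so it only retains as many days as your longest analysis window.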

These metric arrays allow us to easily answer queries about the history of all users using things like ARRAY_SUM to calculate whatever metric we want on whatever time frame the array allows.
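
For example, with a hypothetical like_array column where element 1 is the most recent day, 7-day and 30-day like counts can be read straight off the array (again, the table and column names here are illustrative):

    SELECT
        user_id,
        ARRAY_SUM(SLICE(like_array, 1, 7)) AS num_likes_7d,
        ARRAY_SUM(SLICE(like_array, 1, 30)) AS num_likes_30d
    FROM cumulated_user_activity
    WHERE snapshot_date = DATE '2022-01-01'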

The longer the time frame of your analysis, the more critical this pattern becomes!!

Example User activity and engagement cumulated

All queries use Presto/Trino syntax and functions. This example would need to be modified to run on other SQL dialects!

We'll be using the dates:

  • 2022-01-01 as today (in Airflow templating terms, this is {{ ds }})
  • 2021-12-31 as yesterday (in Airflow templating terms, this is {{ yesterday_ds }})

In this example, we'll look at how to use this design to calculate daily, weekly, and monthly active users, as well as each user's likes, comments, and shares.

Our source table in this case is events.
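
For reference, a minimal shape for events that would support the queries below might look like this (the exact schema is an assumption; only user_id, event_type, and event_date are implied by the example):

    -- Assumed minimal shape of the source table
    CREATE TABLE events (
        user_id BIGINT,
        event_type VARCHAR,  -- e.g. 'like', 'comment', 'share'
        event_date DATE
    )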

It's tempting to think the solution is to run a pipeline with a query something like this:

    SELECT 
        COUNT(DISTINCT user_id) as num_monthly_active_users,
        COUNT(CASE WHEN event_type = 'like' THEN 1 END) as num_likes_30d,
        COUNT(CASE WHEN event_type = 'comment' THEN 1 END) as num_comments_30d,
        COUNT(CASE WHEN event_type = 'share' THEN 1 END) as num_shares_30d,
        ...
    FROM events
    WHERE event_date BETWEEN DATE_ADD('day', -30, DATE '2022-01-01') AND DATE '2022-01-01'

The problem with this approach is that we're scanning 30 days of event data every single day just to produce these numbers. It's a simple but pretty wasteful pipeline. Shouldn't there be a way to scan only the newest day of event data and combine it with the results from the previous 29 days? And can we build a data structure that lets a data scientist easily query the number of actions a user took over the last N days?

This design is pretty simple with only 3 steps:

The Daily table step