Is your feature request related to a problem?
Whenever a graph containing 365 days' worth of data is recalculated, we run the whole calculation over the whole time range.
We could instead leverage past results to speed up queries on dashboards.
Describe the solution you'd like
Note: This project depends on person-on-events for both person_id and person properties to be immutable across time.
Create a new function similar to `execute_timeseries_query(query, filter, now, window_time=timedelta(days=0), buffer_time=timedelta(days=1))`
This function would (see the sketch after this list):
Load previous partial results from redis if any
Modify the query to run over time period without any results
Combine cached and clickhouse-fetched data
Save results to redis
Return response
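A minimal sketch of that flow, with the clickhouse call and redis access reduced to injected callables and the response simplified to a date → value mapping (all names here are illustrative, not existing PostHog APIs; window_time/buffer_time are assumed to cover the conversion window and the caching buffer described under "Determining date range" below):
```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Optional


def execute_timeseries_query(
    query: str,
    date_from: datetime,
    date_to: datetime,
    now: datetime,  # would be used to exclude the running hour/day (ingestion allowance)
    run_query: Callable[[str, datetime, datetime], Dict[str, float]],
    load_cache: Callable[[], Optional[Dict[str, float]]],
    save_cache: Callable[[Dict[str, float]], None],
    window_time: timedelta = timedelta(days=0),
    buffer_time: timedelta = timedelta(days=1),
) -> Dict[str, float]:
    cached = load_cache() or {}  # 1. load previous partial results from redis, if any

    # 2. modify the query so it only covers the period without trustworthy cached results
    query_start = date_from
    if cached:
        last_cached = datetime.fromisoformat(max(cached))
        query_start = max(date_from, last_cached - window_time - buffer_time)

    # 3. fetch the missing range from clickhouse
    fresh = run_query(query, query_start, date_to)

    # 4. combine cached and freshly fetched data (fresh values win for re-queried days;
    #    see "Combining results" below for the summing variant)
    combined = {**cached, **fresh}

    save_cache(combined)  # 5. save results back to redis
    return combined       # 6. return response
```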
Details of a possible implementation
Cache/result format
Response is a `List[TimeSeries]`:
```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class TimeSeries:
    values: List[float]   # one value per entry in dates, zero-filled
    dates: List[str]      # ISO-formatted dates covering the queried range
    breakdown_value: Any  # breakdown key this series corresponds to, if any
```
Note that values/dates are zero-filled for dates with no data.
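A possible zero-filling helper at the application layer, matching the dataclass above (illustrative only):
```python
from datetime import date, timedelta


def fill_zeroes(series: "TimeSeries", date_from: date, date_to: date) -> "TimeSeries":
    # Expand a sparse TimeSeries so every day in [date_from, date_to] has an entry,
    # inserting 0.0 for days ClickHouse returned no rows for.
    existing = dict(zip(series.dates, series.values))
    all_dates = [
        (date_from + timedelta(days=offset)).isoformat()
        for offset in range((date_to - date_from).days + 1)
    ]
    return TimeSeries(
        values=[existing.get(d, 0.0) for d in all_dates],
        dates=all_dates,
        breakdown_value=series.breakdown_value,
    )
```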
Determining date range to run query on
Multiple components should be taken into account when deciding what time range to run query on:
Person time-to-merge (P): The amount of time we allow for users to change their person_id. We can’t cache anything from this period if the query uses person_id. Instance setting.
Ingestion allowance (I): By default we should not cache anything from the running hour/day. Instance setting.
Conversion window (C): Only set for weekly/monthly active users queries. We extend the time period by up to C to make sure conversions are properly accounted for. Passed into the module depending on the callsite.
As such, assuming a query over the [Start, End] time range with cached data covering [CStart, CEnd], we run the query over [QStart, End], where
QStart = max(Start, CEnd - max(P, I) - C)
Note that if C is set, some results need to be discarded/cached results preferred.
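Expressed directly in code (P, I and C as timedeltas; a sketch only):
```python
from datetime import datetime, timedelta


def query_start(
    start: datetime,                              # Start: requested range start
    cache_end: datetime,                          # CEnd: end of the cached data
    person_time_to_merge: timedelta,              # P (instance setting)
    ingestion_allowance: timedelta,               # I (instance setting)
    conversion_window: timedelta = timedelta(0),  # C, only for WAU/MAU-style queries
) -> datetime:
    # QStart = max(Start, CEnd - max(P, I) - C)
    return max(
        start,
        cache_end - max(person_time_to_merge, ingestion_allowance) - conversion_window,
    )
```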
Modifying the query
This can be done in two ways:
Leveraging the clickhouse SQL parsing built into
Relying on standard parameter naming patterns (e.g. date_from, date_to) and modifying these accordingly
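For the second approach, a sketch that only touches the bound query parameters rather than parsing the SQL (parameter names are assumed to follow the date_from/date_to convention):
```python
from datetime import datetime
from typing import Any, Dict


def restrict_to_uncached_range(
    params: Dict[str, Any], query_start: datetime, query_end: datetime
) -> Dict[str, Any]:
    # Relies purely on the date_from/date_to naming convention; the SQL itself is untouched.
    updated = dict(params)
    updated["date_from"] = query_start.strftime("%Y-%m-%d %H:%M:%S")
    updated["date_to"] = query_end.strftime("%Y-%m-%d %H:%M:%S")
    return updated
```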
Cache keys
Cache key calculations use normal filter hash methods as used in overall results caching.
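Hypothetically, the partial-results key could reuse the same idea as the existing filter hash, namespaced so it never collides with the overall results cache (names and hashing details here are illustrative, not PostHog's actual implementation):
```python
import hashlib
import json


def partial_cache_key(filter_json: dict, team_id: int) -> str:
    # Stable digest of the serialized filter plus team, prefixed so partial results
    # are stored separately from the overall insight results cache.
    payload = json.dumps(filter_json, sort_keys=True)
    digest = hashlib.md5(f"{payload}_{team_id}".encode("utf-8")).hexdigest()
    return f"partial_timeseries_{digest}"
```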
Combining results
In most cases, combining results is done by summing the values for matching dates.
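A sketch of that combination for the TimeSeries shape above, assuming both sides are zero-filled over their own ranges:
```python
from collections import defaultdict


def combine_series(cached: "TimeSeries", fresh: "TimeSeries") -> "TimeSeries":
    # Sum values for matching dates; dates present in only one side keep their value.
    totals = defaultdict(float)
    for series in (cached, fresh):
        for day, value in zip(series.dates, series.values):
            totals[day] += value
    dates = sorted(totals)  # ISO dates sort chronologically
    return TimeSeries(
        values=[totals[d] for d in dates],
        dates=dates,
        breakdown_value=cached.breakdown_value,
    )
```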
Analysis of specifics
Simple trend queries
```sql
SELECT toUInt16(0) AS total,
toStartOfDay(toDateTime('2023-02-13 23:59:59', 'UTC') - toIntervalDay(number)) AS day_start
FROM numbers(dateDiff('day', toStartOfDay(toDateTime('2023-02-06 00:00:00', 'UTC')), toDateTime('2023-02-13 23:59:59', 'UTC')))
UNION ALL SELECT toUInt16(0) AS total,
toStartOfDay(toDateTime('2023-02-06 00:00:00', 'UTC'))
UNION ALL SELECT count(*) as data,
toStartOfDay(toTimeZone(toDateTime(timestamp, 'UTC'), 'UTC')) as date
FROM events e
WHERE team_id = 1
AND event = '$pageview'
AND toTimeZone(timestamp, 'UTC') >= toDateTime(toStartOfDay(toDateTime('2023-02-06 00:00:00', 'UTC')), 'UTC')
AND toTimeZone(timestamp, 'UTC') <= toDateTime('2023-02-13 23:59:59', 'UTC')
GROUP BY date
```
This query can be easily combined with partial results.
Note:
This removes the current query’s behavior of wrapping dates/values into arrays (groupArray). That step is better handled at the application layer anyway, since we need to construct dataclasses to combine results.
We should consider using WITH FILL, or filling in zeroes at the application layer, instead of the multiple UNIONs, to make the resulting SQL easier to parse.
Weekly and monthly active users
Click to see the current WAU query
```sql
SELECT groupArray(day_start) as date,
groupArray(count) as data
FROM (
SELECT SUM(total) AS count,
day_start
FROM (
SELECT toUInt16(0) AS total,
toStartOfDay(toDateTime('2023-02-13 23:59:59', 'UTC') - toIntervalDay(number)) AS day_start
FROM numbers(dateDiff('day', toStartOfDay(toDateTime('2023-02-06 00:00:00', 'UTC')), toDateTime('2023-02-13 23:59:59', 'UTC')))
UNION ALL SELECT toUInt16(0) AS total,
toStartOfDay(toDateTime('2023-02-06 00:00:00', 'UTC'))
UNION ALL WITH toDateTime('2023-02-13 23:59:59', 'UTC') AS date_to,
toDateTime('2023-02-06 00:00:00', 'UTC') AS date_from,
arrayMap(n -> toStartOfDay(toDateTime(n, 'UTC')), range(toUInt32(toDateTime(toStartOfDay(toDateTime('2023-02-06 00:00:00', 'UTC')), 'UTC')), toUInt32(date_to), 86400)) AS buckets
SELECT counts AS total,
timestamp AS day_start
FROM (
SELECT count(DISTINCT actor_id) AS counts,
toStartOfDay(arrayJoin(event_buckets)) as timestamp
FROM (
SELECT pdi.person_id as actor_id,
toTimeZone(timestamp, 'UTC') as tz_adjusted_timestamp,
arrayMap(n -> toDateTime(n, 'UTC'), range(toUInt32(toDateTime(toStartOfDay(if(greater(tz_adjusted_timestamp, date_from), tz_adjusted_timestamp, date_from)))), toUInt32(if(greater(tz_adjusted_timestamp, date_to), date_to, tz_adjusted_timestamp) + INTERVAL 6 DAY), 86400)) AS event_buckets
FROM events e
INNER JOIN (
SELECT distinct_id,
argMax(person_id, version) as person_id
FROM person_distinct_id2
WHERE team_id = 1
GROUP BY distinct_id
HAVING argMax(is_deleted, version) = 0
) AS pdi
ON e.distinct_id = pdi.distinct_id
WHERE team_id = 1
AND event = '$pageview'
AND toDateTime(timestamp, 'UTC') >= toDateTime('2023-01-30 00:00:00', 'UTC')
AND toTimeZone(timestamp, 'UTC') <= toDateTime('2023-02-13 23:59:59', 'UTC')
GROUP BY pdi.person_id,
tz_adjusted_timestamp
)
GROUP BY timestamp
HAVING has(buckets, timestamp)
ORDER BY timestamp
)
)
GROUP BY day_start
ORDER BY day_start
)
```
For a WAU query, Conversion window (C) would be 7 days. The results of the first 7 days of the query would then be discarded as they were only queried to get accurate numbers.
Additional proposed changes to make it cachable:
Remove the unnecessary nesting
Simplify filling logic
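A sketch of the discard step described above, with daily results represented as a simple date → value mapping (names are illustrative):
```python
from datetime import date
from typing import Dict


def drop_conversion_buffer(daily_actives: Dict[str, float], keep_from: date) -> Dict[str, float]:
    # The query range was extended back by the conversion window (C) purely so the first
    # kept days see a full trailing window; results before keep_from are discarded here,
    # with cached values preferred for those days.
    return {
        day: value
        for day, value in daily_actives.items()
        if date.fromisoformat(day) >= keep_from
    }
```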
Trend formula queries
Formula queries as-is mostly work nicely with partial caching.
However, they don’t get the full benefit of partial caching when the formula is being edited. It might perform better if each series were calculated separately in clickhouse and combined in-app, rather than in one large SQL query.
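For example, a formula such as A / B could hypothetically be evaluated in the application layer over two independently cached (and partially recomputed) series:
```python
def apply_formula_a_div_b(series_a: "TimeSeries", series_b: "TimeSeries") -> "TimeSeries":
    # Combine two already-aligned, zero-filled series in the application layer so each
    # underlying series can be cached and partially recomputed on its own.
    assert series_a.dates == series_b.dates
    values = [a / b if b else 0.0 for a, b in zip(series_a.values, series_b.values)]
    return TimeSeries(values=values, dates=list(series_a.dates), breakdown_value=None)
```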
Trend breakdowns
Trend breakdown queries do two clickhouse queries under the hood:
Getting the top N values for breakdown key(s) in time range
Getting time series per top N values in time range
Query (1) makes this mostly incompatible with partial caching as the top N might change significantly as the window widens. Solving this requires making product semantic changes.
Trends number and pie graphs
Naive count queries and pie graphs for trends can be partially cached as the results can be aggregated from each day.
However, if aggregations such as DAUs, medians or percentiles are involved, these graphs cannot be partially cached.
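For plain counts, the aggregate is just the sum of the cached daily values per breakdown (a sketch; this additivity is exactly what DAUs, medians and percentiles lack):
```python
from typing import Dict, List


def totals_from_daily_values(series_list: List["TimeSeries"]) -> Dict[object, float]:
    # Valid only for additive aggregations (e.g. event counts): uniques, medians and
    # percentiles cannot be rebuilt from per-day partial results like this.
    return {series.breakdown_value: sum(series.values) for series in series_list}
```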
Funnels
Click to see the current funnels query
```sql
SELECT countIf(steps = 1) step_1,
countIf(steps = 2) step_2,
countIf(steps = 3) step_3,
avg(step_1_average_conversion_time_inner) step_1_average_conversion_time,
avg(step_2_average_conversion_time_inner) step_2_average_conversion_time,
median(step_1_median_conversion_time_inner) step_1_median_conversion_time,
median(step_2_median_conversion_time_inner) step_2_median_conversion_time
FROM (
SELECT aggregation_target,
steps,
avg(step_1_conversion_time) step_1_average_conversion_time_inner,
avg(step_2_conversion_time) step_2_average_conversion_time_inner,
median(step_1_conversion_time) step_1_median_conversion_time_inner,
median(step_2_conversion_time) step_2_median_conversion_time_inner
FROM (
SELECT aggregation_target,
steps,
max(steps) over (PARTITION BY aggregation_target) as max_steps,
step_1_conversion_time,
step_2_conversion_time
FROM (
SELECT *,
if(latest_0 < latest_1 AND latest_1 <= latest_0 + INTERVAL 15 DAY AND latest_1 < latest_2 AND latest_2 <= latest_0 + INTERVAL 15 DAY, 3, if(latest_0 < latest_1 AND latest_1 <= latest_0 + INTERVAL 15 DAY, 2, 1)) AS steps ,
if(isNotNull(latest_1) AND latest_1 <= latest_0 + INTERVAL 15 DAY, dateDiff('second', toDateTime(latest_0), toDateTime(latest_1)), NULL) step_1_conversion_time,
if(isNotNull(latest_2) AND latest_2 <= latest_1 + INTERVAL 15 DAY, dateDiff('second', toDateTime(latest_1), toDateTime(latest_2)), NULL) step_2_conversion_time
FROM (
SELECT aggregation_target,
timestamp,
step_0,
latest_0,
step_1,
latest_1,
step_2,
min(latest_2) over (PARTITION by aggregation_target ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) latest_2
FROM (
SELECT aggregation_target,
timestamp,
step_0,
latest_0,
step_1,
latest_1,
step_2,
if(latest_2 < latest_1, NULL, latest_2) as latest_2
FROM (
SELECT aggregation_target,
timestamp,
step_0,
latest_0,
step_1,
min(latest_1) over (PARTITION by aggregation_target ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) latest_1,
step_2,
min(latest_2) over (PARTITION by aggregation_target ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) latest_2
FROM (
SELECT e.timestamp as timestamp,
pdi.person_id as aggregation_target,
pdi.person_id as person_id ,
if(event = '$pageview', 1, 0) as step_0,
if(step_0 = 1, timestamp, null) as latest_0,
if(event = '$pageview', 1, 0) as step_1,
if(step_1 = 1, timestamp, null) as latest_1,
if(event = '$pageview', 1, 0) as step_2,
if(step_2 = 1, timestamp, null) as latest_2
FROM events e
INNER JOIN (
SELECT distinct_id,
argMax(person_id, version) as person_id
FROM person_distinct_id2
WHERE team_id = 1
GROUP BY distinct_id
HAVING argMax(is_deleted, version) = 0
) AS pdi
ON e.distinct_id = pdi.distinct_id
WHERE team_id = 1
AND event IN ['$pageview']
AND toTimeZone(timestamp, 'UTC') >= toDateTime('2023-02-07 00:00:00', 'UTC')
AND toTimeZone(timestamp, 'UTC') <= toDateTime('2023-02-14 23:59:59', 'UTC')
AND (step_0 = 1 OR step_1 = 1 OR step_2 = 1)
)
)
)
)
WHERE step_0 = 1
)
)
GROUP BY aggregation_target,
steps
HAVING steps = max_steps
)
```
Funnels sadly won’t be able to use this mechanism while retaining the current semantics. We ensure that a given person is counted only once over the whole time range even if they converted multiple times in different conversion windows.
If we wanted to make funnels use partial caching, this would require changing these semantics to allow persons to convert multiple times (in non-overlapping conversion windows). If so, we could calculate `coalesce(latest_2, latest_1, latest_0) AS date_to_consider`, which could be used as a baseline for the calculations.
Corner cases
Cohort filters
Since cohorts change over time, any query filtering or breaking down on cohorts is not partially cachable.
Action definition(s)
We currently use action_id when calculating filter_hash. This means that changing the definition of an action can change the meaning of a query but won’t change its filter_hash, so previously cached partial results could be served stale.
Backfilling data
Users may send us data in the past as part of backfills.
I recommend:
Clearly indicating in the UI, via tooltips, when cached data is used
Providing an option in the UI to wipe the query cache, referencing it from the tooltip
A documented API option to wipe the cache.
Describe alternatives you've considered
Building this into the trends/funnel query building explicitly (instead of as a separate layer). However, the business logic there is already really complicated, and this would make it hard to extend the solution onto other features in the future.
Additional context
cc @yakkomajuri