PostHog / posthog

🦔 PostHog provides open-source web & product analytics, session recording, feature flagging and A/B testing that you can self-host. Get started - free.
https://posthog.com

Partial caching of time-series results #14252

Open macobo opened 1 year ago

macobo commented 1 year ago

Is your feature request related to a problem?

Whenever a graph containing 365 days' worth of data is recalculated, we run the whole calculation over the whole time range.

We could instead leverage past results to speed up queries on dashboards.

Describe the solution you'd like

Note: This project depends on person-on-events, since it requires both person_id and person properties to be immutable across time.

Create a new function similar to execute_timeseries_query(query, filter, now, window_time=timedelta(days=0), buffer_time=timedelta(days=1))

This function would:

  1. Load previous partial results from redis if any
  2. Modify the query to run over time period without any results
  3. Combine cached and clickhouse-fetched data
  4. Save results to redis
  5. Return response
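
A rough sketch of what such a function could look like (the helper names load_partial_results, narrow_query, run_clickhouse_query, combine_results and save_partial_results are hypothetical, not existing PostHog APIs):

```python
from datetime import datetime, timedelta
from typing import List

def execute_timeseries_query(
    query: str,
    filter,
    now: datetime,
    window_time: timedelta = timedelta(days=0),
    buffer_time: timedelta = timedelta(days=1),
) -> List["TimeSeries"]:
    # 1. Load previous partial results from redis, if any
    cached = load_partial_results(filter)

    # 2. Modify the query so it only covers the time period without cached results
    narrowed_query = narrow_query(query, filter, cached, now, window_time, buffer_time)

    # 3. Run the narrowed query and combine cached and clickhouse-fetched data
    fresh = run_clickhouse_query(narrowed_query)
    combined = combine_results(cached, fresh)

    # 4. Save results to redis for the next run
    save_partial_results(filter, combined)

    # 5. Return response
    return combined
```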

Details of a possible implementation

Cache/result format

Response is a List[TimeSeries]

    from dataclasses import dataclass
    from typing import Any, List

    @dataclass
    class TimeSeries:
        values: List[float]
        dates: List[str]
        breakdown_value: Any

Note that zero values are filled in for values/dates (every date in the range gets an entry, with 0 where there is no data).
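
For illustration, a minimal zero-filling helper, assuming daily granularity and date strings formatted as YYYY-MM-DD (none of this is dictated by the issue):

```python
from datetime import date, timedelta
from typing import Dict

def zero_filled_series(start: date, end: date, counts: Dict[str, float], breakdown_value=None) -> TimeSeries:
    # Emit one entry per day in [start, end], using 0 where clickhouse returned no row.
    dates, values = [], []
    day = start
    while day <= end:
        key = day.strftime("%Y-%m-%d")
        dates.append(key)
        values.append(counts.get(key, 0.0))
        day += timedelta(days=1)
    return TimeSeries(values=values, dates=dates, breakdown_value=breakdown_value)
```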

Determining the date range to run the query on

Multiple components should be taken into account when deciding what time range to run the query on:

  1. Person time-to-merge (P): Amount of time we allow for users to change their person_id. We can't cache anything from this period if the query uses person_id. Instance setting.
  2. Ingestion allowance (I): By default we should not cache anything from the running hour/day. Instance setting.
  3. Conversion window (C): Only set for weekly/monthly active users queries. We extend the time period by up to C to make sure conversions are properly accounted for. Passed into the module depending on the callsite.

As such, assuming a query over the [Start, End] time range with cached data covering [CStart, CEnd], we run the query over the range [QStart, End] where

QStart = max(Start, CEnd - max(P, I) - C)

Note that if C is set, some of the freshly queried results need to be discarded and the cached results preferred.
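
A worked example with made-up values (the specific dates and settings below are purely illustrative):

```python
from datetime import datetime, timedelta

start = datetime(2023, 1, 1)       # Start of the requested range
cache_end = datetime(2023, 2, 10)  # CEnd: last date covered by cached results
P = timedelta(days=1)              # person time-to-merge (instance setting)
I = timedelta(days=1)              # ingestion allowance (instance setting)
C = timedelta(days=7)              # conversion window (e.g. WAU)

q_start = max(start, cache_end - max(P, I) - C)
# q_start == 2023-02-02: the query re-covers the last max(P, I) + C of the
# cached window, and cached values are preferred for dates before cache_end.
```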

Modifying the query

This can be done in two ways:

  1. Leveraging the clickhouse SQL parsing built into
  2. Relying on standard parameter naming patterns (e.g. date_from, date_to) and modifying these accordingly
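
A minimal sketch of option (2), assuming the query is parameterised with the standard date_from key (the helper below is hypothetical):

```python
from datetime import datetime
from typing import Any, Dict

def narrow_query_params(params: Dict[str, Any], q_start: datetime) -> Dict[str, Any]:
    # Leave the SQL itself untouched and only move date_from forward to QStart,
    # so the query covers just the uncached part of the range.
    narrowed = dict(params)
    narrowed["date_from"] = q_start.strftime("%Y-%m-%d %H:%M:%S")
    return narrowed
```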

Cache keys

Cache key calculations use normal filter hash methods as used in overall results caching.
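
For example, a partial-results key could simply namespace the existing hash (the format below is an assumption, not the current implementation):

```python
def partial_results_cache_key(team_id: int, filter_hash: str) -> str:
    # Reuse the filter hash from overall results caching, namespaced so that
    # partial time-series entries never collide with full cached responses.
    return f"partial_timeseries_{team_id}_{filter_hash}"
```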

Combining results

In most cases, combining results is done by summing the values for matching dates, as sketched below.
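
A minimal sketch of that step, assuming the TimeSeries shape above and that a date present in only one series keeps its value unchanged:

```python
from collections import defaultdict
from typing import DefaultDict

def combine_series(cached: TimeSeries, fresh: TimeSeries) -> TimeSeries:
    # Sum the values for matching dates; a date present in only one series
    # effectively gets 0 from the other side.
    totals: DefaultDict[str, float] = defaultdict(float)
    for d, v in zip(cached.dates, cached.values):
        totals[d] += v
    for d, v in zip(fresh.dates, fresh.values):
        totals[d] += v
    dates = sorted(totals)
    return TimeSeries(values=[totals[d] for d in dates], dates=dates,
                      breakdown_value=fresh.breakdown_value)
```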

Analysis of specifics

Simple trend queries

    SELECT toUInt16(0) AS total,
         toStartOfDay(toDateTime('2023-02-13 23:59:59', 'UTC') - toIntervalDay(number)) AS day_start
    FROM numbers(dateDiff('day', toStartOfDay(toDateTime('2023-02-06 00:00:00', 'UTC')), toDateTime('2023-02-13 23:59:59', 'UTC')))
    UNION ALL SELECT toUInt16(0) AS total,
         toStartOfDay(toDateTime('2023-02-06 00:00:00', 'UTC'))
    UNION ALL SELECT count(*) as data,
         toStartOfDay(toTimeZone(toDateTime(timestamp, 'UTC'), 'UTC')) as date
    FROM events e
    WHERE team_id = 1
    AND event = '$pageview'
    AND toTimeZone(timestamp, 'UTC') >= toDateTime(toStartOfDay(toDateTime('2023-02-06 00:00:00', 'UTC')), 'UTC')
    AND toTimeZone(timestamp, 'UTC') <= toDateTime('2023-02-13 23:59:59', 'UTC')
    GROUP BY date

This query can be easily combined with partial results.

Note:

Weekly and monthly active users

Click to see the current WAU query

```sql
SELECT groupArray(day_start) as date,
       groupArray(count) as data
FROM (
    SELECT SUM(total) AS count,
           day_start
    FROM (
        SELECT toUInt16(0) AS total,
               toStartOfDay(toDateTime('2023-02-13 23:59:59', 'UTC') - toIntervalDay(number)) AS day_start
        FROM numbers(dateDiff('day', toStartOfDay(toDateTime('2023-02-06 00:00:00', 'UTC')), toDateTime('2023-02-13 23:59:59', 'UTC')))
        UNION ALL
        SELECT toUInt16(0) AS total,
               toStartOfDay(toDateTime('2023-02-06 00:00:00', 'UTC'))
        UNION ALL
        WITH toDateTime('2023-02-13 23:59:59', 'UTC') AS date_to,
             toDateTime('2023-02-06 00:00:00', 'UTC') AS date_from,
             arrayMap(n -> toStartOfDay(toDateTime(n, 'UTC')), range(toUInt32(toDateTime(toStartOfDay(toDateTime('2023-02-06 00:00:00', 'UTC')), 'UTC')), toUInt32(date_to), 86400)) AS buckets
        SELECT counts AS total,
               timestamp AS day_start
        FROM (
            SELECT count(DISTINCT actor_id) AS counts,
                   toStartOfDay(arrayJoin(event_buckets)) as timestamp
            FROM (
                SELECT pdi.person_id as actor_id,
                       toTimeZone(timestamp, 'UTC') as tz_adjusted_timestamp,
                       arrayMap(n -> toDateTime(n, 'UTC'), range(toUInt32(toDateTime(toStartOfDay(if(greater(tz_adjusted_timestamp, date_from), tz_adjusted_timestamp, date_from)))), toUInt32(if(greater(tz_adjusted_timestamp, date_to), date_to, tz_adjusted_timestamp) + INTERVAL 6 DAY), 86400)) AS event_buckets
                FROM events e
                INNER JOIN (
                    SELECT distinct_id,
                           argMax(person_id, version) as person_id
                    FROM person_distinct_id2
                    WHERE team_id = 1
                    GROUP BY distinct_id
                    HAVING argMax(is_deleted, version) = 0
                ) AS pdi ON e.distinct_id = pdi.distinct_id
                WHERE team_id = 1
                  AND event = '$pageview'
                  AND toDateTime(timestamp, 'UTC') >= toDateTime('2023-01-30 00:00:00', 'UTC')
                  AND toTimeZone(timestamp, 'UTC') <= toDateTime('2023-02-13 23:59:59', 'UTC')
                GROUP BY pdi.person_id, tz_adjusted_timestamp
            )
            GROUP BY timestamp
            HAVING has(buckets, timestamp)
            ORDER BY timestamp
        )
    )
    GROUP BY day_start
    ORDER BY day_start
)
```

For a WAU query, Conversion window (C) would be 7 days. The results of the first 7 days of the query would then be discarded as they were only queried to get accurate numbers.
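
For instance, the extra warm-up days from the fresh query could be dropped before caching/merging (a sketch assuming daily buckets; the helper name is made up):

```python
def drop_warmup_days(series: TimeSeries, conversion_window_days: int) -> TimeSeries:
    # Discard the first C days of the freshly queried window: they were only
    # queried so that active-user counts inside the requested range are accurate.
    return TimeSeries(
        values=series.values[conversion_window_days:],
        dates=series.dates[conversion_window_days:],
        breakdown_value=series.breakdown_value,
    )
```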

Additional proposed changes to make it cachable:

Trend formula queries

Formula queries as-is mostly work nicely with partial caching.

However, they don't get the full benefit of partial caching if the formula is being edited. It might perform better if each series were calculated separately in clickhouse and combined in-app, rather than in one large SQL query.
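
As an illustration of in-app combination, a formula like "A / B" could be evaluated point-wise over two separately computed (and separately cacheable) series; formula parsing is elided and the function below is only a hard-coded example:

```python
def apply_formula_a_div_b(series_a: TimeSeries, series_b: TimeSeries) -> TimeSeries:
    # Example for "A / B": divide the series point by point,
    # treating division by zero as 0 to keep the graph well-defined.
    values = [a / b if b else 0.0 for a, b in zip(series_a.values, series_b.values)]
    return TimeSeries(values=values, dates=series_a.dates, breakdown_value=None)
```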

Trend breakdowns

Trend breakdown queries do two clickhouse queries under the hood:

  1. Getting the top N values for breakdown key(s) in time range
  2. Getting time series per top N values in time range

Query (1) makes this mostly incompatible with partial caching as the top N might change significantly as the window widens. Solving this requires making product semantic changes.

Trends number and pie graphs

Naive count queries and pie graphs for trends can be partially cached as the results can be aggregated from each day.

However, if aggregations such as DAUs, medians or percentiles are involved, these graphs cannot be partially cached, since those values cannot be recombined from per-day results.
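
A quick illustration of why, with made-up numbers: per-day counts can be summed into an overall count, but per-day medians cannot be turned into the overall median.

```python
import statistics

day_1 = [1, 2, 3]
day_2 = [4, 100, 200, 300]

# Counts compose: the overall count is just the sum of the daily counts.
assert len(day_1) + len(day_2) == len(day_1 + day_2)

# Medians do not compose: the median of the daily medians differs from the
# median over all underlying values.
print(statistics.median([statistics.median(day_1), statistics.median(day_2)]))  # 76.0
print(statistics.median(day_1 + day_2))                                         # 4
```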

Funnels

Click to see the current funnels query

```sql
SELECT countIf(steps = 1) step_1,
       countIf(steps = 2) step_2,
       countIf(steps = 3) step_3,
       avg(step_1_average_conversion_time_inner) step_1_average_conversion_time,
       avg(step_2_average_conversion_time_inner) step_2_average_conversion_time,
       median(step_1_median_conversion_time_inner) step_1_median_conversion_time,
       median(step_2_median_conversion_time_inner) step_2_median_conversion_time
FROM (
    SELECT aggregation_target,
           steps,
           avg(step_1_conversion_time) step_1_average_conversion_time_inner,
           avg(step_2_conversion_time) step_2_average_conversion_time_inner,
           median(step_1_conversion_time) step_1_median_conversion_time_inner,
           median(step_2_conversion_time) step_2_median_conversion_time_inner
    FROM (
        SELECT aggregation_target,
               steps,
               max(steps) over (PARTITION BY aggregation_target) as max_steps,
               step_1_conversion_time,
               step_2_conversion_time
        FROM (
            SELECT *,
                   if(latest_0 < latest_1 AND latest_1 <= latest_0 + INTERVAL 15 DAY AND latest_1 < latest_2 AND latest_2 <= latest_0 + INTERVAL 15 DAY, 3, if(latest_0 < latest_1 AND latest_1 <= latest_0 + INTERVAL 15 DAY, 2, 1)) AS steps,
                   if(isNotNull(latest_1) AND latest_1 <= latest_0 + INTERVAL 15 DAY, dateDiff('second', toDateTime(latest_0), toDateTime(latest_1)), NULL) step_1_conversion_time,
                   if(isNotNull(latest_2) AND latest_2 <= latest_1 + INTERVAL 15 DAY, dateDiff('second', toDateTime(latest_1), toDateTime(latest_2)), NULL) step_2_conversion_time
            FROM (
                SELECT aggregation_target, timestamp, step_0, latest_0, step_1, latest_1, step_2,
                       min(latest_2) over (PARTITION by aggregation_target ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) latest_2
                FROM (
                    SELECT aggregation_target, timestamp, step_0, latest_0, step_1, latest_1, step_2,
                           if(latest_2 < latest_1, NULL, latest_2) as latest_2
                    FROM (
                        SELECT aggregation_target, timestamp, step_0, latest_0, step_1,
                               min(latest_1) over (PARTITION by aggregation_target ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) latest_1,
                               step_2,
                               min(latest_2) over (PARTITION by aggregation_target ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) latest_2
                        FROM (
                            SELECT e.timestamp as timestamp,
                                   pdi.person_id as aggregation_target,
                                   pdi.person_id as person_id,
                                   if(event = '$pageview', 1, 0) as step_0,
                                   if(step_0 = 1, timestamp, null) as latest_0,
                                   if(event = '$pageview', 1, 0) as step_1,
                                   if(step_1 = 1, timestamp, null) as latest_1,
                                   if(event = '$pageview', 1, 0) as step_2,
                                   if(step_2 = 1, timestamp, null) as latest_2
                            FROM events e
                            INNER JOIN (
                                SELECT distinct_id,
                                       argMax(person_id, version) as person_id
                                FROM person_distinct_id2
                                WHERE team_id = 1
                                GROUP BY distinct_id
                                HAVING argMax(is_deleted, version) = 0
                            ) AS pdi ON e.distinct_id = pdi.distinct_id
                            WHERE team_id = 1
                              AND event IN ['$pageview']
                              AND toTimeZone(timestamp, 'UTC') >= toDateTime('2023-02-07 00:00:00', 'UTC')
                              AND toTimeZone(timestamp, 'UTC') <= toDateTime('2023-02-14 23:59:59', 'UTC')
                              AND (step_0 = 1 OR step_1 = 1 OR step_2 = 1)
                        )
                    )
                )
            )
            WHERE step_0 = 1
        )
    )
    GROUP BY aggregation_target, steps
    HAVING steps = max_steps
)
```

Funnels sadly won't be able to use this mechanism while retaining the current semantics: we ensure that a given person is counted only once over the whole time range, even if they converted multiple times in different conversion windows.

If we wanted to make funnels use partial caching, this would require changing these semantics to allow persons to convert multiple times (in non-overlapping conversion windows). If so, we could calculate coalesce(latest_2, latest_1, latest_0) AS date_to_consider, which could be used as a baseline for doing calculations.

Corner cases

Cohort filters

Since cohorts change over time, any query filtering or breaking down on cohorts is not partially cachable.

Action definition(s)

We currently use action_id when calculating filter_hash. This means that changing the definition of an action can change the meaning of a query but won't change its filter_hash, so stale partial results could be served.

Backfilling data

Users may send us data in the past as part of backfills.

I recommend:

  1. Clearly indicating in the UI, via tooltips, when cached data is used
  2. Providing an option in the UI to wipe the query cache, referencing this in the tooltip
  3. Providing a documented API option to wipe the cache

Describe alternatives you've considered

Building this explicitly into the trends/funnels querying code (instead of as a separate layer). However, the business logic there is already really complicated, and this approach would make it hard to extend the solution to other features in the future.

Additional context

cc @yakkomajuri

neilkakkar commented 1 year ago

@EDsCODE built part of this some time ago, feel free to delete / re-use / work-on-top of: https://github.com/PostHog/posthog/pull/10267