macobo opened 1 year ago
I've picked this task up now. Here are my initial thoughts:

- The table is currently set up with SAMPLE BY distinct_id, not person_id. This means sampling will work best/only for teams that deal with only one type of user (i.e. anonymous or identified, but not both). Unsure about the work needed to change a SAMPLE BY clause at this point.
- We might want SAMPLE n instead of SAMPLE k for appropriate team-level factors. Btw, my suggestion is to test this in prod (behind a feature flag with some opt-in teams ofc) as that'll give us the most/best data.
- quantile is anyway effectively sampled.

Very nice! One more thing waiting for person_id 😅
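For context, the SAMPLE n vs SAMPLE k distinction above maps to ClickHouse's two sampling forms: a fractional coefficient reads a fixed share of the data regardless of table size, while a row count reads at least roughly that many rows, which adapts better to per-team volumes. A hedged sketch (table and column names are illustrative, not the actual schema):

```sql
-- Fractional sampling: read ~10% of rows, whatever the team's volume.
SELECT count() * 10 AS estimated_events
FROM events SAMPLE 0.1
WHERE team_id = 42;

-- Row-count sampling: read at least ~1M rows; ClickHouse picks the
-- coefficient, exposed via _sample_factor for scaling results back up.
SELECT count() * any(_sample_factor) AS estimated_events
FROM events SAMPLE 1000000
WHERE team_id = 42;
```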
UX-wise, the way I've been imagining a sampling toggle is a slider with a logarithmic scale. I assume you'd want to sample data only if the regular queries are too slow, and in that case the sampling rate you'd choose will greatly depend on how much data you have, what query you're running, etc. You'd choose whichever rate makes your query complete in $reasonable_timerange.
I imagine users going "my query is too slow, let's trade accuracy for speed by moving the sampling slider".
RE changing the SAMPLE BY clause of events:

- It requires person_id being in the sorting key, which requires an async migration.
- https://clickhouse.com/docs/en/sql-reference/statements/alter/sample-by/
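Per the linked docs, MODIFY SAMPLE BY only works when the new expression is already contained in the table's primary (sorting) key, which is why the sorting-key migration has to land first. A sketch of the eventual statement (names illustrative):

```sql
-- Only valid once cityHash64(person_id) is part of the primary key,
-- hence the async ORDER BY migration must happen before this step.
ALTER TABLE events MODIFY SAMPLE BY cityHash64(person_id);
```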
Not sure if you've also seen the linked issue: https://github.com/PostHog/posthog/issues/12909 -
When I am exploring data, I want to see results in the fastest way possible.
One way to achieve this would be to sample data. ClickHouse supports sampling via the SAMPLE BY clause.
When sampled, queries would ideally return data in the same shape as the main dataset, but lose precision due to looking at fewer people.
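Sampling in ClickHouse is opt-in per table: the MergeTree engine needs a SAMPLE BY expression declared on the table, and that expression must also appear in ORDER BY. A minimal illustration, not the real PostHog schema:

```sql
CREATE TABLE events
(
    team_id     UInt64,
    event       String,
    distinct_id String,
    timestamp   DateTime
)
ENGINE = MergeTree
ORDER BY (team_id, toDate(timestamp), cityHash64(distinct_id))
SAMPLE BY cityHash64(distinct_id);
```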
Implementation notes / Open topics
What to sample by?
The current table is set up to sample by distinct_id. While we could sample by event, for funnels/paths it makes more sense to sample by person_id.

Setting up the schema for sampling
To sample by person_id, we'd need to re-migrate our main dataset to include person_id in the table ORDER BY. This is a heavy migration, similar to 0002_events_sample_by. This might be a good spot to make other changes to the ORDER BY as well.

UX for querying
There needs to be a toggle for turning sampling on and off when querying.
Ideally, for large customers, sampling would always be on when building insights over large time windows, but it should turn off once the insight is saved or added to a dashboard.
The fact that data is sampled should also be clearly indicated in the UI.
Setting sampling rate
We'll need a system for figuring out the default sampling rate for a given team.
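One possible heuristic (an assumption, not a decided design): derive a team's default rate from its recent event volume, so that sampled queries scan roughly a fixed number of rows. Sketched as a query, with the 10M-row target purely illustrative:

```sql
-- Suggest a per-team default rate; teams under the target keep rate 1.0.
SELECT
    team_id,
    least(1.0, 10000000 / count()) AS default_sample_rate
FROM events
WHERE timestamp > now() - INTERVAL 30 DAY
GROUP BY team_id;
```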