macobo opened 1 year ago
I've picked this task up now. Here are my initial thoughts:

- The table is currently set up with SAMPLE BY distinct_id, not person_id. This means sampling will work best/only for teams that deal with only one type of user (i.e. anonymous or identified, but not both). Unsure about the work needed to change a SAMPLE BY clause at this point.
- We might want SAMPLE n instead of SAMPLE k for appropriate team-level factors. Btw, my suggestion is to test this in prod (behind a feature flag with some opt-in teams ofc) as that'll give us the most/best data.
- quantile is anyway effectively sampled.

Very nice! One more thing waiting for person_id 😅
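For context, the SAMPLE n vs SAMPLE k distinction above maps to ClickHouse's two sampling forms: a fractional coefficient reads a fixed share of the data regardless of table size, while a row count reads at least roughly that many rows, which adapts better to per-team volumes. A hedged sketch (table and column names are illustrative, not the actual schema):

```sql
-- Fractional sampling: read ~10% of rows, whatever the team's volume.
SELECT count() * 10 AS estimated_events
FROM events SAMPLE 0.1
WHERE team_id = 42;

-- Row-count sampling: read at least ~1M rows; ClickHouse picks the
-- coefficient, exposed via _sample_factor for scaling results back up.
SELECT count() * any(_sample_factor) AS estimated_events
FROM events SAMPLE 1000000
WHERE team_id = 42;
```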
UX-wise, the way I've been imagining a sampling toggle is a slider with a logarithmic scale. I assume you'd want to sample data only if the regular queries are too slow, and in that case the sampling rate you'd choose will greatly depend on how much data you have, what query you're running, etc. You'd choose whichever rate makes your query complete in $reasonable_timerange.
I imagine users going "my query is too slow, let's trade accuracy for speed by moving the sampling slider".
RE changing the SAMPLE BY clause of events:

- It requires person_id being in the sorting key, which requires an async migration.
- https://clickhouse.com/docs/en/sql-reference/statements/alter/sample-by/
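Per the linked docs, MODIFY SAMPLE BY only works when the new expression is already contained in the table's primary (sorting) key, which is why the sorting-key migration has to land first. A sketch of the eventual statement (names illustrative):

```sql
-- Only valid once cityHash64(person_id) is part of the primary key,
-- hence the async ORDER BY migration must happen before this step.
ALTER TABLE events MODIFY SAMPLE BY cityHash64(person_id);
```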
Not sure if you've also seen the linked issue: https://github.com/PostHog/posthog/issues/12909 -
When I am exploring data, I want to see results in the fastest way possible.
One way to achieve this would be to sample data. ClickHouse supports sampling via the SAMPLE BY clause.
When sampled, queries would ideally return data in the same shape as the main dataset, but lose precision due to looking at fewer people.
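Sampling in ClickHouse is opt-in per table: the MergeTree engine needs a SAMPLE BY expression declared on the table, and that expression must also appear in ORDER BY. A minimal illustration, not the real PostHog schema:

```sql
CREATE TABLE events
(
    team_id     UInt64,
    event       String,
    distinct_id String,
    timestamp   DateTime
)
ENGINE = MergeTree
ORDER BY (team_id, toDate(timestamp), cityHash64(distinct_id))
SAMPLE BY cityHash64(distinct_id);
```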
Implementation notes / Open topics
What to sample by?
The current table is set up to sample by distinct_id. While we could sample by event, for funnels/paths it makes more sense to sample by person_id.

Setting up the schema for sampling
To sample by person_id, we'd need to re-migrate our main dataset to include person_id in the table ORDER BY. This is a heavy migration, similar to 0002_events_sample_by. This might be a good spot to make other changes to the ORDER BY as well.

UX for querying
There needs to be a toggle for turning sampling on and off when querying.
Ideally, for large customers, sampling would always be on when building insights over large time windows, but it should turn off once the insight is saved or added to a dashboard.
The fact that data is sampled should also be clearly indicated in the UI.
Setting sampling rate
We'll need a system for figuring out the default sampling rate for a given team.
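One possible heuristic (an assumption, not a decided design): derive a team's default rate from its recent event volume, so that sampled queries scan roughly a fixed number of rows. Sketched as a query, with the 10M-row target purely illustrative:

```sql
-- Suggest a per-team default rate; teams under the target keep rate 1.0.
SELECT
    team_id,
    least(1.0, 10000000 / count()) AS default_sample_rate
FROM events
WHERE timestamp > now() - INTERVAL 30 DAY
GROUP BY team_id;
```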