PostHog / posthog

🦔 PostHog provides open-source product analytics, session recording, feature flagging and A/B testing that you can self-host.
https://posthog.com

Sampling support in querying #12908

Open macobo opened 1 year ago

macobo commented 1 year ago

When I am exploring data, I want to see results in the fastest way possible.

One way to achieve this would be to sample data. ClickHouse supports this via the SAMPLE clause in queries, on tables whose definition declares a sampling key with SAMPLE BY.

When sampled, queries would ideally return data in the same shape as the main dataset but lose precision due to looking at fewer people.
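As a rough illustration of "same shape, less precision", here is a minimal sketch of a sampled count query and the scaling step that turns a sampled count into an estimate. The function names are hypothetical, and the `events` table with `timestamp` and `team_id` columns is an assumption about the schema, not the actual query layer.

```python
def sampled_daily_count_query(sampling_rate: float) -> str:
    """Build a daily event-count query that reads only a fraction of the rows."""
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    # In a ClickHouse SELECT, SAMPLE goes right after the table name.
    # Table and column names here are assumptions for illustration.
    return (
        "SELECT toDate(timestamp) AS day, count() AS sampled_count "
        f"FROM events SAMPLE {sampling_rate:f} "
        "WHERE team_id = %(team_id)s "
        "GROUP BY day ORDER BY day"
    )


def estimate_total(sampled_count: int, sampling_rate: float) -> int:
    """Scale a count from the sample back up to an estimate for the full dataset."""
    return round(sampled_count / sampling_rate)
```

The result rows keep the same columns as the unsampled query; only the counts need to be divided by the sampling rate before display, and they become estimates rather than exact values.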

Implementation notes / Open topics

What to sample by?

The current table is set up to sample by distinct_id. While we could sample by event, for funnels/paths it makes more sense to sample by person_id.
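To make the funnels/paths point concrete, here is a small sketch of deterministic, hash-based sampling by person. It uses SHA-256 from the standard library purely as a stand-in hash, and the event dictionaries and function names are hypothetical. The property that matters is that a person is either fully in the sample or fully out of it, so multi-step sequences are never cut in half.

```python
import hashlib


def in_sample(key: str, rate: float) -> bool:
    """Deterministically decide whether a key falls into the sampled fraction."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def sample_by_person(events: list[dict], rate: float) -> list[dict]:
    """Keep every event of a sampled person, and no events of anyone else."""
    return [e for e in events if in_sample(e["person_id"], rate)]


# Hypothetical funnel data: each person has a signup and a purchase event.
events = [
    {"person_id": f"person-{i}", "event": step}
    for i in range(1000)
    for step in ("signup", "purchase")
]
sampled = sample_by_person(events, rate=0.1)
```

Sampling by event instead would drop each row independently, so a person's signup could survive while their purchase is dropped, which would distort funnel conversion.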

Setting up the schema for sampling

To sample by person_id, we'd need to re-migrate our main dataset to include person_id in the table's ORDER BY. This is a heavy migration, similar to 0002_events_sample_by.

This might be a good spot to make other changes to the ORDER BY as well.

UX for querying

Queries need a toggle for turning sampling on and off.

Ideally, for large customers, sampling would be on by default when building insights over large time windows, and would turn off once the insight is saved or added to a dashboard.

The fact that data is sampled should also be clearly indicated in the UI.

Setting sampling rate

We'll need a system for figuring out the default sampling rate for a given team.
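One possible starting point, sketched under stated assumptions: choose the rate so that the expected number of rows read stays under a fixed budget, then snap down to a small ladder of allowed rates. The budget, the ladder, and the function name are all hypothetical, not a decided design.

```python
# Hypothetical ladder of rates offered to users, in descending order.
ALLOWED_RATES = (1.0, 0.5, 0.25, 0.1, 0.05, 0.01)


def default_sampling_rate(events_in_window: int, target_rows: int = 10_000_000) -> float:
    """Pick the largest allowed rate that keeps rows read at or under target_rows."""
    if events_in_window <= target_rows:
        return 1.0  # small enough to query without sampling
    ideal = target_rows / events_in_window
    for rate in ALLOWED_RATES:
        if rate <= ideal:
            return rate
    # Even the smallest rate exceeds the budget; fall back to it anyway.
    return ALLOWED_RATES[-1]
```

For example, a team with 100 million events in the queried window and a 10 million row budget would default to a rate of 0.1.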

yakkomajuri commented 1 year ago

I've picked this task up now. Here are my initial thoughts:

Present vs. future

Benchmarking

UX


Query layer

Complementary queries

mariusandra commented 1 year ago

Very nice! One more thing waiting for person_id 😅

UX-wise, the way I've been imagining a sampling toggle is a slider with a logarithmic scale. I assume you'd want to sample data only if the regular queries are too slow, and in that case the sampling rate you'd choose will greatly depend on how much data you have, what query you're running, etc. You'd choose whichever rate makes your query complete in $reasonable_timerange.

I imagine users going "my query is too slow, let's trade accuracy for speed by moving the sampling slider".
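A minimal sketch of what a logarithmic slider mapping could look like, assuming a slider position between 0 and 1 and a hypothetical minimum rate of 0.001 (both are illustrative choices, not part of any existing UI):

```python
import math


def slider_to_rate(position: float, min_rate: float = 0.001) -> float:
    """Map a slider position in [0, 1] to a sampling rate in [min_rate, 1]."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be in [0, 1]")
    # Equal slider steps multiply the rate by a constant factor.
    return min_rate ** (1.0 - position)


def rate_to_slider(rate: float, min_rate: float = 0.001) -> float:
    """Inverse mapping, e.g. to place the slider handle at a default rate."""
    return 1.0 - math.log(rate) / math.log(min_rate)
```

With these assumptions, position 1 means no sampling, position 0 means the minimum rate, and each third of the slider changes the rate by a factor of 10.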

yakkomajuri commented 1 year ago

RE changing the SAMPLE BY clause of events:

https://clickhouse.com/docs/en/sql-reference/statements/alter/sample-by/

neilkakkar commented 1 year ago

Not sure if you've also seen the linked issue: https://github.com/PostHog/posthog/issues/12909