PostHog / posthog

🦔 PostHog provides open-source product analytics, session recording, feature flagging and A/B testing that you can self-host.
https://posthog.com

Create Session table using AggregatingMergeTree, watching the Events table #19289

Open robbie-c opened 7 months ago

robbie-c commented 7 months ago

Create a Session table using AggregatingMergeTree, watching the Events table.

This would speed up some web analytics queries that currently build this as a CTE; that CTE is currently the bottleneck for showing data "for all time" on the web analytics dashboard for the posthog project itself.
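A minimal sketch of the shape this could take, assuming the session id is extracted from event properties (all table, column, and view names here are illustrative, not a finalised schema):

```sql
-- Sessions table holding partial aggregation states, one logical row per session.
CREATE TABLE sessions
(
    team_id Int64,
    session_id String,
    start_timestamp AggregateFunction(min, DateTime64(6, 'UTC')),
    end_timestamp AggregateFunction(max, DateTime64(6, 'UTC')),
    num_pageviews AggregateFunction(sum, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (team_id, session_id);

-- Materialized view that "watches" the events table: every insert into
-- events is folded into the matching session's aggregate states.
CREATE MATERIALIZED VIEW sessions_mv TO sessions AS
SELECT
    team_id,
    JSONExtractString(properties, '$session_id') AS session_id,
    minState(timestamp) AS start_timestamp,
    maxState(timestamp) AS end_timestamp,
    sumState(toUInt64(event = '$pageview')) AS num_pageviews
FROM events
GROUP BY team_id, session_id;
```

Reading it back then needs the -Merge combinators, e.g. `minMerge(start_timestamp)` and `sumMerge(num_pageviews)` grouped by session_id, since data parts may still hold unmerged states.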

It'd also be useful for answering questions like "why am I getting a spike of traffic today?" with something like the following HogQL:

```sql
select $virt_channel_type from sessions where start_timestamp >= today()
```

Currently you would do this at the persons level, but that gives you first-touch attribution, so it doesn't necessarily tell you why a bunch of your existing users re-engaged with your product today (which could be due to an email, an influencer, a tweet, etc.).

I'd propose the following columns

The replay team might have some other suggestions for things to add. We do already have a session replay table, but this does not contain all sessions, only those that actually have a replay.

pauldambra commented 6 months ago

commenting so I see responses too :)

will have a think about this in the new year if that's not too slow

```sql
num_pageviews: countIf(events.event == '$pageview')
num_autocaptures: countIf(events.event == '$autocapture')
```

Could this be a map of event to count? (Without having checked whether CH does that.) There should be a relatively low number of distinct events per session, and that would reduce the need to keep adding columns.
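(ClickHouse does support this: `sumMap` works over a `Map(String, UInt64)` column and is one of the functions allowed inside `SimpleAggregateFunction`. A sketch, with illustrative names and the same MV-on-events setup as above:)

```sql
-- One map column instead of a new countIf column per event type.
CREATE TABLE sessions_event_counts
(
    session_id String,
    event_counts SimpleAggregateFunction(sumMap, Map(String, UInt64))
)
ENGINE = AggregatingMergeTree()
ORDER BY session_id;

-- Each event contributes a single-entry map; merges sum the counts per key.
CREATE MATERIALIZED VIEW sessions_event_counts_mv TO sessions_event_counts AS
SELECT
    JSONExtractString(properties, '$session_id') AS session_id,
    sumMap(map(event, toUInt64(1))) AS event_counts
FROM events
GROUP BY session_id;

-- Reading back: re-apply sumMap to combine any not-yet-merged parts.
SELECT session_id, sumMap(event_counts) AS event_counts
FROM sessions_event_counts
GROUP BY session_id;
```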

benjackwhite commented 6 months ago

We should definitely consider if it makes sense to use the existing replay stuff for this.

There's always a risk of over-reusing, but I think in this case it would make a lot of sense: so much of the replay data can actually enrich the session data, and we'd only need a couple of replay-only fields.

pauldambra commented 6 months ago

I think I'd rather join than write to the same table here... or rather, the replay_events table and its queries weren't designed with sessions that have no recording in mind, so we'd need to account for that...

Presumably we could generate the separate table and then join against it, all without affecting production, so that we can test the performance of the join - something like the sketch below.
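Something along these lines could be run ad hoc to measure it (the replay-side column name here is an assumption for illustration, not the actual schema):

```sql
-- Hypothetical join-performance test: enrich sessions with replay-only fields.
SELECT s.session_id, s.start_timestamp, r.console_error_count
FROM
(
    SELECT session_id, minMerge(start_timestamp) AS start_timestamp
    FROM sessions
    GROUP BY session_id
) AS s
LEFT JOIN
(
    -- console_error_count is a hypothetical replay-only field for this sketch
    SELECT session_id, sum(console_error_count) AS console_error_count
    FROM session_replay_events
    GROUP BY session_id
) AS r USING (session_id);
```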

Definitely should get @fuziontech's brain & advice 🧙

robbie-c commented 6 months ago

Thinking out loud here - I don't actually think this AggregatingMergeTree-based solution works well enough with how SAMPLE works in ClickHouse, and we'll need sampling to make web analytics performant for customers with millions of sessions a week.

Explanation:

  1. The sampling key must be contained inside the primary key (see the docs). (Caveat: that's documented for a plain MergeTree; I'd assume AggregatingMergeTree works the same way, but it isn't mentioned explicitly.)
  2. The sampling key should be the distinct_id rather than the session_id (if we used the session_id, the unique user count would be inflated).
  3. Points 1 and 2 mean we'd want the distinct_id as part of the primary key, but we can't have that, as it's a computed value - see the sketch after this list.
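A sketch of what constraints 1 and 2 imply for the table definition (names are illustrative; `cityHash64` is just one common choice of sampling hash):

```sql
-- The SAMPLE BY expression must appear in the sort/primary key, so sampling
-- by user means a hash of distinct_id has to sit in ORDER BY - which is the
-- problem in point 3: under the AggregatingMergeTree approach distinct_id
-- is a computed value, so it can't be placed there.
CREATE TABLE sessions_sampled
(
    team_id Int64,
    distinct_id String,
    session_id String
)
ENGINE = MergeTree()
ORDER BY (team_id, cityHash64(distinct_id), session_id)
SAMPLE BY cityHash64(distinct_id);
```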

Alternatives:

Copy completed sessions

Another way we could handle this would be to copy completed sessions into a new table. In this case it'd be pretty straightforward to have the distinct_id in the primary key.

Don't display unique users

This would make life a lot easier... if showing the number of unique users is the hard part, we could just not do that. Then session_id can be the sample key, and the number of sessions / page views would still be accurate (see the sketch below). This is maybe more justifiable under the "cheap mass-market web analytics" product we are aiming for.
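A sketch of how that would read, assuming a plain table of completed sessions created with `SAMPLE BY cityHash64(session_id)` (names illustrative):

```sql
-- Counts of sessions and pageviews scale up cleanly under sampling;
-- unique users is the one metric that would be distorted.
SELECT
    count() * 10 AS estimated_sessions,
    sum(num_pageviews) * 10 AS estimated_pageviews
FROM completed_sessions
SAMPLE 1 / 10
WHERE team_id = 1;
```

(ClickHouse also exposes a `_sample_factor` virtual column, so the scale factor doesn't have to be hard-coded.)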

pauldambra commented 6 months ago

> we'll need sampling to make web analytics performant for customers with millions of sessions a week.

We'll possibly need sampling 😊

ClickHouse is surprisingly fast if we can set the data up appropriately

> we'd want to have the distinct_id as part of the primary key, but we can't as it's a computed value

I'm maybe missing something... distinct_id is a concrete value that is used to discover the person id, no? (I do my best not to think about persons and identity, so I could be confused :)) If it's the person id that's the problem, could we delay this until after PoE and then aggregate using person id instead of distinct id?

Maybe easier if we work from the other direction. What queries do we want to run that we can't today or that are slow today?

pauldambra commented 6 months ago

> the replay_events table and its queries weren't designed with sessions without recordings in mind so we'd need to account for that...

post xmas brain...

We have event_count on the replay table. So any session with event_count = 0 is a session without a recording...
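i.e. if every session were written to that table, something like this (a hedged sketch against the current replay schema) would find them:

```sql
-- Sessions whose replay rows sum to zero events have no recording.
SELECT session_id
FROM session_replay_events
GROUP BY session_id
HAVING sum(event_count) = 0;
```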

robbie-c commented 6 months ago

> We'll possibly need sampling 😊

That's a fair point. My back-of-envelope calculation was that we average 20 events per session, so looking at sessions rather than events would be a 20x speedup - which doesn't approach the 100x speedup we'd need to show data "for all time" for our biggest customers, whereas sampling could get us there. That said, there are other places we could find speedups; for example, right now we use person properties, which we don't need to.

One thought is that we could go ahead with the AggregatingMergeTree approach for now, and if we need the extra performance from sampling, we could use this table as the source of session data to copy into the new table.