Open robbie-c opened 7 months ago
commenting so I see responses too :)
will have a think about this in the new year if that's not too slow
num_pageviews: countIf(events.event == '$pageview')
num_autocaptures: countIf(events.event == '$autocapture')
could this be a map of event to count? (without checking if CH does that) there should be a relatively low number of distinct events per session and that reduces the need to keep adding columns
We should definitely consider if it makes sense to use the existing replay stuff for this.
Always a risk of trying to over re-use, but I think in this case it would make a lot of sense, as so much of the replay data can actually enrich the session data, with us only needing a couple of replay-only fields.
I think I'd rather join than write to the same table here... or rather the replay_events table and its queries weren't designed with sessions without recordings in mind so we'd need to account for that...
presumably we could generate the separate table and then join against it all without affecting production so that we can test performance of the join
Definitely should get @fuziontech's brain & advice 🧙
Thinking out loud here - I don't actually think this AggregatingMergeTree-based solution works well enough with how SAMPLE works in clickhouse, and we'll need sampling to make web analytics performant for customers with millions of sessions a week.
Explanation:
MergeTree
, I would assume AggregatingMergeTree
works the same way but it isn't mentioned explicitly in the docs)distinct_id
rather than the session_id
. (If we used the session_id
, the unique user count would be inflated)distinct_id
as part of the primary key, but we can't as it's a computed valueAnother way we could handle this would be to copy completed sessions into a new table. In this case it'd be pretty straightforward to have the distinct_id in the primary key
This would make life a lot easier... if showing the number of unique users is the hard part we could just not do that. Then session_id can be the sample key, and the number of sessions./ page views would still be accurate. This is maybe more justifiable under the "cheap mass market web analytics" product we are aiming for.
we'll need sampling to make web analytics performant for customers with millions of sessions a week.
We'll possibly need sampling 😊
ClickHouse is surprisingly fast if we can set the data up appropriately
we'd want to have the distinct_id as part of the primary key, but we can't as it's a computed value
I'm maybe missing something... distinct_id is a concrete value that is used to discover person id, no? (I do my best not to think about persons and identity so could be confused :)) if it's person id that's the problem. delay this until after PoE and then aggregate using person id instead of distinct id??
Maybe easier if we work from the other direction. What queries do we want to run that we can't today or that are slow today?
the replay_events table and its queries weren't designed with sessions without recordings in mind so we'd need to account for that...
post xmas brain...
We have event_count
on the replay table. So any session with event_count = 0 is a session without a recording...
We'll possibly need sampling 😊
That's a fair point, my back of envelope calculation was that we get an average of 20 events per session so looking at sessions rather than events would be a 20x speedup, which doesn't approach the 100x speedup we'd need to show data "for all time" for our biggest customers, whereas sampling could do that. With that said, there's other places we could get this speedup, for example right now we use person properties which we don't need to.
One thought is that we could go ahead with the AggregatingMergeTree approach for now, and if we need the extra performance from sampling, we could use this table as the source of session data to copy into the new table.
Create a Session table using AggregatingMergeTree, watching the Events table.
This would speed up some queries on web analytics that create this as a CTE. This is currently a bottleneck for showing data "for all time" on the web analytics dashboard on the posthog project.
It'd also be useful for answering questions like "why am I getting a spike of traffic today?" with something like the following hogql:
Currently you would do this at the persons level, but that gives you first-touch attribution so doesn't necessarily tell you why a bunch of your existing users re-engaged with your product today (which could be due to an email, an influencer, a tweet, etc)
I'd propose the following columns
The replay team might have some other suggestions for things to add. We do already have a session replay table, but this does not contain all sessions, only those that actually have a replay.