PostHog / posthog

🦔 PostHog provides open-source product analytics, session recording, feature flagging and A/B testing that you can self-host.
https://posthog.com

Save old session recordings to S3 #2142

Closed macobo closed 9 months ago

macobo commented 3 years ago

Is your feature request related to a problem?

We prune session recordings regularly. However, accessing old recordings can give a lot of insight.

We could (if credentials are provided) store the session data in S3 or a compatible service. This way of storing data is less expensive and faster for long-lived recordings.

Describe the solution you'd like

User adds their credentials under project settings and we back this data up, storing references to saved sessions in Postgres.

On Cloud, we could allow longer retention if the user is on a higher tier.

Describe alternatives you've considered

Additional context

Split this out from #1846 as clickhouse work has made this a bit harder to do.

Thank you for your feature request – we love each and every one!

paolodamico commented 3 years ago

Great stuff! Just adding to the conversation here: as this will permit storing recordings for longer periods (even indefinitely), I suggest we ship it with a mechanism to comply with GDPR (or equivalent) requests and erase recordings that pertain to the people who file them.

paolodamico commented 2 years ago

Please see the relevant Slack thread. Currently, recordings on ClickHouse-based installations are only kept for 3 weeks. There's no way to preserve recordings for longer (either all of them or a subset). As such, it may be more important to consider something like this.

Some customers have reported pain with this (see thread for details).

macobo commented 2 years ago

@pauldambra @rcmarron @hazzadous @timgl @yakkomajuri some feedback on the current plan for session recordings.

Outline/my understanding of the plan

Instead of storing individual snapshot events for session recordings in clickhouse, we save each event to S3, storing pointers in clickhouse and fetching the events from s3 when fetching the session.

Problems with the approach

Performance

On average, session recordings have 150 events, p95 being 500 events and p99 of 2691 events (metabase)

This means to load a given session we need to:

  1. Still query clickhouse, fetching N payloads containing pointers to that session
  2. Load each event separately from S3 (or a compatible alternative)

(1) is still the same old query - this will still load a bunch of data. (2) is executed serially in python, meaning that if each request takes 100ms we end up spending 15s loading an average session.
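
To make the arithmetic concrete, here's a minimal sketch of that serial read path (assuming boto3; the bucket name and per-event key layout are hypothetical). Parallelising the GETs would hide some latency, but not the request count or cost:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "posthog-recordings"  # hypothetical bucket name


def fetch_session_events(keys: list[str]) -> list[bytes]:
    # One GET per snapshot event, issued serially from Python:
    # ~150 keys x ~100 ms each ≈ 15 s for an average session.
    return [s3.get_object(Bucket=BUCKET, Key=key)["Body"].read() for key in keys]
```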

Expense

Retrieval: S3 GET requests cost $0.0004 per 1,000 requests (source). Loading a thousand average recordings means roughly 150,000 GETs, so it will cost us at least $0.06 (which isn't terrible).

Saving these sessions: we would be doing on the order of billions of PUT requests per month to S3. According to their pricing page, this would cost us thousands to tens of thousands of dollars per month (metabase).

Ingestion

Ingestion/plugin-server is a really performance-sensitive piece of infrastructure, where adding 100ms per ingested event can mean we aren't able to keep up with ingestion.

Note that forwarding each event to S3 means a harder time keeping the plugin-server lively.

Alternatives

My original proposal to @rcmarron was to:

  1. Create 1-N files per recording rather than per event.
  2. Upload to S3 in batches, asynchronously from ingestion.
  3. Store recording-to-S3 pointers in a table in ClickHouse or Postgres for fast retrieval.
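
For illustration, a rough sketch of what steps 1-3 could look like (assuming boto3; the bucket name, key layout and pointer store are placeholders, not the actual design):

```python
import gzip
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "posthog-recordings"  # hypothetical


def flush_recording_chunk(team_id: int, session_id: str, chunk_index: int, events: list[dict]) -> str:
    # Steps 1-2: one of the 1-N files per *recording* (not one object per event),
    # written in a batch job that runs asynchronously from ingestion.
    key = f"session_recordings/{team_id}/{session_id}/{chunk_index}.json.gz"
    body = gzip.compress(json.dumps(events).encode("utf-8"))
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
    return key


def store_pointer(team_id: int, session_id: str, key: str) -> None:
    # Step 3: persist the recording -> S3 pointer in ClickHouse or Postgres
    # so listing/reading stays fast. Left as a placeholder here.
    ...
```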

Note

I want this to succeed, so if I've missed something obvious as to why this will work, let me know.

mariusandra commented 2 years ago

Great analysis! I've raised similar points earlier in a review and in a few calls.

The alternative proposal seems solid, and we can still reuse the object storage work.

rcmarron commented 2 years ago

It seems like a next step here is to hash out more details on what an alternative approach could look like.

Product requirements

Note: this isn't an exhaustive list, but these are important

Proposal

I'm very open to other options here, but wanted to put some thoughts on paper to start a discussion (I believe this captures what @macobo and I had talked through a couple months ago)

The big advantages of this approach are that:

cc @pauldambra @mariusandra

macobo commented 2 years ago

@rcmarron thanks for the writeup, this sounds good to me, especially capturing the nuances of splitting the recording into chunks!

Couple of additions on my part:

Product requirements

Additions to proposal

Other notes

The list of recordings is generated by merging data from the existing event list and the "session recording table" above.

Run two queries (which should return identical schemas), one for sessions in object storage and one for sessions in ClickHouse, and deduplicate (uniq). You can also UNION ALL these :)

Note that if the clickhouse list query becomes slow long-term we can set up projections to speed that query up!
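
For illustration, a sketch of that merged listing (assuming clickhouse-driver; `session_recordings_object_storage` is a hypothetical name for the object-storage pointer table):

```python
from clickhouse_driver import Client

client = Client(host="localhost")

LIST_RECORDINGS_SQL = """
SELECT session_id, any(distinct_id) AS distinct_id, min(start_time) AS start_time
FROM (
    -- sessions already rolled up into object storage (hypothetical pointer table)
    SELECT session_id, distinct_id, start_time
    FROM session_recordings_object_storage
    WHERE team_id = %(team_id)s

    UNION ALL

    -- sessions still living in the existing ClickHouse table
    SELECT session_id, any(distinct_id) AS distinct_id, min(timestamp) AS start_time
    FROM session_recording_events
    WHERE team_id = %(team_id)s
    GROUP BY session_id
)
GROUP BY session_id
ORDER BY start_time DESC
"""

rows = client.execute(LIST_RECORDINGS_SQL, {"team_id": 2})
```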

hazzadous commented 2 years ago

Late to the party 🥳

One more alternative. Given that we already have a buffer, Kafka, we do not need to push these into ClickHouse. Rather, we can batch and push the data to S3 on the fly, keeping ClickHouse only for analytics workloads related to sessions, e.g. calculating session duration.

  1. It deals with the expense point to some extent; we'd need to tweak some knobs re batch size etc.
  2. It partly addresses read performance. We'd be reading anything with a prefix of $session_id, and we can add pagination to handle progressive loading (see the sketch after this list).
  3. It sidesteps any issues with the plugin-server by using Kafka Connect to handle this instead. We could also just write a Consumer specifically for this task if we wanted more control. TBH I haven't looked at what the plugin-server is doing with these events.
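
A minimal sketch of the prefix-based read path from point 2 (assuming boto3 and a hypothetical `<team_id>/<session_id>/` key prefix for the objects the sink writes):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "posthog-recordings"  # hypothetical


def read_session(team_id: int, session_id: str) -> list[bytes]:
    # List every object the sink wrote for this session, then fetch them in order.
    prefix = f"{team_id}/{session_id}/"
    chunks = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            chunks.append(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())
    return chunks
```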

This would avoid needing to do any batch jobs to push the data from ClickHouse to S3.

Disclaimer: I have never used Kafka Connect or the S3 Sink before, so we may want to consider using our own Consumer instead. It should, however, be battle-tested at large scale.

I've put together a prototype here #prsoverissues

hazzadous commented 2 years ago

(also, happy to help out however is needed)

pauldambra commented 2 years ago

love the kafka to s3 connector idea @hazzadous!

(2) is executed serially in python, meaning that if each request takes 100ms we end up spending 15s loading an average session.

I'd expect s3/minio reads to be way faster than that... but #9901 was only intended to be a deployable step. I was anticipating passing the paths back instead of the data and having a slim API endpoint to let the UI read them. That would let that data end up in cloudflare and browser caches so if there are subsequent views they don't need to come from us.

macobo commented 2 years ago

Given that we already have a buffer that is Kafka, we do not need to push these into ClickHouse. Rather we can batch and push data to S3 in batches on the fly, thereby keeping ClickHouse only for analytics workloads related to sessions, e.g. for calculation of session duration.

Note that one of the problems outlined in https://github.com/PostHog/posthog/issues/2142#issuecomment-1143220818 is that we're still storing every single session recording event as a separate object in S3. From a retrieval perspective this is slow/unnecessary, as we have a concept of sessions instead.

You mentioned prefixing by $session_id - perhaps that solves that concern for us? Can you elaborate a bit/go into more technical detail?

It sidesteps any issues with the plugin-server by using Kafka Connect to handle this instead. We could also just write a Consumer specifically for this task if we wanted more control. TBH I haven't looked at what the plugin-server is doing with these events.

Plugin-server currently

You can find the code in plugin-server/src/worker/ingestion/process-event.ts

@pauldambra

That would let that data end up in cloudflare and browser caches so if there are subsequent views they don't need to come from us.

Is that the request pattern for session recordings? My expectation is that the vast majority of recordings are read between 0..1 times, with the actual value being closer to 0.01 in most cases. In this case caching isn't really relevant, as almost all recordings are opened at most once.

hazzadous commented 2 years ago

Note that one of the problems outlined in https://github.com/PostHog/posthog/issues/2142#issuecomment-1143220818 is that we're still storing every single session recording event as a separate object in S3. From a retrieval perspective this is slow/unnecessary, as we have a concept of sessions instead.

Note that the S3 connector handles batching events into single S3 objects. For instance, you can give each object a size limit, a message-count limit, or a time limit, or a combination of these; e.g. each batch should be no larger than 10 MB or older than 30 seconds, whichever comes first. I think this should suffice on the PUT-cost front, although there's still some risk there.
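
As a rough illustration (not the prototype's actual config): the Confluent S3 sink is driven through the Kafka Connect REST API, with `flush.size` giving a record-count limit and `rotate.interval.ms` a time limit per object. The topic, bucket and values below are placeholders.

```python
import json

import requests  # assumes a reachable Kafka Connect worker on :8083

connector = {
    "name": "session-recordings-s3-sink",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "session_recording_events",    # placeholder topic
        "s3.bucket.name": "posthog-recordings",  # placeholder bucket
        "flush.size": "500",                     # rotate after N records...
        "rotate.interval.ms": "30000",           # ...or after 30s, whichever comes first
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    },
}

requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
```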

As for slow reads, that could be an issue in some cases 🤔. This can partially be tackled with parallel reads, which scale pretty well on S3, but it's something to watch out for. If we wanted to, we could periodically compact, or have a separate connector that handles larger chunks at the expense of latency.

Plugin-server currently

As for the other functions of the plugin-server, we can leave those as they are if required. Either we continue sending the same message as we do now (see the prototype) or we can send a stripped-down version. Either way, I propose we initially continue sending the messages as they are; we can feature flag the session snapshot endpoint and nothing else needs to change for the time being.

My expectation is that the vast majority recordings are read between 0..1

I'm not sure exactly what the pattern looks like, I'll have a look at the logs/PostHog insight for that!

hazzadous commented 2 years ago

You mentioned prefixing by $session_id - perhaps that solves that concern for us? Can you elaborate a bit/go into more technical detail?

This is how it's implemented in the prototype. Not optimized, but gives the general flow of things.

guidoiaquinti commented 2 years ago

Data is put into ClickHouse as it is today

I would like to challenge this point as well. The proposal from @hazzadous looks very solid to me and I don't see a major downside to implementing it.

posthog-contributions-bot[bot] commented 2 years ago

This issue has 2002 words at 13 comments. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

  1. Write some code and submit a pull request! Code wins arguments
  2. Have a sync meeting to reach a conclusion
  3. Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

Is this issue intended to be sprawling? Consider adding label epic or sprint to indicate this.

pauldambra commented 2 years ago

This issue has 2002 words at 13 comments.

so responding to a specific part...

Data is put into ClickHouse as it is today

Instead of having an ongoing process where we write to ClickHouse and then migrate to S3 after a delay

We have a setting/flag that enables writing to S3 and an async migration that moves data from ClickHouse to S3 once.

Before that migration has started: we write to CH and S3, and fall back to reading from CH. While it is running: we write snapshot data only to S3, and read only from S3.

After it has run: no snapshot data is written to or stored in CH.

At this point we don't need to send snapshot data to the plugin server so it is handling smaller messages

NB I'm making two assumptions

1) reading/writing to CH and object storage have similar latency

We're still transferring the same amount of data over the network (at least by the end of the migration)

2) we want to allow people to carry on storing recordings in CH

Don't turn on the setting/flag. Never run the migration. Code has two read/write paths based on that setting

pauldambra commented 2 years ago

So, I think we're discussing between two options (if we ignore the option of not doing it)

|  | A | B |
| --- | --- | --- |
| Ingestion | Write all to CH | Write all except snapshot to CH, write snapshot to S3 |
| Chunking | On a timer, chunk from CH to S3 with kafka connect | At write time, chunk to S3 with kafka connect |
| Read | From S3, fallback to CH | From CH and S3 (to cover recordings started in CH and finished in S3). Table TTL means CH-stored data is eventually removed and all recordings only have snapshots in S3 |

In both options

pauldambra commented 2 years ago

I think that

affect both options equally

mariusandra commented 2 years ago

This sounds like an hour-long Thursday meeting. Let's get everyone in a room and talk it over? @pauldambra @hazzadous @guidoiaquinti @macobo @rcmarron

To add more context and more fuel to the issue, my big fear is this:

At write time, chunk to S3 with kafka connect

First, it looks like kafka connect is another container you need to run via docker/helm? Adding another dependency like that is not free... 🤔 Also, those 13 todos in that PR scare me 😅. Doable of course, but it's a big project to get this just right.

Second, how can we reliably ensure that the stream of snapshot events for each $team_id/$session_id/$window_id combo gets chunked into as few 100kB-1MB pieces as possible? This means not smaller, not larger.

In my mind, the main complexity with this project is in assuring we can reliably chunk the data. Otherwise the worst-case "we will pay a 💩-ton of money" scenario outlined by Karl here kicks in, and we'll have a new project to implement what we could have done in the first place :).

Does kafka-connect employ some kind of dark magic to assure such chunking at our volumes, or what am I missing here? We read data from Kafka in batches, and then mark the entire batch as processed. The batches in the plugin server usually go up to a few hundred events. We could possibly make them bigger, but if you assume we're receiving snapshots from a few thousand unique users every minute, it's highly unlikely that more than just a few events for each $team_id/$session_id/$window_id end up in one batch.

With this, we'll reduce the number of files uploaded to S3 by 1x (expected) to 3x (best case), not the 50x we could achieve with post-batching.

Is there some special place where kafka-connect stores these in-flight events to assure decently sized chunks? In memory? Can we configure it to use clickhouse for buffering? 🤣.

So what am I missing?

pauldambra commented 2 years ago

Does kafka-connect employ some kind of dark magic to assure such chunking with our volumes

yes :) https://github.com/PostHog/posthog/pull/10142/files#diff-44c65cdf7d29921a0b4dcf47453356304816877fad1a587637b769fd0dc0580dR32

So there's no code for us to maintain to do it; it's set to a 10x saving, but we can tweak that as needed.

We read data from Kafka in batches,

No reading for us to do... we write to Kafka (which is the storage while the data isn't yet committed to S3), and kafka-connect writes to S3 in chunks for us.

mariusandra commented 2 years ago

Is that the max or the min saving? Where does it buffer the events to ensure it?

mariusandra commented 2 years ago

Apologies for continuing to harp on this :).

I'm still trying to understand the math here. As Karl calculated above, an average recording has 150 events, and it's more like 500 at the p95. Each recording lasts several minutes.

So for the sake of simplicity, let's say 150 events over 15 minutes in one recording. That's 10 per minute. Now also assume we get about 1000 session recording events per second (60k/min). (Real metrics are a bit different, and I'm not sharing them here, but this works well as a rough example.)

This means 10 events out of every 60000 belong to this session. I don't know how kafka-connect works under the hood, but unless it has a huge persistent volume that it uses for buffering, it likely stores these chunks in memory, or only chunks per-batch. Even if it loads 1000 events per batch (4x what we experience on cloud during lag when batch sizes get huge), it's still very very unlikely there will be any chunking happening.
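
A back-of-the-envelope version of that argument (illustrative numbers only):

```python
# How many events of a single session land in one consumer batch?
events_per_session_per_min = 10      # ~150 events over ~15 minutes
total_events_per_min = 60_000        # ~1000 snapshot events per second
batch_size = 1_000                   # generous batch

session_share = events_per_session_per_min / total_events_per_min  # ~0.00017
expected_per_batch = batch_size * session_share                    # ~0.17 events

# A given session contributes well under one event to a typical batch,
# so per-batch grouping barely chunks anything for that session.
print(f"~{expected_per_batch:.2f} events of this session per {batch_size}-event batch")
```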

If kafka-connect actually does have a "one minute memory buffer" (needs to be measured in gigabytes), this means our recordings will always lag by at least a minute or two, not the neat "few seconds" they do now.

TL;DR: this probably will work as advertised if we have very few groups, but we have thousands of unique $window_ids every minute, and growing.

Thus, please, let's go with post-processing. :)

pauldambra commented 2 years ago

This means 10 events out of every 60000 belong to this session.

That's not right (or at least it's mathematically correct but not how this works - although @hazzadous can correct me maybe)

We've set the topic to be partitioned by team_id, session_id, and window_id https://github.com/PostHog/posthog/pull/10142/files#diff-44c65cdf7d29921a0b4dcf47453356304816877fad1a587637b769fd0dc0580dR25

So if you have a stream of 150 messages for a window in a session, it will flush at every tenth message, writing 15 times rather than 150, regardless of how many other sessions there are.

And each of those sessions will be chunked at every tenth message.

Frustratingly you can't link directly to specific config... https://docs.confluent.io/kafka-connect-s3-sink/current/configuration_options.html

You can also set millisecond intervals. So, write every tenth event or every 500ms, whichever comes first (for the sake of argument).

hazzadous commented 2 years ago

@mariusandra do you happen to know what the post-processing requirements would be, e.g. in terms of ClickHouse queries per second, data volume, and ClickHouse memory? I haven't looked at this at all, and we should definitely hammer out the numbers for either solution. I guess we can be selective in terms of which recordings we migrate.

We do have experience with doing large migrations on ClickHouse with the async migrations I guess, so that's one data point on how this would perform.

Kafka Connect is definitely an unexplored bit, as is how the S3 Connector handles high-cardinality partitioning. I don't know exactly how it would behave re memory, storage etc., but this data is either on a Kafka process/disk or in ClickHouse memory/disk.

To de-risk the Kafka Connect piece of the puzzle we could have a look at just emitting the events into a separate Kafka, testing it out with AWS MSK Connect, and observing how it performs. It's on the list of things to do; @pauldambra maybe if you have time we could do this, but at the moment I'm trying to focus on other priorities, after which I should be a little more free.

mariusandra commented 2 years ago

If it's a real kafka partition that will be created for each group, we're in for a whole different host of issues. We will create tens or hundreds of thousands of unique partitions per day.

Currently Kafka supports:

Zookeeperless Kafka, whenever it's out, should support millions of partitions, but we have no idea when that will be. Redpanda is another solution, but untested and a big leap. When we previously talked with @fuziontech about this, with the idea of having a partition per team_id, we realised it's not at all practical. Having even more partitions ("team_id AND window_id") seems impossible.

hazzadous commented 2 years ago

@mariusandra these partitions are not Kafka partitions afaik

mariusandra commented 2 years ago

Where does it store the events then? 😆 Or at least the metadata regarding which event in the last gigabyte of data belongs to which partition? How often is a kafka batch committed? How frequently does this batch to S3?

I get that kafka-connect has a lot of configuration options, but those are wishes, not guarantees. We still can't escape math... I think. Please tell me if I'm wrong, how I'm wrong.

The way I see it, we have this huge unknown question that throws the feasibility of the entire implementation into question, and nobody can tell me how it'll actually work. Instead the approach is "let's spend the next 2 weeks on all the other details of the implementation, and then let's hope the main issue is also resolved".

And, disclaimer, I can be wrong, but:

it's mathematically correct but not how this works

So how does it work? Can anyone please tell me before we invest more time in this? 😅

I scheduled a meeting for tomorrow to discuss this in sync, in case that's needed.

mariusandra commented 2 years ago

@mariusandra do you happen to know what the post-processing requirements would be e.g. in terms of numbers of queries of ClickHouse per second, and volume of data, ClickHouse memory requirements?

I don't have the numbers right now, but basically everything that goes in must go out 🤷 😆. So it'll just select * from each_partition_in_clickhouse, and either group by team_id or group by team_id, window_id or yolo. I'm not sure. In any case we'll have "all the time in the world" to move the data from one place to another.

fuziontech commented 2 years ago

Drive by suggestion 🚗

We could just use ClickHouse to manage this.

I've made a sample table posthog.testing_s3_session_recording_events on CH8 (backup node) that is 99% backed by S3 and partially backed by st1 EBS volume. This is totally configurable.

If you log into metabase and run the following on CH8

select table, disk_name, count(1) parts, sum(bytes_on_disk) bytes_on_disk from system.parts
where table = 'testing_s3_session_recording_events'
group by table, disk_name
order by bytes_on_disk desc;

You can verify that most of the parts for the table reside in s3 with the above query.

This was mainly inspired by Altinity's post https://altinity.com/blog/clickhouse-and-s3-compatible-object-storage, which I think we will want to use for all long-term data retention at some point. Having a tiered storage solution would be really valuable and this seems like a great first step.

The nice thing about this is:

Negatives here:

Go ahead and query this table to see what the performance implications are, but they should be relatively minimal.

At the time of writing we are about 10% complete with the upload to S3, with 550 GB and 4,211 files uploaded. This means we have an average file size of ~130 MB, which is really quite ideal for S3, much better than a file per part of a session recording or even a file per session recording. ClickHouse is really quite efficient at grabbing just the parts of files out of S3 that it needs, since it has a metadata store that describes what it needs to grab out of a specific file on S3 for a query. ClickHouse intelligently uses S3's Range query feature, which grabs just the bits of a file that are needed for the results. 💥

Give the table a try and let me know how performance looks. 👀

mariusandra commented 2 years ago

This is sweet. To understand the dynamics better, what happens on reads over large volumes, or, said differently, will we get charged an integer amount of dollars every time someone opens the session recordings page or makes any query with a filter?

macobo commented 2 years ago

Note that the S3-backed table doesn't do any "session rollups". Meaning that if we query the table, it reads the whole partition (or maybe only a subset) from S3, then constructs the recording in memory inside clickhouse. In essence we're doing what we're doing now, just on a slower disk.

I still maintain that rolling up sessions into one or multiple "blobs" is the way to go for read-performance reasons. That said, I'm willing to be surprised if it turns out this isn't needed at all.

pauldambra commented 2 years ago

I still maintain that rolling up sessions into one or multiple "blobs" is the way to go for read-performance reasons. That said, I'm willing to be surprised if it turns out this isn't needed at all.

On this...

I simulated this by changing the DEFAULT_CHUNK_LIMIT when viewing a recording. https://github.com/PostHog/posthog/pull/10198

We load the pages of recording data serially. So loading a larger blob is faster overall.

But...

We start playing the video as soon as the first blob is available. So it appears quicker to the viewer if the first blob (or first few) is quick to download even if the browser takes longer overall.

So in order to maintain a snappy start to playback of recordings...

Bundling the files to S3 probably needs to slowly increase the size of the blobs over the life of the recording so that we can buffer faster than playback

e.g.

instead of 5 × 10 MB blobs:

2 × 0.5 MB, 2 × 1 MB, 1 × 2 MB, 1 × 5 MB, 4 × 10 MB (still 50 MB in total)

(or whatever works in practice)

pauldambra commented 2 years ago

dumping comments from sync discussion here before thinking about actual implementation steps

Kafka-Connect

Backing CH Tables with S3

Post Processing

Desired end-state

Pros

Needs to have solutions for

mariusandra commented 2 years ago

So... we had the call, and I think the conclusions were these. Correct me if I'm wrong.

Option 3, JamS3.

Conclusion: this could work well, but it won't be faster than what we have now (we must read all chunks to serve the first), and we do need to split the tables into metadata and data.

Option 2, Post-Processing

Conclusion: this is similar to JamS3, but we do more piping on the app side. Since we pre-calculate the chunks and do upfront optimizations, playback will be 🔥.

Option 1, Live Streaming via kafka-connect (partially implemented)

fuziontech commented 2 years ago

This option is very cool, but it's still only half a solution. We do need the metadata in ClickHouse, in a place where we can quickly query it. Otherwise queries like these (basically any listing query) will repeatedly fetch a lot of partitions from S3, causing problems and 💸.

It shouldn't be too bad. That query is only pulling the following columns because ClickHouse is columnar: session_id, window_id, timestamp, distinct_id, has_full_snapshot

Meaning that the actual amount of data pulled from s3 is pretty small (since the actual payloads are not getting pulled).

The table itself is ordered by: team_id, toHour(timestamp), session_id, timestamp, uuid

which means that as long as we are able to hit these 'indexes' the size of the files pulled from S3 can be scoped down to a relatively small portion.

We are partitioned by: PARTITION BY toYYYYMMDD(timestamp)

The number of files that are pulled from s3 can be calculated with this formula (ideally): partitions x columns

I just ran this query as an example:

SELECT session_id,
    any(window_id) as window_id,
    MIN(timestamp) AS start_time,
    MAX(timestamp) AS end_time,
    dateDiff(
        'second',
        toDateTime(MIN(timestamp)),
        toDateTime(MAX(timestamp))
    ) as duration,
    any(distinct_id) as distinct_id,
    SUM(has_full_snapshot) as full_snapshots
FROM testing_s3_session_recording_events
WHERE team_id = 2
    and timestamp >= '2022-05-21'
    and timestamp <= '2022-05-27'
GROUP BY session_id
HAVING full_snapshots > 0

It returned 10,544 rows. Elapsed: 0.658 sec. Processed 387.93 thousand rows, 82.80 MB (589.68 thousand rows/s., 125.85 MB/s.)

The number of files selectively range-read from S3 should be roughly 5 columns × 7 partitions = 35 files.

I'm sure that there is a way we could better format the schema to improve this (metadata and data like @mariusandra suggests). I'd love to hop on a call to talk about this and how ClickHouse is storing this merge tree in S3. This just makes our app a lot simpler and keeps the complexity inside of CH for the orchestration of recording storage.

Edit: I do like this idea:

Conclusion: this is similar to JamS3, but we do more piping on the app side. Since we pre-calculate the chunks and do upfront optimizations, playback will be 🔥.

Is there any way we can re-schema so that the chunks are optimized in CH? This seems ideal either way. Grabbing things from disk or S3.

pauldambra commented 2 years ago

ok, so that suggests a way forward

  1. split the snapshot data and metadata for recordings by adding a new snapshot_data table (I could do with pairing with @hazzadous, as I've remained blissfully ignorant of CH and Kafka infrastructure until now)
  2. keep TTL on both tables
  3. experiment with backing the snapshot_data table onto S3
  4. either that doesn't perform well enough and we continue on to post processing, or it's at least good enough for rock and roll and we can pause and concentrate elsewhere (imagining a world where I work on the team's sprint goals instead of side quests 🤣)
pauldambra commented 2 years ago

I'm treating #10234 as a sketch... Think putting the wrong answer on the internet to shake the right answer loose :)

mariusandra commented 2 years ago

@fuziontech that is really cool. I don't have ssh access to the cluster (too lazy/busy to set it up now), and metabase doesn't report the "MB read" number. However when I removed the team_id clause, it took 5 seconds to give some results, probably reading around a gigabyte in back-of-the-napkin math. This sounds like it'll add up in dollars. We'll eventually have teams recording as many sessions as we do now, combined... so I guess we should still store the metadata locally, and just the recordings in S3.

@pauldambra if you'll keep working on it (which is fine by me), this will basically be your sprint goal, as getting this just right isn't a 1-day project. An MVP split could be 🤔

mariusandra commented 2 years ago

Is there any way we can re-schema so that the chunks are optimized in CH? This seems ideal either way. Grabbing things from disk or S3.

More metadata :party_parrot:

The constraint is that we'll have events (avg 150 per recording) of varying sizes. Some 100 bytes, some 2+MB. However we want stable pagination based on both the recording timestamp and chunk sizes. We also need to know where in those chunks the "keyframes" are to be able to seek anywhere.

Maybe clickhouse could create that metadata itself with a query, and then we make another query to just fetch the relevant chunks... 🤔

macobo commented 2 years ago

@fuziontech you're measuring the wrong query.

So session recording has 3 components:

  1. Ingestion - getting data in
  2. "List all sessions query" - this is the query you were measuring
  3. "Get a single recording query"

It's (3) that needs the most consideration in a "just-store-the-table-in-s3" scenario.

In a store-sessions-in-S3 scenario, (3) won't work well because it reads the column containing 99% of the data (metabase). The normal rules of clickhouse apply here: to get a single recording we will be reading multiple partitions and, due to granularity, much more extra data than we need for that single recording.

Note this will be slower than what we have currently, and what we have is already a problem right now (grafana); if you add the lag of fetching from S3 it will only get slower.

I think when discussing this project we keep missing the important detail: conceptually, a session recording is not the individual events but rather the sum of them. The most efficient way to solve (3) is to have one or more files per recording containing all the events for that recording, which you can directly request/stream on the client side.

We side-stepped that by leveraging clickhouse initially. Rolling up sessions while streaming is kind of hard™ IMO, which is why I'm still in favor of the postprocessing approach where we move whole recordings to S3 in batches after the fact.

pauldambra commented 2 years ago

this basically will be your sprint goal,

team of one 🤣

Not complaining, it's valuable work... Just worried about pulling on other people's time and being a distraction for them


We query snapshots using

        SELECT session_id, window_id, distinct_id, timestamp, snapshot_data
        FROM session_recording_events
        WHERE
            team_id = %(team_id)s
            AND session_id = %(session_id)s
        ORDER BY timestamp

and page in memory...

If we split the snapshots and metadata then that's the only query we need to care about for S3 performance.

So, as is, we'd read all session snapshots from S3 multiple times for each recording. If we can make this "page-able" (definitely a word) then we'd read each bit of snapshot data once

We need the split in order to do post processing... and it'd be great to figure out how to make paging easier anyway.
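
For instance, a minimal sketch of a page-able version of that query (assuming clickhouse-driver, cursoring on timestamp so each chunk of snapshot data is read once; a real version would need a tie-breaker for equal timestamps):

```python
from clickhouse_driver import Client

client = Client(host="localhost")

PAGE_SQL = """
SELECT session_id, window_id, distinct_id, timestamp, snapshot_data
FROM session_recording_events
WHERE team_id = %(team_id)s
  AND session_id = %(session_id)s
  AND timestamp > %(after)s
ORDER BY timestamp
LIMIT %(limit)s
"""


def iter_snapshot_pages(team_id, session_id, limit=100, after="1970-01-01 00:00:00"):
    # Yield the recording one page at a time instead of loading it all into memory.
    while True:
        rows = client.execute(
            PAGE_SQL,
            {"team_id": team_id, "session_id": session_id, "after": after, "limit": limit},
        )
        if not rows:
            return
        yield rows
        after = rows[-1][3]  # last row's timestamp becomes the next cursor
```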


I'd really like to get to a point where we can test the speed of reading from S3. IME (admittedly reading in AWS Lambda not in ECS) it is ridonculously fast. That may not be true in this context but would be good to know one way or another so we don't fix for problems we don't have (or build for affordances we don't have)

pauldambra commented 2 years ago

I'm still in favor of the postprocessing approach

Yep, I think absent discovering the S3 tables are incredible (enough) or uncovering un-fixable (also definitely a word) problems with post-processing then that's the approach now

If for no other reason than we've spent enough time debating it and need to make stuff and see if we discover some real reason to change course now :)

macobo commented 2 years ago

Note this will be slower than what we have currently, and what we have is already a problem right now (grafana); if you add the lag of fetching from S3 it will only get slower.

Potentially confusion-adding note on this, I ran a quick test on the table @fuziontech mentioned:

SELECT
    session_id,
    window_id,
    distinct_id,
    timestamp,
    snapshot_data
FROM
    testing_s3_session_recording_events
WHERE
    team_id = 2
    AND session_id = '180f03251a873b-06dda7934f4b06-34736704-384000-180f03251a92109'
ORDER BY
    timestamp

Against the current table (which should be disk-based, not 100% sure though): 2467 rows in set. Elapsed: 8.706 sec. Processed 773.40 thousand rows, 120.53 MB (88.84 thousand rows/s., 13.84 MB/s.)

Against the S3-backed testing_s3_session_recording_events table: 2467 rows in set. Elapsed: 9.674 sec. Processed 313.12 thousand rows, 88.77 MB (32.37 thousand rows/s., 9.18 MB/s.)

These seem comparable read performance wise, however:

Size query:

```sql
SELECT formatReadableSize(sum(length(snapshot_data)))
FROM testing_s3_session_recording_events
WHERE team_id = 2
  AND session_id = '180f03251a873b-06dda7934f4b06-34736704-384000-180f03251a92109'
```
lancedouglas1 commented 2 years ago

In the interim, is it possible to "just" drop session recordings into us users' own S3 buckets, using session_id-based naming and dated folders, and let us users worry about both pricing and loading recordings? As for events and metadata, that's all secondary for me anyway; I need the recordings archived for compliance-auditing purposes.

mariusandra commented 2 years ago

Hey @lancedouglas1 this is actively being worked on: https://github.com/PostHog/posthog/pull/10799


pauldambra commented 9 months ago

We now store session recordings in S3, offering 30 days of retention (up from 21 days) for free users and 90 days (up from 21) for paid users.

guidoiaquinti commented 9 months ago

Nice work @pauldambra and team! 👏