PostHog / posthog.com

Official docs, website, and handbook for PostHog.
https://posthog.com

Blog: How we built upgraded session replay storage (to blobby) #8732


ivanagas commented 2 weeks ago

Summary

Write a short paragraph on what this article is about. If applicable, what's the opinion or point we want to make in this article?

Based on Ben’s notes on Mr. Blobby, write a blog post about our migration of session replay data from ClickHouse to S3 and the creation of Mr. Blobby.

Where will it be published?

select any that apply

  • [x] Blog
  • [ ] Founders Hub
  • [ ] Newsletter
  • [ ] Product engineers Hub
  • [ ] Tutorials
  • [ ] Other (please specify)

What type of article is this?

select any that apply

  • [ ] High intent (i.e. comparisons and similar)
  • [x] Brand / opinionated (how we work and why, etc.)
  • [ ] High-level guide (concepts, frameworks, ideas, etc.)
  • [ ] Low-level guide (step-by-step guide / tutorial)
  • [ ] Other (please specify)

Who is the primary audience?

select any that apply

  • [ ] Founders
  • [x] Engineers
  • [ ] Growth
  • [ ] Marketing
  • [x] HackerNews
  • [ ] Existing PostHog users
  • [x] Potential PostHog users

Headline options

suggest a few angles

How migrating replay data to S3 saved our life

How we saved $50k/month moving replay data to S3

How we solved the write vs. store cost challenge for our massive volume of replay data

Storing more, writing less: How moving replay data to S3 saved us $50k/month

Will it need custom art?

Outline (optional)

draft headings / questions you want to answer

  • We moved from ClickHouse-backed session replays to S3-backed ones
  • Problem 1: Store and query
    • ClickHouse is good at writing (batching), but not at storing this type of data.
    • We try to use ClickHouse for everything, and our old version of session replay used it too.
    • It was very slow to load blob-like data; that’s not its intended use case.
    • Also, 3 weeks of replay data took up more space than all of our other data combined.
  • Solution (AKA problem 2): Writing somewhere else
    • Move it somewhere else, obviously.
    • We want to write many small packets and store a lot of content.
    • This makes replay data a bad fit for both blob-style storage and traditional databases.
  • Real solution: Buffering
    • Our SDKs batch session replay events to keep the number of packets sent to a minimum.
    • Buffer data on disk and write to blob storage once a threshold has passed.
    • This reduces write costs and lets us benefit from cheap S3 storage.
  • Architecture of the real solution: Mr. Blobby (see the first sketch after this list)
    • Buffer incoming data to disk using Node.js streams.
    • Group it by session, finding or creating a SessionManager.
    • Add data to the SessionManager write stream.
    • Stream to gzip.
    • Decide what needs to be flushed using Kafka age, real-time age, and size.
    • Flush to S3 via Kafka(?)
  • Querying and using this data (see the second sketch after this list)
    • The data is the same, and loading from S3 isn’t too much of a change.
    • Store some metadata in ClickHouse to enable quick recording queries and joins with persons or events.
    • For in-flight sessions, people expect relatively real-time viewing, so we use Redis.
    • Buffer an uncompressed version to disk.
    • The web app publishes a Redis message to request a real-time session.
    • Consumers receive the event and begin replicating.
  • Benefits
    • This saved us $30-50k/month, improved ClickHouse health, and improved loading speed.
    • Along with the usability benefits for users, it also lets us pass savings along in the form of better filters and longer-term storage.
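
For the architecture section, a minimal sketch of the buffer-and-flush idea might help the article land. This is not the actual Mr. Blobby code: the class name, thresholds, bucket, and key format below are all illustrative assumptions. Events for a session stream through gzip to a file on disk, and the file is shipped to S3 once an age or size threshold is crossed.

```ts
// Sketch only: buffer replay events per session on disk, flush to S3 on threshold.
import { createWriteStream, promises as fs, WriteStream } from "node:fs";
import { createGzip, Gzip } from "node:zlib";
import { finished } from "node:stream/promises";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const MAX_AGE_MS = 60_000;         // flush if the buffer has been open this long
const MAX_SIZE_BYTES = 50_000_000; // ...or has grown this large

class SessionBuffer {
  private gzip: Gzip = createGzip();
  private file: WriteStream;
  private bytesWritten = 0;
  private readonly createdAt = Date.now();

  constructor(private sessionId: string, private path: string) {
    this.file = createWriteStream(path);
    this.gzip.pipe(this.file); // stream events straight through gzip to disk
  }

  add(event: object): void {
    const line = JSON.stringify(event) + "\n";
    this.bytesWritten += Buffer.byteLength(line);
    this.gzip.write(line);
  }

  // The real system also weighs Kafka message age; this only checks buffer age and size.
  shouldFlush(): boolean {
    return Date.now() - this.createdAt > MAX_AGE_MS || this.bytesWritten > MAX_SIZE_BYTES;
  }

  async flush(s3: S3Client, bucket: string): Promise<void> {
    this.gzip.end();
    await finished(this.file); // wait until everything is on disk
    await s3.send(
      new PutObjectCommand({
        Bucket: bucket,
        Key: `session_recordings/${this.sessionId}.jsonl.gz`,
        Body: await fs.readFile(this.path),
      })
    );
    await fs.unlink(this.path); // clean up the local buffer file
  }
}

// Grouping by session: find or create a buffer for each incoming event.
const buffers = new Map<string, SessionBuffer>();

export async function handleEvent(sessionId: string, event: object, s3: S3Client): Promise<void> {
  let buffer = buffers.get(sessionId);
  if (!buffer) {
    buffer = new SessionBuffer(sessionId, `/tmp/replay-${sessionId}.jsonl.gz`);
    buffers.set(sessionId, buffer);
  }
  buffer.add(event);
  if (buffer.shouldFlush()) {
    await buffer.flush(s3, "replay-blobs"); // "replay-blobs" is a made-up bucket name
    buffers.delete(sessionId);
  }
}
```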
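
For the querying section, a hedged sketch of the realtime handshake the outline describes: the web app publishes a Redis message asking for a session, and the consumer buffering that session starts copying its events into Redis so the player can view them before anything reaches S3. Channel and key names here are made up.

```ts
// Sketch only: request/serve in-flight sessions over Redis pub/sub.
import Redis from "ioredis";

const pub = new Redis();
const sub = new Redis(); // ioredis needs a dedicated connection for subscribing

// In-memory view of the sessions this consumer is currently buffering.
const activeSessions = new Map<string, object[]>();

// Web app side: ask whichever consumer owns this session to start replicating.
export async function requestRealtimeSession(sessionId: string): Promise<void> {
  await pub.publish("realtime-subscriptions", sessionId);
}

// Consumer side: listen for requests and mirror buffered events into a
// short-lived Redis list so playback stays roughly real time.
export async function startRealtimeConsumer(): Promise<void> {
  await sub.subscribe("realtime-subscriptions");
  sub.on("message", async (_channel, sessionId) => {
    const events = activeSessions.get(sessionId);
    if (!events) return; // another consumer owns this session
    for (const event of events) {
      await pub.rpush(`realtime:${sessionId}`, JSON.stringify(event));
    }
    await pub.expire(`realtime:${sessionId}`, 300);
  });
}
```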
pauldambra commented 2 weeks ago

I think an additional useful angle is the challenge of shipping multiple times per day while the whole thing was running in production

we did a lot of

So that we could run ingestion without anything breaking and then switch playback backwards and forwards using flags
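
For that flag angle, a rough illustration of what flag-gated playback could look like: ingestion writes to the new storage, while reads are switched back and forth per flag without a deploy. The flag name, loader functions, and flag check below are stand-ins, not the actual implementation.

```ts
// Illustrative only: pick the playback backend per request based on a feature flag.
type ReplayEvents = unknown[];

async function loadFromClickHouse(sessionId: string): Promise<ReplayEvents> {
  return []; // old path: read events from the ClickHouse-backed store
}

async function loadFromBlobStorage(sessionId: string): Promise<ReplayEvents> {
  return []; // new path: fetch and decompress the session's blobs from S3
}

async function isFlagEnabled(flag: string, teamId: number): Promise<boolean> {
  return false; // stand-in for the real feature flag service
}

export async function loadRecording(sessionId: string, teamId: number): Promise<ReplayEvents> {
  const useBlobStorage = await isFlagEnabled("session-replay-blob-playback", teamId);
  return useBlobStorage ? loadFromBlobStorage(sessionId) : loadFromClickHouse(sessionId);
}
```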

How Mr Blobby strangled our largest ClickHouse table