PostHog / posthog

πŸ¦” PostHog provides open-source product analytics, session recording, feature flagging and A/B testing that you can self-host.
https://posthog.com

feat: using gzip by hand in the replay pipeline #23479

Closed: pauldambra closed this 12 hours ago

pauldambra commented 4 days ago

we have a 10MB limit on messages in the replay kafka topic and we have gzip compression enabled on that topic

we did this to offload compression to MSK (probably... it was a while ago πŸ™ˆ), but the gzip compression is actually done in the producer, so nothing is offloaded to MSK

we thought something along the lines of "10MB of data compresses to about 1MB, so the 10MB limit effectively lets folks send us ~100MB of data in one message, and that won't happen"

but

since the producer gzips the data, and might batch messages together before compressing them, kafka checks the message size limit before compression (i think, based on some googling), so we're not actually allowing 100MB, we're allowing 10MB of uncompressed data

this seems to be true, since the messages that get rejected are already ~10MB before compression

and

replay inlines css files, so ~30 times a day a posthog.com full snapshot + its css goes over this limit and gets dropped

so, it's not massively unusual for us to see messages >10MB
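to make the above concrete, here's a minimal sketch (not code from this PR) of a producer configured the way the topic effectively behaves today; the topic name is made up, and kafka-python is used purely for illustration:

```python
# a minimal sketch, assuming kafka-python: with compression_type="gzip" the compression
# happens inside the producer (not on MSK), but the max_request_size check in send() is
# applied to the serialized, *uncompressed* payload, so a 10.1MB uncompressed message is
# rejected even if it would gzip down to ~1MB.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    compression_type="gzip",             # compression is done client-side
    max_request_size=10 * 1024 * 1024,   # ~10MB, checked before compression
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# "session_recording_events" is a hypothetical topic name for this sketch
producer.send("session_recording_events", {"snapshot": "..."})
producer.flush()
```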


ok, so what?

we purposefully stopped splitting individual items ("chunking") to fit them into kafka, because it made the already very stateful mr blobby even more stateful

i really really don't want to go back to arbitrary chunking of replay messages

really really

really


this pr

we'll probably get slightly worse compression overall since we'll now always be compressing individual messages instead of letting the producer (potentially) compress across several messages

but from the perspective of a 10.1MB uncompressed message that would have been dropped, this is infinitely better
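a rough sketch of the "gzip by hand" idea, assuming the replay payload is a JSON-serializable dict; `compress_replay_payload` and the size constant are illustrative names, not the PR's actual code:

```python
import gzip
import json

KAFKA_MAX_MESSAGE_BYTES = 10 * 1024 * 1024  # the topic's 10MB limit


def compress_replay_payload(payload: dict) -> bytes:
    """Gzip the serialized payload ourselves, before it reaches the producer,
    so the size check sees the ~1MB compressed bytes rather than the 10MB+
    uncompressed JSON."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    compressed = gzip.compress(raw)
    if len(compressed) > KAFKA_MAX_MESSAGE_BYTES:
        # still too big even after compression: handle/drop as before
        raise ValueError(f"compressed replay payload is {len(compressed)} bytes")
    return compressed
```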


in deciding what to do here i tested every possible option (all three of them)

| option | average speed | average size reduction |
| --- | --- | --- |
| protobuf | didn't check | slightly bigger* |
| msgpack | 0.13s for ~10MB** | 20% smaller |
| gzip | 0.6s for ~10MB** | 90% smaller |

\* i didn't write a protobuf schema, maybe i should have, but that felt like complexity i'd like to avoid

\*\* average of operating with python on ~30 example files on my M3 MBP while running a tonne of electron apps and pycharm and a bunch of other things, so treat the timings as representative comparisons, not predictions
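for reference, the numbers above came from this kind of ad-hoc measurement; the script below is an illustrative reconstruction of the approach, not the exact benchmark that produced the table:

```python
import gzip
import json
import time

import msgpack  # pip install msgpack


def measure(path: str) -> None:
    """Compare msgpack vs gzip on one example replay file: output size and time taken."""
    with open(path, "rb") as f:
        raw = f.read()
    payload = json.loads(raw)

    start = time.perf_counter()
    packed = msgpack.packb(payload)
    msgpack_secs = time.perf_counter() - start

    start = time.perf_counter()
    gzipped = gzip.compress(raw)
    gzip_secs = time.perf_counter() - start

    print(
        f"{path}: raw={len(raw)}B "
        f"msgpack={len(packed)}B ({msgpack_secs:.2f}s) "
        f"gzip={len(gzipped)}B ({gzip_secs:.2f}s)"
    )
```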

tested gathering and playing recordings with the setting on and off

even though gzip is a chunk slower than msgpack, the savings are so much larger that it's worth it, especially since the instance is already spending the time running this compression anyway
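testing with the setting on and off implies the read path has to accept both forms; a sketch of how a consumer can tell them apart using gzip's two magic bytes (names here are illustrative, not the PR's actual code):

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of any gzip stream


def decode_replay_message(value: bytes) -> dict:
    """Accept both plain-JSON (setting off) and hand-gzipped JSON (setting on) messages."""
    if value[:2] == GZIP_MAGIC:
        value = gzip.decompress(value)
    return json.loads(value)
```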

things i didn't do

start messing around with the kafka client to alter its behavior which i probably could do but feels like an excellent way to confuse everyone

sentry-io[bot] commented 4 days ago

πŸ” Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

πŸ“„ File: posthog/api/capture.py

| Function | Unhandled Issue |
| --- | --- |
| `get_event` | UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 18016-18017: illegal UTF-16 surrogate ... Event Count: 4 |

Did you find this useful? React with a πŸ‘ or πŸ‘Ž