Closed pauldambra closed 12 hours ago
Your pull request is modifying functions with the following pre-existing issues:
File: posthog/api/capture.py

| Function | Unhandled Issue |
|---|---|
| get_event | UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 18016-18017: illegal UTF-16 surrogate ... Event Count: 4 |
we have a 10MB limit on messages in the replay kafka topic and we have gzip compression enabled on that topic
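for reference, topic-level limits and compression like this are usually set with something along these lines (the topic name and exact values here are illustrative, not our real config):

```shell
# illustrative only: cap messages at ~10MB and enable gzip on a topic
# (topic name and bootstrap server are assumptions, not our real setup)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name session_recording_events \
  --add-config max.message.bytes=10485760,compression.type=gzip
```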
we did this to offload compression to MSK (probably.... it was a while ago) but the gzip compression is actually done in the producer, so nothing is offloaded to MSK
we thought something along the lines of "the 10MB limit applies after compression, so we're allowing folk to send us ~100MB of uncompressed data in one message, and that won't happen"
but
since the producer gzips the data and might batch messages before compression, kafka checks (i think, based on some googling) the message size limit before compression - so we're not allowing 100MB, we're allowing 10MB
this seems to be true in practice, since files that are rejected at 10MB are 10MB uncompressed
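a quick way to sanity-check the arithmetic (the payload here is fabricated, just to show that it's the pre-compression size that hits the limit):

```python
import gzip
import json

# fabricated payload standing in for a full snapshot + inlined css:
# highly repetitive, so it compresses very well
snapshot = {"type": "full_snapshot", "css": "body { margin: 0; }" * 200_000}
raw = json.dumps(snapshot).encode("utf-8")
compressed = gzip.compress(raw)

# the producer-side size check happens before compression,
# so it's the raw size that has to fit under the ~10MB cap
print(f"raw: {len(raw):,} bytes, gzipped: {len(compressed):,} bytes")
```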
and
replay inlines css files, so ~30 times a day a posthog.com full snapshot + its css goes over this limit and gets dropped
so, it's not massively unusual for us to see messages >10MB
ok, so what?
we purposefully stopped splitting individual items ("chunking") to fit them into kafka because it made the already very stateful mr blobby even more stateful
i really really don't want to go back to arbitrary chunking of replay messages
really really
really
this pr
we'll probably get slightly worse compression overall since we'll now always be compressing individual messages instead of letting the producer (potentially) compress across several messages
but from the perspective of a 10.1MB uncompressed message that would have been dropped, this is infinitely better
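the shape of the change, sketched with made-up names (these helpers and the event shape are hypothetical, not the real capture code):

```python
import gzip
import json

def serialize_replay_event(event: dict) -> bytes:
    """Gzip each message individually before it reaches the producer.

    This trades a little compression ratio (no cross-message batching)
    for a producer-side size check that now sees much smaller payloads.
    """
    return gzip.compress(json.dumps(event).encode("utf-8"))

def deserialize_replay_event(payload: bytes) -> dict:
    return json.loads(gzip.decompress(payload).decode("utf-8"))

# round-trip check on a fabricated event
event = {"distinct_id": "abc", "snapshot_data": "x" * 10_000}
assert deserialize_replay_event(serialize_replay_event(event)) == event
```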
in deciding what to do here i tested every possible option (all three of them)
* i didn't write a protobuf schema, maybe i should have, but that felt like complexity i'd like to avoid
** average of operating with python on ~30 example files on my M3 MBP while running a tonne of electron apps and pycharm and a bunch of other things - so treat the timings as representative comparisons, not predictions
tested gathering and playing recordings with the setting on and off
even though gzip is a chunk slower than msgpack, the savings are so much larger that it's worth it, especially since the instance is already spending the time running this compression
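msgpack is third-party so it isn't shown here, but the gzip side of that trade-off is easy to see on css-like input (the payload is fabricated):

```python
import gzip
import json

# repetitive css-ish content, like an inlined stylesheet in a snapshot
payload = json.dumps({"css": ".btn { color: #fff; padding: 4px; }" * 50_000})
raw = payload.encode("utf-8")
gzipped = gzip.compress(raw)

ratio = len(raw) / len(gzipped)
print(f"{len(raw):,} -> {len(gzipped):,} bytes (~{ratio:.0f}x smaller)")
```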
things i didn't do
start messing around with the kafka client to alter its behavior, which i probably could do but feels like an excellent way to confuse everyone