PostHog / posthog

🦔 PostHog provides open-source product analytics, session recording, feature flagging and A/B testing that you can self-host.
https://posthog.com

fix: large old replay data splitting #23454

Closed pauldambra closed 3 months ago

pauldambra commented 3 months ago

we see a non-zero amount of MessageTooLarge errors

so far, from sampling the data, these all come from old clients that didn't have batching code and would sometimes send a lot of data in one go

around 1 in 5 of them have many hundreds of items to process, and 1 in 25 has tens of thousands of items

we already have code that should be splitting these out into individual events

but clearly it's not working

and we don't really want one API call to generate 10k kafka messages 🙈

so this PR

changes how we check the headroom: we're clearly undercounting, so this might help

i looked at how the data is going to be sent to kafka and tried to copy that, so that we're counting bytes (of a similar byte array) instead of counting characters. JS in the browser uses UTF-16 strings and kafka/python uses UTF-8, so maybe there's some silliness happening here
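To illustrate why counting characters can undercount, here is a minimal sketch, not the code from this PR. The helper names `utf8_bytes` and `utf16_units` are made up for the example, and the sample string is arbitrary. It compares the number of code points in a string with its UTF-16 length (what a JS `length` would report) and its UTF-8 byte length (what ends up in a UTF-8 encoded message body).

```python
def utf8_bytes(text: str) -> int:
    """Number of bytes the string occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def utf16_units(text: str) -> int:
    """Number of UTF-16 code units, which is what JS `string.length` reports."""
    return len(text.encode("utf-16-le")) // 2


# hypothetical sample: one accented letter and one emoji outside the BMP
sample = "héllo 🙈"

print(len(sample))          # Python counts code points
print(utf16_units(sample))  # the emoji needs a surrogate pair in UTF-16
print(utf8_bytes(sample))   # "é" takes 2 bytes and the emoji takes 4 in UTF-8
```

For non-ASCII content the UTF-8 byte count is larger than either character count, so a size check based on characters would leave less real headroom than it appears to.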

splits the list instead of exploding it

the final case in the processing (when the non-full snapshots won't fit into the headroom) sends every item from the list individually

instead now, we keep splitting the list in two and checking the size of each half. in theory this means that in the majority case we'll split into one or two messages, each with many events
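The halving approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `byte_size` and `split_to_fit` are hypothetical names, the size is measured as the UTF-8 length of the JSON-serialised list, and a single item that is still too large is returned on its own because it cannot be split further.

```python
import json
from typing import Any


def byte_size(items: list[Any]) -> int:
    """Assumed size measure: UTF-8 byte length of the JSON-serialised list."""
    return len(json.dumps(items).encode("utf-8"))


def split_to_fit(items: list[Any], headroom: int) -> list[list[Any]]:
    """Halve the list recursively until every chunk fits within `headroom` bytes."""
    if not items:
        return []
    # a chunk that fits, or a single item that cannot be split, is emitted as-is
    if len(items) == 1 or byte_size(items) <= headroom:
        return [items]
    mid = len(items) // 2
    return split_to_fit(items[:mid], headroom) + split_to_fit(items[mid:], headroom)


# hypothetical snapshot items, all the same serialised size
events = [{"n": i, "data": "x" * 10} for i in range(8)]
limit = byte_size(events[:4])  # headroom that fits exactly half of the items

chunks = split_to_fit(events, limit)
print([len(chunk) for chunk in chunks])
```

With this shape, a list that is only slightly over the limit produces two chunks rather than one message per item, which is the behaviour the description is aiming for.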

pauldambra commented 3 months ago

removes the test here to get the PR moving instead of waiting for https://github.com/PostHog/posthog/pull/23466