macobo commented 3 years ago

Is your feature request related to a problem?

We have had some reports on session recordings being incomplete/misleading/having white screens. https://github.com/PostHog/posthog/issues/2457 https://github.com/PostHog/posthog/issues/2365 https://github.com/PostHog/posthog/issues/2459

We've already made some fixes and some workarounds: https://github.com/PostHog/posthog-js/blob/master/CHANGELOG.md https://github.com/PostHog/posthog/pull/2547

These seem all to be caused by us not receiving all of the events from posthog-js. This can be caused by: a. rrweb (and our integration with it) not correctly capturing these events b. sending these events timing out or otherwise failing on client-side (most suspect right now) c. processing these events failing inside our APIs due to payload sizes or other

Some of this is inevitable - if user leaves the page before we manage to send data from posthog-js then it's natural these sessions are cut short.

Describe the solution you'd like

1) Track (and keep tracking) internal reliability metrics on posthog-js side (https://github.com/PostHog/posthog-js/pull/136). 2) Dogfooding: Make this data nicely presentable within app.posthog.com - currently these graphs would end up being unreadable.

Ideally these should inform us of where the issue is and help us track progress on our end as we implement future fixes such as retries and so on.

Describe alternatives you've considered

Waiting for go-based API endpoint to rule out c).

Additional context

Thank you for your feature request – we love each and every one!

macobo commented 3 years ago

I've posted two summaries into slack previously, gonna put them here as well for posterity:

First learnings on session recording reliability from stats from our sites https://app.posthog.com/dashboard/1240:

Posthog-js made 5671 requests 1-2 hours ago. This is from 66 different page loads. Of those, 5629 got a successful 200 (99.26%) Of the 42 failures: 8 were 'http status 0', rest were (assumption) in flight and likely cancelled after unload. Status of 66 /e/ requests + 16 /s/ requests is unclear - they were sent on the way with sendbeacon which is more likely to fail than normal requests. Those requests contained a total of 15236 .capture() calls - of which 14348 were session recording (!) - 94%. 5xx status does not seem to be a common issue: last hour 20 requests failed with status 0 (?), 24 were in flight during unload and only 1 failed with a 5xx (only failure during the whole day).

Learning points: Need to figure out what causes http status 0. At first glance it seems like something retries would be effective at solving. The number of in flight requests is also troubling - it would be interesting to know how much of this is the users network, how much is it our API servers. We could also implement retries across pageloads to increase odds of requests reaching us, though this bloats the library significantly. Session recording increases load per site significantly. At the moment I think a 10x-15x increase in requests a reasonable ballpark when doing calculations. This will likely start playing into pricing as well and bumps the priority for being able to capture snapshot of the data/storage in S3. Doing more frequent requests for session recording seems to be correct. It might be worth it to bump the frequency of normal capture call send frequencies - though I think those numbers are affected by immediate bounces.

The next day:

Yep. Data is ticking in again with the enhanced analytics on posthog-js. https://app.posthog.com/dashboard/1240

What I've learned: Dropping of full snapshot events is definitely visible from client side - so they're not reaching our servers. Majority of full snapshots are dropped because the request has started but not finished when the page is unloaded. AVG size of full snapshot event (for us) is 330kb and it times out at 3-5s after the request is made (either due to http status 0 or unload) HTTP status 0 is likely caused by combination of client-side request cancellation and /e/ endpoint fails/timeouts. When /e/ fails the client is getting a CORS error (which translates to status 0) since we only set the right headers on a successful response. We also drop /e/ events, but these are taking 'just' 300ms on average while being 1kb in size.

Open questions: Why are requests taking that long? Is 330kb too large to send for a full snapshot? Or are our API servers/load balancer just taking too long or geographically far from clients? The go-based API endpoints could answer this question. I'm currently not measuring average response times/p95 on client-side as it's slightly too heavy for posthog-js. It would be great to have access to these numbers though, but impossible/hard without killing data budgets with our existing capture&data analysis tooling.

Notes: Given that the main bottleneck is full snapshots, I don't think implementing retries> will have any significant effect on it. One tool we have is adding strict xhr timeouts on client-side as a forcing function.

macobo commented 3 years ago

Since then we found at least one of the true bottlenecks: lzstring in python is slow enough to cause large events to time out.

Next actions

I'd like to try an experiment where we don't compress the json data at all for our sites (app.posthog.com/posthog.com). Two PRs for this: https://github.com/PostHog/posthog/pull/2648 https://github.com/PostHog/posthog-js/pull/144

Future

If this removes the failures we're seeing at https://app.posthog.com/dashboard/1240, we can discuss switching or dropping compression schemes or optimizing the python lzstring implementation. Any suggestions for light-weight browser compression libraries welcome!

Fullsnapshot events are pretty massive though - after this experiment I'll also experiment with some tooling new to rrweb since we integrated it.

LZString benchmarks

I took 4 sample requests from my network panel and generated 2 more (26kb in size + 2.5mb in size) and did some measurements.

On posthog.com the average /e/ with lzstring compression is ~500b, large session recording batch ~2kb, large update (switching from homepage to docs) is 9kb and full snapshot event ~370kb.

Time

Measurements were done using my lenovo laptop using ipython %timeit command.

424b: 3.03 ms ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
520b: 3.49 ms ± 28.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2kb: 18.7 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9kb: 66.7 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
26kb: 188 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
372kb: 3.23 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2511kb: 22.4 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compression ratios

I've excluded the two artifically generated payloads (as these would compress well):

424b: ratio=1.103932584269663 (original=0.34765625kb, compressed=0.3837890625kb)
520b: ratio=1.1333333333333333 (original=0.4248046875kb, compressed=0.4814453125kb)
2kb: ratio=0.7630739795918368 (original=3.0625kb, compressed=2.3369140625kb)
9kb: ratio=0.3572684298941282 (original=24.9970703125kb, compressed=8.9306640625kb)
372kb: ratio=0.6416829491564855 (original=579.2705078125kb, compressed=371.7080078125kb)

Unscientific takeaway: For small requests (which are the majority of requests to /e/) compression can hurt, but other than that compressing the data has a great effect.

mariusandra commented 3 years ago

Is the lzstring JS implementation equally slow compressing the payloads? If posthog makes a weaker computer's browser freeze while sending the payload, then we should change it no matter how slow the python implementation is.

macobo commented 3 years ago

I have yet to do experiments on that regard.

As mentioned above, I'd like to have verification first on if this solves the issue for our team first (see the linked PRs which do not affect any other clients). Then if the hypothesis is proven focus on choosing the best alternative for the future, if not - back to the drawing board. :)

Will keep this thread updated on that progress.

macobo commented 3 years ago

No updates on skipping lzstring yet - data is still ticking in. However I did some measurements on browser-side compression algorithms.

Benchmarks

rrweb now (optionally) ships with pako.js based deflate algorithm built in, so I used that to compare against lzstring.

Raw data/notes can be found in this gist

compression ratios (compressed size / uncompressed size)

This was tested using posthog.com and seeing real content-sizes for fair comparisons. branch

Compression ratios when compressing events in batches (smaller = better):		pako.deflate	lzstring
session recording	0.18-0.25	0.45-0.55
normal events	0.11-0.15	0.5

Increasing batch frequency did not significantly change batch ratios.

Compressing individual events yielded a overall compression ratio of around 0.25 for session recording. This has a nice property in that no server-side changes are needed as the decoding can be done by the client/server when serving a recording.

Takeaway: If we can get away with it, using pako.deflate is 2-2.5x better than lzstring for compression ratios

The above was slightly wrong as it did not account for compression of the whole event body - only rrweb parts. In reality this would not work.

lzstring vs pako.deflate speed

Did some measurements on my computer via https://jsbench.me/, sadly cannot link due to sites size limitations.

payload size	pako.deflate	lzstring.compressToBase64
500kb	16.39ops/s +- 8.56%	3.98 ops/s +- 2.8%
3kb	1942.16 ops/s +- 3.75%	165.39 ops/s +- 16.5%
300b	2476.65 ops/s +- 6.8%	2830.78 ops/s +- 11.84%

Takeaway: pako.deflate is consistently many times faster with larger payloads and the speed is comparable

Note: For some reason our bundled version of lzstring is somewhat slower than the one on cdnjs - did not dig into why.

Package sizes

minified version of pako.deflate is 25kb (!). LZString is much smaller, marius has floated the figure 5kb in the past.

Takeaway: If we go for pako, we might want to only include it when session recording is enabled and pull it together with recording.js.

macobo commented 3 years ago

Last time I started the following experiment:

Does replacing lzstring with no encoding reduce the rate of full snapshots that fail to be delivered?

What we know from this?

From initial measurements it seems not quite. Will do a more through pass on monday, but we're still seeing failures http status 0 errors which take several seconds to do.

This is slightly surprising since livesession and co don't compress the payloads they send in obvious ways (besides making json keys obfuscated)

This means that even if we have removed the obvious bottleneck on the server side, we're still seeing timeouts. This might be confounded by the aws migration that @fuziontech did, but grafana stats look relatively similar to from before.

On my own computer I'm seeing 1s+ requests for full snapshot events and 300-400ms requests for normal (tiny) events - it still might be something going wrong on the serveri-side. In this case, increasing the batching threshold for session recording might help.

Next steps

Given this, I'm proposing:

benchmarking requests for full snapshot events with hey - it's possible something is still weird with our server-side setup
shipping a beta version, which includes pako.deflate to deflate rrweb events and
using that to compress individual events for our team only (all other clients unaffected)
doing the decompression in the frontend (in a way that only affects our data, easy)
doing measurements on how it affects failed request rates, request sizes and durations

This means that even if we don't end up using pako long-term (either due to not solving the problem, package size or lzstring being more efficient in node), we will only break some sessions for our own app by rolling back.

Notes on tooling

I'm severely lacking good tooling to do some of this:

On posthog-js side, I want to avoid flooding posthog with a ton of events, but it makes it hard to measure mean/p90/p99 request durations for session recording without introducing a lot of complexity. Oh well.
On app.posthog.com I've stumbled upon a lot of pain points doing the analysis (see the linked tickets). Of these, https://github.com/PostHog/posthog/issues/2587 and https://github.com/PostHog/posthog/issues/2620 are most painful, I might prioritize fixing these as well.

macobo commented 3 years ago

Since last update:

Benchmarking our endpoints

Created a k6 benchmark of our /s endpoint in production. Metrics in this gist

Takeaways:

Sending 500kb of data is slooow even on my semi-decent connection, so compressing the payloads will help for sure
Not sure whether my upload was capped on my end or on the server.
Given the high baseline http_req_waiting - I think plugin server-based ingestion OR go api ingestion will help a lot

Using rrweb.pack

Realized my metrics in previous post were wrong. Compressing individual events wouldn't be a huge win as most of the server payload would be our normal object keys.

Adding this scheme would also break this pr as snapshot type would not be queriable on the server if don't also add decompression support on the server.

Conclusions / next steps

All-in-all it seems getting session recording more reliable requires:

Making requests smaller - deflate achieves that
Making our ingestion API endpoints faster to avoid client-side timeouts.

Achieving 1 is easy - we just add deflate support to the server. However 2 is trickier since even plain python endpoints are slow and decompression is bound to slow things down further. It might be achieved a variety of ways:

Go-based ingestion endpoints
plugin-server also handles /e/ and /s/ endpoint ingestion

~~Gonna put this on pause until I have a better understanding of what's going on with plugins/chat with Tim.~~

We'll ship deflate compression within recorder.js as part of this iteration within the python server.

timgl commented 3 years ago

Based on discussion, we'll move forward with pako.deflate, just for recorder.js. The tradeoff of extra library size is offset by the extra compression we'll get, as outbound upload speed seems to be the slowest piece. For now, we'll keep LZstring compression for normal events too.

macobo commented 3 years ago

We've since implemented a gzip-based compression scheme which has been running for a couple of days for app.posthog.com.

The resulting payloads are 2-3x smaller than with lz64 on app.posthog.com. It's hard to get good numbers on anything else though due to instability in the app bleeding into these stats.

There's no new errors on the backend so we'll roll this out to all users (who use array.js or update npm) tomorrow. :)

We should put focus into making our endpoint fast as well though, the p99 metrics are pretty scary and are leading to some data loss.

macobo commented 3 years ago

Before and after rolling gzip support out to everyone on app.posthog.com:

Screenshot from 2020-12-11 18-00-24

Yuyz0112 commented 3 years ago

Are you interested in trying https://github.com/101arrowz/fflate as the client-side compression lib?

macobo commented 3 years ago

We do now! https://github.com/PostHog/posthog-js/blob/master/CHANGELOG.md

PostHog / posthog

Track session recording internal metrics #2560

Is your feature request related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Thank you for your feature request – we love each and every one!

Next actions

Future

LZString benchmarks

Time

Compression ratios

Benchmarks

compression ratios (compressed size / uncompressed size)

lzstring vs pako.deflate speed

Package sizes

What we know from this?

Next steps

Notes on tooling

Benchmarking our endpoints

Using rrweb.pack

Conclusions / next steps