Seems like this was an issue with curl rather than event capture - I set up a test suite (code below for reference) and the problem did not occur when the request was made by the browser.
Something else is (still) going wrong for that company, though - not sure why session recording is not working for them.
/// <reference types="cypress" />

describe('Event capture', () => {
    given('options', () => ({ capture_pageview: false }))
    given('sessionRecording', () => false)
    given('supportedCompression', () => ['gzip', 'lz64'])

    // :TRICKY: Use a custom start command over beforeEach to deal with given2 not being ready yet.
    const start = () => {
        cy.route({
            method: 'POST',
            url: '**/decide/*',
            response: {
                config: { enable_collect_everything: true },
                editorParams: {},
                featureFlags: ['session-recording-player'],
                isAuthenticated: false,
                supportedCompression: ['gzip-js', 'lz64'],
            },
        }).as('decide')

        cy.visit('./playground/cypress')
        cy.posthogInit(given.options)
    }

    it('works', () => {
        start()

        cy.readFile('../posthog/string.json').then((body) => {
            console.log(body)
            cy.posthog().then(($posthog) => {
                $posthog._handle_queued_event('http://localhost:8000/s', body)
                console.log($posthog._handle_queued_event)
            })
        })
    })
})
Managed to reproduce this locally by:
Similar to above, normal events show up in the db but full_snapshot ones do not.
Digging in, I saved the full snapshot events to a file:
if len(data["snapshot_data"]) > 1000000:
    open('dump.json', 'w').write(json.dumps(data))
And attempted to run this via a python shell:
from ee.clickhouse.sql.session_recording_events import INSERT_SESSION_RECORDING_EVENT_SQL
from ee.kafka_client.client import ClickhouseProducer
from ee.kafka_client.topics import KAFKA_SESSION_RECORDING_EVENTS
from datetime import datetime
import json

# Load the dumped snapshot event and produce it through the same path the app uses
data = json.loads(open('dump.json').read())
p = ClickhouseProducer()
produce = lambda: p.produce(sql=INSERT_SESSION_RECORDING_EVENT_SQL, topic=KAFKA_SESSION_RECORDING_EVENTS, data=data)

# produce() returns nothing, so this also prints the current time (useful when grepping the logs below)
produce() or datetime.now()
Nothing interesting shows up after this in clickhouse logs: docker cp posthog_clickhouse_1:/var/log/clickhouse-server/clickhouse-server.log chserver.log; less chserver.log
I think this narrows things down to either kafka or clickhouse. For this site, the raw json message is 3962402 bytes (nearly 4 MB). Might we be running into kafka message size limitations?
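For reference, a quick sanity check against the defaults (a minimal sketch, assuming the dump.json from above; 1048576 is kafka-python's stock producer max_request_size, and the broker default message.max.bytes is roughly the same):

import json

# kafka-python's KafkaProducer defaults max_request_size to 1 MiB, and the broker's
# default message.max.bytes is in the same ballpark, so a ~4 MB message would be rejected.
PRODUCER_DEFAULT_MAX_REQUEST_SIZE = 1048576

data = json.loads(open('dump.json').read())
payload_size = len(json.dumps(data).encode('utf-8'))
print(payload_size, payload_size > PRODUCER_DEFAULT_MAX_REQUEST_SIZE)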
As per https://stackoverflow.com/questions/21020347/how-can-i-send-large-messages-with-kafka-over-15mb I made a couple of changes locally:
diff --git a/ee/docker-compose.ch.yml b/ee/docker-compose.ch.yml
index e70a5aed..b2dfdf59 100644
--- a/ee/docker-compose.ch.yml
+++ b/ee/docker-compose.ch.yml
@@ -106,3 +106,6 @@ services:
     environment:
         KAFKA_ADVERTISED_HOST_NAME: kafka
         KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
+        KAFKA_MESSAGE_MAX_BYTES: 1000048588
+        KAFKA_REPLICA_FETCH_MAX_BYTES: 1000048588
+        KAFKA_FETCH_MESSAGE_MAX_BYTES: 1000048588
diff --git a/ee/kafka_client/client.py b/ee/kafka_client/client.py
index 7c7f9aed..fad6a08e 100644
--- a/ee/kafka_client/client.py
+++ b/ee/kafka_client/client.py
@@ -34,7 +34,7 @@ class _KafkaProducer:
         elif KAFKA_BASE64_KEYS:
             self.producer = helper.get_kafka_producer(value_serializer=lambda d: d)
         else:
-            self.producer = KP(bootstrap_servers=KAFKA_HOSTS)
+            self.producer = KP(bootstrap_servers=KAFKA_HOSTS, max_request_size=1000048588)

     @staticmethod
     def json_serializer(d):
After this, session recording worked.
@fuziontech mind bumping these in the production kafka cluster and in our producers?
Tried out sending the raw insert statement to clickhouse as suggested by @fuziontech. This however runs into max query size limits:
0. DB::parseQueryAndMovePosition(DB::IParser&, char const*&, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, unsigned long, unsigned long) @ 0xe938c39 in /usr/bin/clickhouse
1. ? @ 0xdeca297 in /usr/bin/clickhouse
2. DB::executeQuery(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, DB::Context&, bool, DB::QueryProcessingStage::Enum, bool) @ 0xdec9e9d in /usr/bin/clickhouse
3. DB::TCPHandler::runImpl() @ 0xe56a3b6 in /usr/bin/clickhouse
4. DB::TCPHandler::run() @ 0xe5771b7 in /usr/bin/clickhouse
5. Poco::Net::TCPServerConnection::start() @ 0x10d4f71f in /usr/bin/clickhouse
6. Poco::Net::TCPServerDispatcher::run() @ 0x10d5112e in /usr/bin/clickhouse
7. Poco::PooledThread::run() @ 0x10e823e9 in /usr/bin/clickhouse
8. Poco::ThreadImpl::runnableEntry(void*) @ 0x10e7e31a in /usr/bin/clickhouse
9. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
10. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
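For reference, one way around the parser limit (a minimal sketch, assuming clickhouse-driver, which PostHog already uses, and an illustrative column list rather than the real INSERT_SESSION_RECORDING_EVENT_SQL): end the statement at VALUES and pass the row separately, so the multi-megabyte snapshot travels as a data block instead of going through the SQL parser that enforces max_query_size.

import json
from clickhouse_driver import Client

client = Client(host='localhost')
data = json.loads(open('dump.json').read())

# Ending the query text at VALUES makes clickhouse-driver send the row as a separate
# native-protocol block, so only the short statement below is parsed against max_query_size.
# The column list here is illustrative; the real one lives in INSERT_SESSION_RECORDING_EVENT_SQL.
client.execute(
    'INSERT INTO session_recording_events (uuid, timestamp, team_id, distinct_id, session_id, snapshot_data, created_at) VALUES',
    [data],
)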
We're currently on Heroku with our kafka setup, but we'll move off of it soon. Hopefully we can update the max sizes then.
@fuziontech - assigning you this issue until kafka cluster has moved.
Another user reported this error.
Every week, 1-2 users mention that they have problems getting all of their sessions recorded. I usually just explain our current technical limitations to them, but is there some other, more helpful information we can ask them for to improve this?
In addition to the above, for reference: a user recently mentioned that PostHog captures somewhere between 30% and 50% of their sessions while FullStory captures 100%.
but is there some other, more helpful information we can ask them for to improve this?
Not right now. These are pretty clear product/technical limitations we're dealing with here, and both getting visibility into them and fixing them require development focus from our side.
What's the state of this? @rcmarron @alexkim205
For customers who have upgraded posthog-js, this seems to be resolved. The percent of "missing recordings" is down from 10-50% (depending on their website) to <1%.
At some point, we should look into the remaining 1%, but I think it's safe to close for now.
Bug description
This came from a client report: https://posthog.slack.com/archives/C01HY5TRZP0/p1610460342002100
For this user, session recording was not working despite being enabled. Digging in a bit more, our db correctly contained a lot of events, but not the full snapshot payloads.
Expected behavior
Session recording events are correctly received
How to reproduce
Visit the site referenced in the slack thread above. See that no recordings show up.
Environment
All
Additional context
This only seems to happen relatively rarely globally, but consistently for this site.
I dug into this technically and what I found was pretty fascinating:
The capture request fails with a "Not a gzipped file" error.
Reading request.body locally reveals only the bytes up until the NULL bytes, even if Content-Length indicates the body should be larger. I asked about this behavior on django IRC and googled around, but could not find anything indicating whether NULL bytes in a request body are unacceptable or whether django is behaving wrongly here.
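One hypothetical way to make the truncation visible (the middleware below is illustrative, not something in our codebase) is to compare the body Django hands us against the declared Content-Length:

# Illustrative Django middleware: log whenever the body Django exposes is shorter
# than what the client declared in Content-Length.
class BodyLengthCheckMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        declared = int(request.META.get('CONTENT_LENGTH') or 0)
        actual = len(request.body)
        if actual < declared:
            print(f'Body truncated: got {actual} bytes, Content-Length said {declared}')
        return self.get_response(request)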
Diagnosis
At least locally, django uses runserver.py under the hood, which in turn uses a WSGIServer.
In the end, the thing that reads the request body buffer is cpython's BufferedIO implementation, which assumes that the NULL bytes it reads are to be ignored.
Not sure whether that's correct or not, but that's the world we operate in. I'm also not 100% sure that gunicorn behaves in an identical way.
Misc context
Example sentry error: https://sentry.io/organizations/posthog/issues/2079528998/events/d8bf9c6fca384ee988d1288073da5ae4/?project=1899813&statsPeriod=14d (Not a gzipped file (b'\x1f\x00')), though my own is not reported.
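For context, that message is exactly what python's gzip module raises when the first two bytes of the payload are not the gzip magic bytes 1f 8b. A minimal sketch reproducing the error string (the payload here is made up):

import gzip

good = gzip.compress(b'{"event": "snapshot"}')
mangled = good[:1] + b'\x00' + good[2:]  # second magic byte replaced by a NULL

try:
    gzip.decompress(mangled)
except OSError as e:
    print(e)  # Not a gzipped file (b'\x1f\x00')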
Curl request for reproducing NULL byte behavior:
Next steps
To avoid CORS preflight requests we send the request with content-type text/plain. Using application/x-www-form-urlencoded instead would bloat the request payload size significantly (see the sketch below). One thing worth trying is multipart/form-data.
Extra complexity here might come from needing to make sure newer posthog-js stays compatible with older versions of posthog.
It is also worth updating fflate, which has seen some new developments recently.
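To make the payload-size tradeoff above concrete, a rough sketch (the payload and numbers are illustrative, not measurements from posthog-js):

import gzip
import os
import urllib.parse

# Stand-in for a compressed snapshot payload: gzipped binary data.
payload = gzip.compress(os.urandom(50000))

# Percent-encoding binary data for application/x-www-form-urlencoded turns most bytes
# into three characters, so the body ends up roughly 2-3x the size of sending it as-is.
encoded = urllib.parse.quote_from_bytes(payload)
print(len(payload), len(encoded))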
Thank you for your bug report – we love squashing them!