Seems like this was an issue with curl rather than event capture - I set up a test suite (code below for reference) and the problem did not occur when the request was made by the browser.
Something else is (still) going wrong for that company, though - not sure why session recording is not working for them.
/// <reference types="cypress" />

describe('Event capture', () => {
    given('options', () => ({ capture_pageview: false }))
    given('sessionRecording', () => false)
    given('supportedCompression', () => ['gzip', 'lz64'])

    // :TRICKY: Use a custom start command over beforeEach to deal with given2 not being ready yet.
    const start = () => {
        cy.route({
            method: 'POST',
            url: '**/decide/*',
            response: {
                config: { enable_collect_everything: true },
                editorParams: {},
                featureFlags: ['session-recording-player'],
                isAuthenticated: false,
                supportedCompression: ['gzip-js', 'lz64'],
            },
        }).as('decide')

        cy.visit('./playground/cypress')
        cy.posthogInit(given.options)
    }

    it('works', () => {
        start()

        cy.readFile('../posthog/string.json').then((body) => {
            console.log(body)
            cy.posthog().then(($posthog) => {
                $posthog._handle_queued_event('http://localhost:8000/s', body)
                console.log($posthog._handle_queued_event)
            })
        })
    })
})
Managed to reproduce this locally by:
Similar to above, normal events show up in the db but full_snapshot ones do not.
Digging in, I saved the full snapshot events to a file:
if len(data["snapshot_data"]) > 1000000:
    open('dump.json', 'w').write(json.dumps(data))
And attempted to run this via a python shell:
from ee.clickhouse.sql.session_recording_events import INSERT_SESSION_RECORDING_EVENT_SQL
from ee.kafka_client.client import ClickhouseProducer
from ee.kafka_client.topics import KAFKA_SESSION_RECORDING_EVENTS
from datetime import datetime
import json

# Load the dumped snapshot event and produce it through the same path the app uses
data = json.loads(open('dump.json').read())
p = ClickhouseProducer()
produce = lambda: p.produce(sql=INSERT_SESSION_RECORDING_EVENT_SQL, topic=KAFKA_SESSION_RECORDING_EVENTS, data=data)

# produce() returns nothing, so this also prints the current time (useful when grepping the logs below)
produce() or datetime.now()
Nothing interesting shows up after this in clickhouse logs: docker cp posthog_clickhouse_1:/var/log/clickhouse-server/clickhouse-server.log chserver.log; less chserver.log
I think this narrows things down to either kafka or clickhouse. For this site, the raw json message is 3962402 bytes (nearly 4 MB). Might we be running into kafka message size limitations?
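For reference, a quick sanity check against the defaults (a minimal sketch, assuming the dump.json from above; 1048576 is kafka-python's stock producer max_request_size, and the broker default message.max.bytes is roughly the same):

import json

# kafka-python's KafkaProducer defaults max_request_size to 1 MiB, and the broker's
# default message.max.bytes is in the same ballpark, so a ~4 MB message would be rejected.
PRODUCER_DEFAULT_MAX_REQUEST_SIZE = 1048576

data = json.loads(open('dump.json').read())
payload_size = len(json.dumps(data).encode('utf-8'))
print(payload_size, payload_size > PRODUCER_DEFAULT_MAX_REQUEST_SIZE)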
As per https://stackoverflow.com/questions/21020347/how-can-i-send-large-messages-with-kafka-over-15mb I made a couple of changes locally:
diff --git a/ee/docker-compose.ch.yml b/ee/docker-compose.ch.yml
index e70a5aed..b2dfdf59 100644
--- a/ee/docker-compose.ch.yml
+++ b/ee/docker-compose.ch.yml
@@ -106,3 +106,6 @@ services:
     environment:
         KAFKA_ADVERTISED_HOST_NAME: kafka
         KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
+        KAFKA_MESSAGE_MAX_BYTES: 1000048588
+        KAFKA_REPLICA_FETCH_MAX_BYTES: 1000048588
+        KAFKA_FETCH_MESSAGE_MAX_BYTES: 1000048588
diff --git a/ee/kafka_client/client.py b/ee/kafka_client/client.py
index 7c7f9aed..fad6a08e 100644
--- a/ee/kafka_client/client.py
+++ b/ee/kafka_client/client.py
@@ -34,7 +34,7 @@ class _KafkaProducer:
         elif KAFKA_BASE64_KEYS:
             self.producer = helper.get_kafka_producer(value_serializer=lambda d: d)
         else:
-            self.producer = KP(bootstrap_servers=KAFKA_HOSTS)
+            self.producer = KP(bootstrap_servers=KAFKA_HOSTS, max_request_size=1000048588)

     @staticmethod
     def json_serializer(d):
After this, session recording worked.
@fuziontech mind bumping these in the production kafka cluster and in our producers?
Tried out sending the raw insert statement to clickhouse as suggested by @fuziontech. This however runs into max query size limits:
0. DB::parseQueryAndMovePosition(DB::IParser&, char const*&, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, unsigned long, unsigned long) @ 0xe938c39 in /usr/bin/clickhouse
1. ? @ 0xdeca297 in /usr/bin/clickhouse
2. DB::executeQuery(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, DB::Context&, bool, DB::QueryProcessingStage::Enum, bool) @ 0xdec9e9d in /usr/bin/clickhouse
3. DB::TCPHandler::runImpl() @ 0xe56a3b6 in /usr/bin/clickhouse
4. DB::TCPHandler::run() @ 0xe5771b7 in /usr/bin/clickhouse
5. Poco::Net::TCPServerConnection::start() @ 0x10d4f71f in /usr/bin/clickhouse
6. Poco::Net::TCPServerDispatcher::run() @ 0x10d5112e in /usr/bin/clickhouse
7. Poco::PooledThread::run() @ 0x10e823e9 in /usr/bin/clickhouse
8. Poco::ThreadImpl::runnableEntry(void*) @ 0x10e7e31a in /usr/bin/clickhouse
9. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
10. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
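For reference, one way around the parser limit (a minimal sketch, assuming clickhouse-driver, which PostHog already uses, and an illustrative column list rather than the real INSERT_SESSION_RECORDING_EVENT_SQL): end the statement at VALUES and pass the row separately, so the multi-megabyte snapshot travels as a data block instead of going through the SQL parser that enforces max_query_size.

import json
from clickhouse_driver import Client

client = Client(host='localhost')
data = json.loads(open('dump.json').read())

# Ending the query text at VALUES makes clickhouse-driver send the row as a separate
# native-protocol block, so only the short statement below is parsed against max_query_size.
# The column list here is illustrative; the real one lives in INSERT_SESSION_RECORDING_EVENT_SQL.
client.execute(
    'INSERT INTO session_recording_events (uuid, timestamp, team_id, distinct_id, session_id, snapshot_data, created_at) VALUES',
    [data],
)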
We're currently on Heroku with our kafka setup, but we'll move off of it soon. Hopefully we can update the max sizes then.
@fuziontech - assigning you this issue until kafka cluster has moved.
Another user reported this error.
Every week, 1-2 users mention that they have problems getting all of their sessions recorded. I usually just explain our current technical limitations to them, but is there some other, more helpful information we can ask them for to improve this?
In addition to the above, for reference: a user recently mentioned that PostHog captures somewhere between 30% and 50% of their sessions while FullStory captures 100%.
but is there some other, more helpful information we can ask them for to improve this?
Not right now. These are pretty clear product/technical limitations we're dealing with here, and both getting visibility into them and fixing them require development focus from our side.
What's the state of this? @rcmarron @alexkim205
For customers who have upgraded posthog-js, this seems to be resolved. The percent of "missing recordings" is down from 10-50% (depending on their website) to <1%.
At some point, we should look into the remaining 1%, but I think it's safe to close for now.
Bug description
This came from a client report: https://posthog.slack.com/archives/C01HY5TRZP0/p1610460342002100
For this user, session recording was not working despite being enabled. Digging in a bit more, our db correctly contained a lot of events, but not the full snapshot payloads.
Expected behavior
Session recording events are correctly received
How to reproduce
Visit the site referenced in the slack thread above. See that no recordings show up.
Environment
All
Additional context
This only seems to happen relatively rarely globally, but consistently for this site.
I dug into this technically and what I found was pretty fascinating:
The capture request fails with a "Not a gzipped file" error.
Reading request.body locally reveals only the bytes up until the NULL bytes, even if Content-Length indicates the body should be larger. I asked about this behavior on django IRC and googled around, but could not find anything indicating whether NULL bytes in a request body are unacceptable or whether django is behaving wrongly here.
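One hypothetical way to make the truncation visible (the middleware below is illustrative, not something in our codebase) is to compare the body Django hands us against the declared Content-Length:

# Illustrative Django middleware: log whenever the body Django exposes is shorter
# than what the client declared in Content-Length.
class BodyLengthCheckMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        declared = int(request.META.get('CONTENT_LENGTH') or 0)
        actual = len(request.body)
        if actual < declared:
            print(f'Body truncated: got {actual} bytes, Content-Length said {declared}')
        return self.get_response(request)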
Diagnosis
At least locally, django uses runserver.py under the hood, which in turn uses a WSGIServer.
In the end, the thing that reads the request body buffer is cpython's BufferedIO implementation, which assumes that the NULL bytes it reads are to be ignored.
Not sure whether that's correct or not, but that's the world we operate in. I'm also not 100% sure that gunicorn behaves in an identical way.
Misc context
Example sentry error: https://sentry.io/organizations/posthog/issues/2079528998/events/d8bf9c6fca384ee988d1288073da5ae4/?project=1899813&statsPeriod=14d (Not a gzipped file (b'\x1f\x00')), though my own is not reported.
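For context, that message is exactly what python's gzip module raises when the first two bytes of the payload are not the gzip magic bytes 1f 8b. A minimal sketch reproducing the error string (the payload here is made up):

import gzip

good = gzip.compress(b'{"event": "snapshot"}')
mangled = good[:1] + b'\x00' + good[2:]  # second magic byte replaced by a NULL

try:
    gzip.decompress(mangled)
except OSError as e:
    print(e)  # Not a gzipped file (b'\x1f\x00')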
Curl request for reproducing NULL byte behavior:
Next steps
To avoid CORS preflight requests we send the request with content-type text/plain. Using application/x-www-form-urlencoded instead would bloat the request payload size significantly (see the sketch below). One thing worth trying is multipart/form-data.
Extra complexity here might come from needing to make sure newer posthog-js stays compatible with older versions of posthog.
It is also worth updating fflate, which has seen some new developments recently.
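To make the payload-size tradeoff above concrete, a rough sketch (the payload and numbers are illustrative, not measurements from posthog-js):

import gzip
import os
import urllib.parse

# Stand-in for a compressed snapshot payload: gzipped binary data.
payload = gzip.compress(os.urandom(50000))

# Percent-encoding binary data for application/x-www-form-urlencoded turns most bytes
# into three characters, so the body ends up roughly 2-3x the size of sending it as-is.
encoded = urllib.parse.quote_from_bytes(payload)
print(len(payload), len(encoded))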
Thank you for your bug report – we love squashing them!