PostHog / posthog

🦔 PostHog provides open-source product analytics, session recording, feature flagging and A/B testing that you can self-host.
https://posthog.com

Event ingestion performance #955

Closed prawnsalad closed 3 years ago

prawnsalad commented 4 years ago

I've been testing out PostHog on a % of traffic of my web application, to monitor usage, how far people dig into the application, and what features people are actually using.

The setup (I used docker-compose) was incredibly simple. Getting into the events and creating trends and funnels delivered a lot of important insight very quickly out of the box, which is incredibly valuable in terms of ease of use. I am now looking into whether PostHog is a good fit longer term when it comes to traffic spikes, data storage, ease of scaling out, etc.

Here are some high-level observations about event ingestion specifically. Some of these may have already been considered and discussed, may not be in line with your project aims, or I may have missed why certain things are necessary. I also don't know Django, so some points may not make sense in that context. I apologise for this lengthy issue if any of that is the case!

As I start to increase event traffic to PostHog, I notice that CPU usage goes crazy and the UI stalls until the event request backlog has cleared, which I have seen take anywhere from 1 minute to 20 minutes.

Assumptions:

Notes:

Observations:

I notice that the worker processes are almost always far more idle than the master processes. I don't know Django, so this could be expected and normal, but I would have expected the workers to be doing most of the work? https://imgur.com/hQ6TjKj

Queries currently executed when inserting a single event for a new user, and then a second single event for the same user:

new user:
2020-06-08 12:20:38.283 UTC [42178] LOG: statement: SELECT "posthog_team"."id", "posthog_team"."slack_incoming_webhook", "posthog_team"."event_names", "posthog_team"."event_properties" FROM "posthog_team" WHERE "posthog_team"."id" = 1 LIMIT 21
2020-06-08 12:20:38.284 UTC [42178] LOG: statement: BEGIN
2020-06-08 12:20:38.285 UTC [42178] LOG: statement: INSERT INTO "posthog_event" ("team_id", "event", "distinct_id", "properties", "elements", "timestamp", "elements_hash") VALUES (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL) RETURNING "posthog_event"."id"
2020-06-08 12:20:38.286 UTC [42178] LOG: statement: SELECT MAX("posthog_action"."updated_at") AS "updated_at__max" FROM "posthog_action" WHERE "posthog_action"."team_id" = 1
2020-06-08 12:20:38.288 UTC [42178] LOG: statement: SELECT DISTINCT ON ("posthog_action"."id") "posthog_action"."id", "posthog_action"."name", "posthog_action"."team_id", "posthog_action"."created_at", "posthog_action"."created_by_id", "posthog_action"."deleted", "posthog_action"."post_to_slack", "posthog_action"."is_calculating", "posthog_action"."updated_at" FROM "posthog_action" INNER JOIN "posthog_actionstep" ON ("posthog_action"."id" = "posthog_actionstep"."action_id") WHERE ("posthog_action"."deleted" = false AND "posthog_actionstep"."event" = '$apphit' AND "posthog_action"."team_id" = 1)
2020-06-08 12:20:38.289 UTC [42178] LOG: statement: COMMIT
2020-06-08 12:20:38.290 UTC [42178] LOG: statement: UPDATE "posthog_team" SET "slack_incoming_webhook" = NULL, "event_names" = '["$pageview", "$apphit", "$app_network_connecting", "$app_captcha_requested", "$app_network_error", "$app_network_connected", "$app_ping"]', "event_properties" = '["$current_url", "$lib", "$lib_version", "$ip", "embedded", "server", "message", "foo"]' WHERE "posthog_team"."id" = 1
2020-06-08 12:20:38.292 UTC [42178] LOG: statement: SELECT (1) AS "a" FROM "posthog_persondistinctid" WHERE ("posthog_persondistinctid"."distinct_id" IN ('testuser') AND "posthog_persondistinctid"."team_id" = 1) LIMIT 1
2020-06-08 12:20:38.293 UTC [42178] LOG: statement: BEGIN
2020-06-08 12:20:38.293 UTC [42178] LOG: statement: INSERT INTO "posthog_person" ("created_at", "team_id", "properties", "is_user_id") VALUES ('2020-06-08T12:20:38.292940+00:00'::timestamptz, 1, '{}', NULL) RETURNING "posthog_person"."id"
2020-06-08 12:20:38.294 UTC [42178] LOG: statement: INSERT INTO "posthog_persondistinctid" ("team_id", "person_id", "distinct_id") VALUES (1, 63873, 'testuser') RETURNING "posthog_persondistinctid"."id"
2020-06-08 12:20:38.294 UTC [42178] LOG: statement: COMMIT

event1:
2020-06-08 12:21:44.095 UTC [42181] LOG: statement: SELECT "posthog_team"."id", "posthog_team"."slack_incoming_webhook", "posthog_team"."event_names", "posthog_team"."event_properties" FROM "posthog_team" WHERE "posthog_team"."id" = 1 LIMIT 21
2020-06-08 12:21:44.096 UTC [42181] LOG: statement: BEGIN
2020-06-08 12:21:44.097 UTC [42181] LOG: statement: INSERT INTO "posthog_event" ("team_id", "event", "distinct_id", "properties", "elements", "timestamp", "elements_hash") VALUES (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:21:44.070000+00:00'::timestamptz, NULL) RETURNING "posthog_event"."id"
2020-06-08 12:21:44.098 UTC [42181] LOG: statement: SELECT MAX("posthog_action"."updated_at") AS "updated_at__max" FROM "posthog_action" WHERE "posthog_action"."team_id" = 1
2020-06-08 12:21:44.100 UTC [42181] LOG: statement: SELECT DISTINCT ON ("posthog_action"."id") "posthog_action"."id", "posthog_action"."name", "posthog_action"."team_id", "posthog_action"."created_at", "posthog_action"."created_by_id", "posthog_action"."deleted", "posthog_action"."post_to_slack", "posthog_action"."is_calculating", "posthog_action"."updated_at" FROM "posthog_action" INNER JOIN "posthog_actionstep" ON ("posthog_action"."id" = "posthog_actionstep"."action_id") WHERE ("posthog_action"."deleted" = false AND "posthog_actionstep"."event" = '$apphit' AND "posthog_action"."team_id" = 1)
2020-06-08 12:21:44.101 UTC [42181] LOG: statement: COMMIT
2020-06-08 12:21:44.102 UTC [42181] LOG: statement: SELECT (1) AS "a" FROM "posthog_persondistinctid" WHERE ("posthog_persondistinctid"."distinct_id" IN ('testuser') AND "posthog_persondistinctid"."team_id" = 1) LIMIT 1

With the changes listed above, these queries could, for the vast majority of event calls, be reduced to the following (a rough sketch of the Django side follows the queries):

single event insert:
2020-06-08 12:20:38.285 UTC [42178] LOG: statement: INSERT INTO "posthog_event" ("team_id", "event", "distinct_id", "properties", "elements", "timestamp", "elements_hash") VALUES (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL)

Batch of 5 events:
2020-06-08 12:20:38.285 UTC [42178] LOG: statement: INSERT INTO "posthog_event" ("team_id", "event", "distinct_id", "properties", "elements", "timestamp", "elements_hash")
                                                    VALUES (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL),
                                                    (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL),
                                                    (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL),
                                                    (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL),
                                                    (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL);
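For illustration, here is a minimal sketch of how the batching side might look in Django using bulk_create, so that a flush of N buffered events becomes a single multi-row INSERT like the one above. The Event model and field names are taken from the queries logged above; the EventBuffer class itself is hypothetical and skips action matching, person handling, and error handling.

```python
# Hypothetical sketch only: buffer incoming events and flush them in one
# multi-row INSERT via Django's bulk_create(). Field names follow the
# posthog_event columns visible in the query log above.
from django.utils import timezone

from posthog.models import Event  # assumed import path


class EventBuffer:
    def __init__(self, flush_size=100):
        self.flush_size = flush_size
        self.pending = []

    def add(self, team_id, event, distinct_id, properties):
        self.pending.append(Event(
            team_id=team_id,
            event=event,
            distinct_id=distinct_id,
            properties=properties,
            elements=[],
            timestamp=timezone.now(),
            elements_hash=None,
        ))
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One INSERT ... VALUES (...), (...), ... instead of one round trip per event.
        Event.objects.bulk_create(self.pending)
        self.pending = []
```

A periodic flush (or a flush per batch API request) would keep end-to-end latency bounded while still collapsing most of the per-event round trips.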
fuziontech commented 4 years ago

@prawnsalad Thank you so much for this super detailed and well-thought-out issue. We all really appreciate the work you put into this.

First off, your assumptions are totally correct.

I agree with almost everything you suggested here and will make sure it is incorporated into the work we are doing to improve the performance of event ingestion. For a larger installation (with the current setup) we would want to split event ingestion traffic from the rest of PostHog so that a flood of incoming events doesn't impact it. Typically we would use a reverse proxy to send events to a dedicated pool of Django instances that only handle ingestion, keeping that pool isolated from another pool that handles analytics requests. This would protect analytics users from having the analytics side taken down by a flood of event requests. Longer term, I do think it would make sense to rewrite the event pipeline in something lower level like Rust or Go as an optimization.

In general, though, I think you nailed it by suggesting that we should be more aggressive with caching and batching across our event handling pipeline.
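To illustrate the caching half of that, here is a minimal sketch that would avoid the per-event SELECT and UPDATE on posthog_team seen in the log above. The Team model and its event_names field come from those queries; the record_event_name helper, the cache key, and the TTL are made up purely for illustration.

```python
# Hypothetical sketch only: cache each team's known event names so that
# posthog_team is read at most once per TTL and written only when a genuinely
# new event name shows up, instead of SELECT + UPDATE on every event.
from django.core.cache import cache

from posthog.models import Team  # assumed import path

CACHE_TTL = 300  # seconds; arbitrary value for illustration


def record_event_name(team_id, event_name):
    key = f"team:{team_id}:event_names"
    known = cache.get(key)
    if known is None:
        team = Team.objects.only("event_names").get(pk=team_id)
        known = set(team.event_names or [])
    if event_name not in known:
        known.add(event_name)
        # Write back only when something actually changed.
        Team.objects.filter(pk=team_id).update(event_names=sorted(known))
    cache.set(key, known, CACHE_TTL)
```

The same pattern would apply to event_properties, and it pairs naturally with the batching sketch above since both remove work from the hot per-event path.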

yakkomajuri commented 3 years ago

Wondering how relevant this still is given how much our architecture has changed over the last 6 months.

Not that we have necessarily implemented all of these, but the real question is: will this issue be referenced again in the future?

Twixes commented 3 years ago

I think we can close this now.