PostHog / posthog

🦔 PostHog provides open-source product analytics, session recording, feature flagging and A/B testing that you can self-host.
https://posthog.com

Event ingestion performance #955

Closed prawnsalad closed 3 years ago

prawnsalad commented 4 years ago

I've been testing out PostHog on a % of traffic of my web application, to monitor usage, how far people dig into the application, and what features people are actually using.

The setup (I used docker-compose) was incredibly simple. Getting into the events and creating trends and funnels delivered a lot of important insight very quickly out of the box, which is incredibly valuable in terms of ease of use. I am now looking into whether PostHog is a good fit longer term when it comes to traffic spikes, data storage, ease of scaling out, etc.

Here are some high-level observations about event ingestion specifically. Some of these may have already been considered and discussed, may not be in line with your project aims, or I may have missed why certain things are necessary. I also don't know Django, so some points may not make sense in that context. I apologise for this lengthy issue if any of that is the case!

As I start to increase event traffic to PostHog, I notice that CPU usage goes crazy and the UI stalls until the event request backlog has cleared, which I have seen take anywhere from 1 minute to 20 minutes.

Assumptions:

Notes:

Observations:

I notice that the worker processes are almost always far more idle than the master processes. I don't know Django, so this could be expected and normal, but I would have expected the workers to be doing most of the work? https://imgur.com/hQ6TjKj

Queries currently executed when inserting a single event for a new user, and then a second single event for the same user:

new user:
2020-06-08 12:20:38.283 UTC [42178] LOG: statement: SELECT "posthog_team"."id", "posthog_team"."slack_incoming_webhook", "posthog_team"."event_names", "posthog_team"."event_properties" FROM "posthog_team" WHERE "posthog_team"."id" = 1 LIMIT 21
2020-06-08 12:20:38.284 UTC [42178] LOG: statement: BEGIN
2020-06-08 12:20:38.285 UTC [42178] LOG: statement: INSERT INTO "posthog_event" ("team_id", "event", "distinct_id", "properties", "elements", "timestamp", "elements_hash") VALUES (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL) RETURNING "posthog_event"."id"
2020-06-08 12:20:38.286 UTC [42178] LOG: statement: SELECT MAX("posthog_action"."updated_at") AS "updated_at__max" FROM "posthog_action" WHERE "posthog_action"."team_id" = 1
2020-06-08 12:20:38.288 UTC [42178] LOG: statement: SELECT DISTINCT ON ("posthog_action"."id") "posthog_action"."id", "posthog_action"."name", "posthog_action"."team_id", "posthog_action"."created_at", "posthog_action"."created_by_id", "posthog_action"."deleted", "posthog_action"."post_to_slack", "posthog_action"."is_calculating", "posthog_action"."updated_at" FROM "posthog_action" INNER JOIN "posthog_actionstep" ON ("posthog_action"."id" = "posthog_actionstep"."action_id") WHERE ("posthog_action"."deleted" = false AND "posthog_actionstep"."event" = '$apphit' AND "posthog_action"."team_id" = 1)
2020-06-08 12:20:38.289 UTC [42178] LOG: statement: COMMIT
2020-06-08 12:20:38.290 UTC [42178] LOG: statement: UPDATE "posthog_team" SET "slack_incoming_webhook" = NULL, "event_names" = '["$pageview", "$apphit", "$app_network_connecting", "$app_captcha_requested", "$app_network_error", "$app_network_connected", "$app_ping"]', "event_properties" = '["$current_url", "$lib", "$lib_version", "$ip", "embedded", "server", "message", "foo"]' WHERE "posthog_team"."id" = 1
2020-06-08 12:20:38.292 UTC [42178] LOG: statement: SELECT (1) AS "a" FROM "posthog_persondistinctid" WHERE ("posthog_persondistinctid"."distinct_id" IN ('testuser') AND "posthog_persondistinctid"."team_id" = 1) LIMIT 1
2020-06-08 12:20:38.293 UTC [42178] LOG: statement: BEGIN
2020-06-08 12:20:38.293 UTC [42178] LOG: statement: INSERT INTO "posthog_person" ("created_at", "team_id", "properties", "is_user_id") VALUES ('2020-06-08T12:20:38.292940+00:00'::timestamptz, 1, '{}', NULL) RETURNING "posthog_person"."id"
2020-06-08 12:20:38.294 UTC [42178] LOG: statement: INSERT INTO "posthog_persondistinctid" ("team_id", "person_id", "distinct_id") VALUES (1, 63873, 'testuser') RETURNING "posthog_persondistinctid"."id"
2020-06-08 12:20:38.294 UTC [42178] LOG: statement: COMMIT

event1:
2020-06-08 12:21:44.095 UTC [42181] LOG: statement: SELECT "posthog_team"."id", "posthog_team"."slack_incoming_webhook", "posthog_team"."event_names", "posthog_team"."event_properties" FROM "posthog_team" WHERE "posthog_team"."id" = 1 LIMIT 21
2020-06-08 12:21:44.096 UTC [42181] LOG: statement: BEGIN
2020-06-08 12:21:44.097 UTC [42181] LOG: statement: INSERT INTO "posthog_event" ("team_id", "event", "distinct_id", "properties", "elements", "timestamp", "elements_hash") VALUES (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:21:44.070000+00:00'::timestamptz, NULL) RETURNING "posthog_event"."id"
2020-06-08 12:21:44.098 UTC [42181] LOG: statement: SELECT MAX("posthog_action"."updated_at") AS "updated_at__max" FROM "posthog_action" WHERE "posthog_action"."team_id" = 1
2020-06-08 12:21:44.100 UTC [42181] LOG: statement: SELECT DISTINCT ON ("posthog_action"."id") "posthog_action"."id", "posthog_action"."name", "posthog_action"."team_id", "posthog_action"."created_at", "posthog_action"."created_by_id", "posthog_action"."deleted", "posthog_action"."post_to_slack", "posthog_action"."is_calculating", "posthog_action"."updated_at" FROM "posthog_action" INNER JOIN "posthog_actionstep" ON ("posthog_action"."id" = "posthog_actionstep"."action_id") WHERE ("posthog_action"."deleted" = false AND "posthog_actionstep"."event" = '$apphit' AND "posthog_action"."team_id" = 1)
2020-06-08 12:21:44.101 UTC [42181] LOG: statement: COMMIT
2020-06-08 12:21:44.102 UTC [42181] LOG: statement: SELECT (1) AS "a" FROM "posthog_persondistinctid" WHERE ("posthog_persondistinctid"."distinct_id" IN ('testuser') AND "posthog_persondistinctid"."team_id" = 1) LIMIT 1

With the changes listed above, these queries could, for the vast majority of event calls, be reduced to the following (a rough sketch of the Django side follows the queries):

single event insert:
2020-06-08 12:20:38.285 UTC [42178] LOG: statement: INSERT INTO "posthog_event" ("team_id", "event", "distinct_id", "properties", "elements", "timestamp", "elements_hash") VALUES (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL)

Batch of 5 events:
2020-06-08 12:20:38.285 UTC [42178] LOG: statement: INSERT INTO "posthog_event" ("team_id", "event", "distinct_id", "properties", "elements", "timestamp", "elements_hash")
                                                    VALUES (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL),
                                                    (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL),
                                                    (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL),
                                                    (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL),
                                                    (1, '$apphit', 'testuser', '{"foo": "bar", "$lib": "posthog-node", "$lib_version": "1.0.3", "$ip": "172.23.0.1"}', '[]', '2020-06-08T12:20:38.255000+00:00'::timestamptz, NULL);
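For illustration, here is a minimal sketch of how the batching side might look in Django using bulk_create, so that a flush of N buffered events becomes a single multi-row INSERT like the one above. The Event model and field names are taken from the queries logged above; the EventBuffer class itself is hypothetical and skips action matching, person handling, and error handling.

```python
# Hypothetical sketch only: buffer incoming events and flush them in one
# multi-row INSERT via Django's bulk_create(). Field names follow the
# posthog_event columns visible in the query log above.
from django.utils import timezone

from posthog.models import Event  # assumed import path


class EventBuffer:
    def __init__(self, flush_size=100):
        self.flush_size = flush_size
        self.pending = []

    def add(self, team_id, event, distinct_id, properties):
        self.pending.append(Event(
            team_id=team_id,
            event=event,
            distinct_id=distinct_id,
            properties=properties,
            elements=[],
            timestamp=timezone.now(),
            elements_hash=None,
        ))
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One INSERT ... VALUES (...), (...), ... instead of one round trip per event.
        Event.objects.bulk_create(self.pending)
        self.pending = []
```

A periodic flush (or a flush per batch API request) would keep end-to-end latency bounded while still collapsing most of the per-event round trips.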
fuziontech commented 4 years ago

@prawnsalad Thank you so much for this super detailed and well-thought-out issue. We all really appreciate the work you put into this.

First off, your assumptions are totally correct.

I agree with almost everything you suggested here and will make sure it is incorporated into the work we are doing to improve the performance of event ingestion. For a larger installation (with the current setup) we would want to split event ingestion traffic from the rest of PostHog so that a flood of incoming events doesn't impact it. Typically we would use a reverse proxy to send events to a dedicated pool of Django instances that only handle ingestion, keeping that pool isolated from another pool that handles analytics requests. This would protect analytics users from having the analytics side taken down by a flood of event requests. Longer term, I do think it would make sense to rewrite the event pipeline in something lower level like Rust or Go as an optimization.

In general, though, I think you nailed it by suggesting that we should be more aggressive with caching and batching across our event handling pipeline.
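To illustrate the caching half of that, here is a minimal sketch that would avoid the per-event SELECT and UPDATE on posthog_team seen in the log above. The Team model and its event_names field come from those queries; the record_event_name helper, the cache key, and the TTL are made up purely for illustration.

```python
# Hypothetical sketch only: cache each team's known event names so that
# posthog_team is read at most once per TTL and written only when a genuinely
# new event name shows up, instead of SELECT + UPDATE on every event.
from django.core.cache import cache

from posthog.models import Team  # assumed import path

CACHE_TTL = 300  # seconds; arbitrary value for illustration


def record_event_name(team_id, event_name):
    key = f"team:{team_id}:event_names"
    known = cache.get(key)
    if known is None:
        team = Team.objects.only("event_names").get(pk=team_id)
        known = set(team.event_names or [])
    if event_name not in known:
        known.add(event_name)
        # Write back only when something actually changed.
        Team.objects.filter(pk=team_id).update(event_names=sorted(known))
    cache.set(key, known, CACHE_TTL)
```

The same pattern would apply to event_properties, and it pairs naturally with the batching sketch above since both remove work from the hot per-event path.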

yakkomajuri commented 3 years ago

Wondering how relevant this still is given how much our architecture has changed over the last 6 months.

Not that we have necessarily implemented all of these, but the real question is: will this issue be referenced again in the future?

Twixes commented 3 years ago

I think we can close this now.