Closed prawnsalad closed 3 years ago
@prawnsalad Thank you so much for this super detailed and well-thought-out issue. We all really appreciate the work you put into this.
First off, your assumptions are totally correct.
I agree with almost everything you suggested here and will make sure that it is incorporated into the work we are doing to improve the performance of event ingestion. For a larger installation, we would want to split traffic between the PostHog app and event ingestion (with the current setup) so that a flood of incoming events doesn't impact the app. Typically we would use a reverse proxy to send events to a dedicated set of Django instances that only handle events, keeping this pool isolated from another pool that handles analytics requests. This would protect analytics users from having the analytics side taken down by a flood of event requests. Longer term, I do think it would make sense to rewrite the event pipeline in something a bit more low-level, like Rust or Go, as an optimization.
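To make the split concrete, here is a minimal sketch of the routing decision a reverse proxy would make. The path prefixes, host names, and the `pick_pool` helper are all hypothetical and only illustrate the idea; a real deployment would express this in the proxy's own configuration rather than in Python.

```python
# Hypothetical path prefixes for event ingestion; adjust to the real endpoints.
EVENT_PREFIXES = ("/e/", "/capture/", "/batch/")

# Hypothetical upstream pools; host names are placeholders.
EVENT_POOL = ["django-events-1:8000", "django-events-2:8000"]
APP_POOL = ["django-app-1:8000"]


def pick_pool(path):
    """Return the upstream pool that should serve this request path."""
    # str.startswith accepts a tuple and matches any of the prefixes.
    if path.startswith(EVENT_PREFIXES):
        return EVENT_POOL
    return APP_POOL
```

With this shape, a burst of requests to the event paths only ever lands on the event pool, so the instances serving the UI keep their capacity.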
In general, though, I think you nailed it by suggesting that we should be more aggressive with caching and batching across our event handling pipeline.
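As a rough illustration of what caching plus batching could look like, here is a small self-contained sketch. It is not PostHog's actual code: `team_id_for_token`, `EventBatcher`, and the `FAKE_TEAMS` table are hypothetical stand-ins, and the "database" is simulated with a counter so the effect is visible.

```python
from functools import lru_cache

# Stand-in for the team table; tokens and ids are made up.
FAKE_TEAMS = {"token-a": 1}
db_lookups = 0  # counts simulated database hits, for illustration only


@lru_cache(maxsize=1024)
def team_id_for_token(api_token):
    """Hypothetical lookup standing in for a SELECT on the team table."""
    global db_lookups
    db_lookups += 1
    return FAKE_TEAMS[api_token]


class EventBatcher:
    """Buffer events and hand them to `flush_fn` in batches of `max_size`."""

    def __init__(self, flush_fn, max_size=100):
        self._flush_fn = flush_fn
        self._max_size = max_size
        self._buffer = []

    def add(self, event):
        self._buffer.append(event)
        if len(self._buffer) >= self._max_size:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        # Swap the buffer out before writing so new events go to a fresh list.
        batch, self._buffer = self._buffer, []
        self._flush_fn(batch)


# Usage: seven events from the same team, written in batches of three.
written = []  # each appended item represents one bulk insert
batcher = EventBatcher(flush_fn=written.append, max_size=3)
for i in range(7):
    team_id = team_id_for_token("token-a")
    batcher.add({"team_id": team_id, "event": f"pageview-{i}"})
batcher.flush()  # write whatever is left over
```

Here the seven events cause one simulated team lookup and three bulk writes instead of seven of each. A real version would also need to flush on a timer so that events do not sit in the buffer during quiet periods, and to invalidate the cache when a token changes.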
Wondering how relevant this still is given how much our architecture has changed over the last 6 months.
Not that we have necessarily implemented all of these, but the real question is: will this issue be referenced again in the future?
I think we can close this now.
I've been testing out PostHog on a percentage of my web application's traffic to monitor usage, how far people dig into the application, and what features people are actually using.
The setup (I used docker-compose) was incredibly simple. Getting into the events and creating trends and funnels delivered a lot of important insight very quickly out of the box, which is incredibly valuable for ease of use. I am now looking to see if PostHog is ideal longer term when it comes to traffic spikes, data storage, ease of scaling out, etc.
Here are some high-level observations when it comes to event ingestion specifically. Some may have already been thought about and discussed, or may not be in line with your project aims, or I might have even missed why some things are necessary. I also don't know Django, so some points may not make sense in that context. I apologise for this lengthy issue if any of these are the case!
As I start to increase event traffic to PostHog, I notice that CPU usage goes crazy and the UI stalls until the event request backlog has cleared, which I have seen take anywhere from 1 to 20 minutes.
Assumptions:
Notes:
Observations:
I notice that the worker processes are almost always far more idle than the master processes. I don't know Django, so this could be expected and normal, but I would have expected the workers to be doing most of the work. https://imgur.com/hQ6TjKj
Current executed queries when inserting a single event with a new user and then a second single event with the same user:
With the changes listed above, these queries could be reduced, for the large majority of event calls, to: