PostHog / posthog


Tim's Performance Mega Issue Q2/Q3 2024 #22413

Open timgl opened 4 months ago

timgl commented 4 months ago

A dumping ground for all performance related things I'm looking at.

TODOs are things I want to look at; checkboxes are things that are ready to be worked on by me or others.

Layers we go through when we make a query

UI

API calls/async queries

Caching

AST

Queries

Clickhouse

Tools I've built to look at performance

Reasons why an individual query times out

Done

Twixes commented 4 months ago

Quick Q: Does this list include all of https://github.com/PostHog/company-internal/issues/1379?

pauldambra commented 4 months ago

TODO: the interface sometimes feels sluggish even on an M1 MacBook. Profile the page.

maybe this https://posthog.slack.com/archives/C0113360FFV/p1715330140411809

tkaemming commented 2 months ago

This is mostly just a dump of my notes so a little scattered but hopefully there's some useful stuff in here.

Dashboards and tools to figure out what's happening

The overall query performance dashboard gives a good picture of what's going on across the cluster. It's biased towards online workloads because those queries are often more visible to users. This is a good place to start.

Many of these queries are composed of several snippets that are helpful to know about and are generally pretty self-explanatory: query type, high-priority team IDs, and query features.

Query features are characteristics known to sometimes slow down queries, tagged for ease of classification. The list is not comprehensive, and could probably use some attention from somebody who has more context. Just because a feature appears in a slow query does not imply that the presence of that feature is the reason the query is slow, but it can be a starting point for exploration or correlation, as in this example that shows error rate by feature or this example that shows error rate by feature and query type.

There is also an annotated query log question that includes some of this supplemental data and can be a useful starting point for new questions.
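To make the "query features" idea concrete, here's roughly the shape of that classification, run straight against system.query_log. This is only a sketch: the feature patterns and the seven-day window are guesses for illustration, and the real definitions live in the dashboard snippets mentioned above.

```sql
-- Sketch: error rate per "query feature", approximated by pattern-matching the
-- query text. The feature patterns below are illustrative assumptions, not the
-- dashboard's actual definitions.
SELECT
    multiIf(
        query LIKE '%person_distinct_id2%', 'pdi2-join',
        query LIKE '%sessions%JOIN%',       'sessions-join',
        query LIKE '%person%JOIN%',         'persons-join',
        'other'
    ) AS feature,
    count() AS queries,
    countIf(type = 'ExceptionWhileProcessing') AS errors,
    round(errors / queries, 3) AS error_rate
FROM system.query_log
WHERE event_date >= today() - 7
  AND type IN ('QueryFinish', 'ExceptionWhileProcessing')
GROUP BY feature
ORDER BY error_rate DESC
```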

Types of problems we run into during query execution

Many of the questions linked to by the dashboard either group by, or can be filtered by, the type of problem we encountered while running a query.

Errors

MEMORY_LIMIT_EXCEEDED errors (241) are pretty easy to reproduce, and it's relatively easy to identify whether or not they've been fixed after a change is made: queries run over the same dataset after tuning or other fixes will either start to work, or they won't. Other factors typically aren't significant enough to make these queries start (or stop) failing without the query or the data being queried changing.
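A sketch of how the recent failures of this kind could be pulled so the same queries can be re-run after a tuning change. It assumes ClickHouse error code 241 for MEMORY_LIMIT_EXCEEDED and that team metadata is tagged in log_comment as JSON; the key name is a guess.

```sql
-- Sketch: recent MEMORY_LIMIT_EXCEEDED (code 241) failures, worst offenders first,
-- so the same queries can be re-run after a fix to confirm it worked.
-- The log_comment JSON key 'team_id' is an assumption about how queries are tagged.
SELECT
    JSONExtractInt(log_comment, 'team_id') AS team_id,
    formatReadableSize(memory_usage) AS peak_memory,
    query_duration_ms,
    query
FROM system.query_log
WHERE event_date >= today() - 7
  AND type = 'ExceptionWhileProcessing'
  AND exception_code = 241
ORDER BY memory_usage DESC
LIMIT 20
```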

Typical causes

One or more of

Known issues
  1. Some stuff still needs to be moved to HogQL to make use of existing optimizations: it probably doesn't make much sense to duplicate optimizations such as overrides into old queries rather than just updating them. This is also the case with several other types of queries: anything that still uses person_distinct_id2 is an obvious tell (feature pdi2-join), such as get_breakdown_prop_values, user_blast_radius, etc.
  2. Joins will probably need regular, somewhat continuous attention: the sessions join (feature sessions-join) is already starting to get big for some customers and causing queries to hit memory limits. The persons join can get big too (persons-join), but this seems to be largely (but not always) ameliorated by overrides changes.

It's difficult to know where the ceiling is for either of these. Maybe we should set up some alerting on memory_usage to let us know if we're starting to approach the ceiling? In particular because teams with a lot of data (and therefore typically high-value ones) run into these issues first.
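Something like this could be the basis for that alerting; the 32 GiB per-query limit and the log_comment key are placeholder assumptions, not our actual settings.

```sql
-- Sketch: weekly high-water mark of memory_usage per team, as a fraction of an
-- assumed per-query limit, to spot teams approaching the ceiling before they hit it.
-- The 32 GiB limit and the log_comment 'team_id' key are illustrative assumptions.
WITH 32 * 1024 * 1024 * 1024 AS memory_limit_bytes
SELECT
    JSONExtractInt(log_comment, 'team_id') AS team_id,
    max(memory_usage) AS peak_bytes,
    round(peak_bytes / memory_limit_bytes, 2) AS fraction_of_limit
FROM system.query_log
WHERE event_date >= today() - 7
  AND type = 'QueryFinish'
GROUP BY team_id
HAVING fraction_of_limit > 0.8
ORDER BY fraction_of_limit DESC
```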

Slow queries

These come in two flavors, fatal and non-fatal slow queries.

The threshold for slow queries is defined in the SQL snippet defining problem types. It would probably make sense for this to be more discriminating than just applying a single threshold to all types of queries.

Fatal queries are TIMEOUT_EXCEEDED errors (159) and TOO_SLOW errors (160). The difference is that we didn't wait for the result set to be computed, so we don't know how slow they actually would have been to run to completion. Latency distributions are going to be skewed towards lower values, since these queries end up either excluded from the distribution or included at an artificially low value. The fact that they don't return a result set to the user obviously makes the impact of a fatal slow query on user experience more significant.
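For reference, a minimal sketch of that problem-type classification. The 10s threshold is a placeholder, and as noted above it should probably vary by query type; the real definition is in the snippet defining problem types mentioned earlier.

```sql
-- Sketch: classify queries into problem types. The 10s threshold is a placeholder;
-- the real snippet defines this (and should ideally vary per query type).
SELECT
    multiIf(
        exception_code IN (159, 160), 'fatal slow (timeout/too slow)',
        exception_code != 0,          'other error',
        query_duration_ms > 10000,    'non-fatal slow',
        'ok'
    ) AS problem_type,
    count() AS queries
FROM system.query_log
WHERE event_date >= today() - 7
  AND type IN ('QueryFinish', 'ExceptionWhileProcessing')
GROUP BY problem_type
ORDER BY queries DESC
```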

Typical causes

One or more of:

These problems are not always reproducible: execution time can be significantly impacted by other cluster activity (backups, mutations, partial outages, etc.) that causes IO contention or other capacity limitation/saturation issues. This can lead to lots of false positives when scanning the query log for potential optimization targets, when something is only slow due to external circumstances. A query that is slow enough to cause an error on an overloaded cluster might, under other conditions, just be frustratingly slow without timing out or getting cancelled. This noise and variance can make it challenging to tell whether or not a change has had the intended effect without benchmarking independent of production paths.
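One rough way to separate queries that are inherently slow from ones that were only slow due to cluster conditions is to look at the spread of durations for the same normalized query shape over time; a sketch (the window and the minimum run count are arbitrary choices):

```sql
-- Sketch: duration spread per normalized query shape. A large gap between the median
-- and the max suggests external factors (cluster load, IO contention) rather than a
-- query that is inherently slow.
SELECT
    normalized_query_hash,
    count() AS runs,
    round(quantile(0.5)(query_duration_ms)) AS p50_ms,
    round(quantile(0.95)(query_duration_ms)) AS p95_ms,
    max(query_duration_ms) AS max_ms
FROM system.query_log
WHERE event_date >= today() - 7
  AND type = 'QueryFinish'
GROUP BY normalized_query_hash
HAVING runs >= 10
ORDER BY p95_ms DESC
LIMIT 20
```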

We're also not going to be able to make all queries fast: queries that must read a ton of data are just going to be slow to execute. It is difficult to use the query log data for guidance about which queries have room for improvement and are candidates for optimization, versus those that are just inherently slow.

The room for improvement for these queries seems to be mostly in:

Levers we can pull to fix problems

Ordered roughly by degree of impact:

  1. Instance type changes: Better metal means faster queries due to increased IO throughput, faster CPUs, more memory, etc. ClickHouse team responsibility, mostly.
  2. Cluster topology: Better page cache utilization/affinity, reducing network traffic between nodes; also includes workload isolation so that different types of queries with different resource demands or latency expectations don't interfere with each other. Also mostly ClickHouse team.
  3. Generalizable improvements: High-leverage changes (e.g. person overrides) that apply to many different types of queries. Involves schema trickery or query rewriting. Stuff like JSON optimizations, materialization work, etc. (see the sketch after this list).
  4. Hot spot optimization: Localized improvements targeted at specific queries. Requires a good deal of baseline understanding to be able to identify these changes and make them efficiently and safely/reliably. There's also a bit of a catch here: any optimizations we make for ourselves by tweaking queries are likely to be optimizations that would also be useful for those writing their own queries via HogQL, or even for new queries written internally for products we haven't built yet. If we can generalize them (i.e. rules-based optimization), that seems preferable (but that preference shouldn't block small improvements or low-hanging fruit from making forward progress).
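As a sketch of the materialization idea mentioned in lever 3: precompute a frequently-filtered JSON property into its own column so queries stop parsing the properties blob on every read. The table and property names here are illustrative, not our actual schema changes.

```sql
-- Sketch: materialize a JSON property into a dedicated column.
-- Table and property names are placeholders for illustration.
ALTER TABLE events
    ADD COLUMN IF NOT EXISTS mat_current_url String
    MATERIALIZED JSONExtractString(properties, '$current_url');

-- Backfill existing parts so historical data benefits too (can be expensive).
ALTER TABLE events MATERIALIZE COLUMN mat_current_url;
```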

Factors in prioritizing what should get addressed first

  1. Higher priority customers over lower priority ones (these teams are often highlighted in the dashboards), and problems that impact more teams over those that impact fewer teams
  2. Failures over slow queries (and fatal slow queries over ones that are just slow but run to completion)
  3. Queries that are associated with the online (or default) workload over the offline workload
  4. Non-HogQL queries (i.e. queries that we construct, typically product analytics ones) over HogQL ad hoc queries that are written by users

Areas for improvement

Dashboards and analysis

Things about the query log that might not be obvious to a newcomer

A lot is noted here https://posthog.com/handbook/engineering/clickhouse/performance too. (I didn't realize that existed before writing this, so there is some duplication.)

Awareness and alerting

Ideally, over the longer term, this performance work would be more push-based (driven by problem identification and alerting) than pull-based (browsing dashboards and query log data to see what might be a problem, or just waiting until customers notify us of an issue). The state of the system is continuously changing, due both to changes we are making and to changes in the distribution of ingested data, so keeping things moving along smoothly is going to take some constant level of attention.

We see so many different types of queries (queries of the same type with different parameters, or queries of the same type over different customer data sets of different shapes and sizes) that monitoring this based on production behavior has to be extremely targeted/granular to be useful, even before considering the variability of the production system. Categorization here is challenging.
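For a sense of the granularity involved, the monitoring would have to track something like per-(query type, team) latency over time; the log_comment keys here are assumptions about how queries are tagged.

```sql
-- Sketch: daily p95 latency per (query type, team), roughly the granularity at which
-- regressions would need to be tracked for push-based alerting. The log_comment
-- JSON keys 'query_type' and 'team_id' are assumptions about how queries are tagged.
SELECT
    toStartOfDay(event_time) AS day,
    JSONExtractString(log_comment, 'query_type') AS query_type,
    JSONExtractInt(log_comment, 'team_id') AS team_id,
    count() AS runs,
    round(quantile(0.95)(query_duration_ms)) AS p95_ms
FROM system.query_log
WHERE event_date >= today() - 14
  AND type = 'QueryFinish'
GROUP BY day, query_type, team_id
ORDER BY day, p95_ms DESC
```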

Potential blind spots

Perceived slowness outside of the ClickHouse request/response cycle is not visible via the query log. Using the query log to determine what is "slow" from a user point of view might be misleading if there are other intermediary layers adding meaningful latency.

Other resources I found useful

Documentation

Talks and tutorials