PostHog / posthog


Tim's Performance Mega Issue Q2/Q3 2024 #22413

Open timgl opened 4 months ago

timgl commented 4 months ago

A dumping ground for all performance related things I'm looking at.

TODOs are things I want to look at; checkboxes are things that are ready to be worked on by me or others.

Layers we go through when we make a query

UI

API calls/async queries

Caching

AST

Queries

Clickhouse

Tools I've built to look at performance

Reasons why an individual query times out

Done

Twixes commented 4 months ago

Quick Q: Does this list include all of https://github.com/PostHog/company-internal/issues/1379?

pauldambra commented 4 months ago

TODO: the interface sometimes feels sluggish even on an M1 MacBook. Profile the page.

maybe this https://posthog.slack.com/archives/C0113360FFV/p1715330140411809

tkaemming commented 2 months ago

This is mostly just a dump of my notes so a little scattered but hopefully there's some useful stuff in here.

Dashboards and tools to figure out what's happening

The overall query performance dashboard gives a good picture of what's going on across the cluster. It's biased towards online workloads because those queries are often more visible to users. This is a good place to start.

Many of these queries are composed of several snippets that are helpful to know about and are generally pretty self-explanatory: query type, high-priority team IDs, and query features.

Query features are characteristics known to sometimes slow down queries, tagged for ease of classification. The list is not comprehensive, and could probably use some attention from somebody who has more context. Just because a feature appears in a slow query does not imply that the presence of that feature is the reason the query is slow, but it can be a starting point for exploration or correlation, as in this example that shows error rate by feature or this example that shows error rate by feature and query type.

There is also an annotated query log question that includes some of this supplemental data and can be a useful starting point for new questions.
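To make the "query features" idea concrete, here's roughly the shape of that classification, run straight against system.query_log. This is only a sketch: the feature patterns and the seven-day window are guesses for illustration, and the real definitions live in the dashboard snippets mentioned above.

```sql
-- Sketch: error rate per "query feature", approximated by pattern-matching the
-- query text. The feature patterns below are illustrative assumptions, not the
-- dashboard's actual definitions.
SELECT
    multiIf(
        query LIKE '%person_distinct_id2%', 'pdi2-join',
        query LIKE '%sessions%JOIN%',       'sessions-join',
        query LIKE '%person%JOIN%',         'persons-join',
        'other'
    ) AS feature,
    count() AS queries,
    countIf(type = 'ExceptionWhileProcessing') AS errors,
    round(errors / queries, 3) AS error_rate
FROM system.query_log
WHERE event_date >= today() - 7
  AND type IN ('QueryFinish', 'ExceptionWhileProcessing')
GROUP BY feature
ORDER BY error_rate DESC
```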

Types of problems we run into during query execution

Many of the questions linked to by the dashboard either group by, or can be filtered by, the type of problem we encountered while running a query.

Errors

MEMORY_LIMIT_EXCEEDED errors (241) are pretty easy to reproduce, and it's relatively easy to identify whether or not they've been fixed after a change is made: queries run over the same dataset after tuning or other fixes will either start to work, or they won't. Other factors typically aren't significant enough to make these queries start (or stop) failing without the query or the data being queried changing.
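A sketch of how the recent failures of this kind could be pulled so the same queries can be re-run after a tuning change. It assumes ClickHouse error code 241 for MEMORY_LIMIT_EXCEEDED and that team metadata is tagged in log_comment as JSON; the key name is a guess.

```sql
-- Sketch: recent MEMORY_LIMIT_EXCEEDED (code 241) failures, worst offenders first,
-- so the same queries can be re-run after a fix to confirm it worked.
-- The log_comment JSON key 'team_id' is an assumption about how queries are tagged.
SELECT
    JSONExtractInt(log_comment, 'team_id') AS team_id,
    formatReadableSize(memory_usage) AS peak_memory,
    query_duration_ms,
    query
FROM system.query_log
WHERE event_date >= today() - 7
  AND type = 'ExceptionWhileProcessing'
  AND exception_code = 241
ORDER BY memory_usage DESC
LIMIT 20
```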

Typical causes

One or more of

Known issues
  1. Some stuff still needs to be moved to HogQL to make use of existing optimizations: it probably doesn't make much sense to duplicate optimizations such as overrides into old queries rather than just updating them. This is also the case with several other types of queries: anything that still uses person_distinct_id2 is an obvious tell (feature pdi2-join), such as get_breakdown_prop_values, user_blast_radius, etc.
  2. Joins will probably need regular, somewhat continuous attention: the sessions join (feature sessions-join) is already starting to get big for some customers and causing queries to hit memory limits. The persons join can get big too (persons-join), but this seems to be largely (but not always) ameliorated by overrides changes.

It's difficult to know where the ceiling is for either of these. Maybe we should set up some alerting on memory_usage to let us know if we're starting to approach the ceiling? In particular because teams with a lot of data (and therefore typically high-value ones) run into these issues first.
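Something like this could be the basis for that alerting; the 32 GiB per-query limit and the log_comment key are placeholder assumptions, not our actual settings.

```sql
-- Sketch: weekly high-water mark of memory_usage per team, as a fraction of an
-- assumed per-query limit, to spot teams approaching the ceiling before they hit it.
-- The 32 GiB limit and the log_comment 'team_id' key are illustrative assumptions.
WITH 32 * 1024 * 1024 * 1024 AS memory_limit_bytes
SELECT
    JSONExtractInt(log_comment, 'team_id') AS team_id,
    max(memory_usage) AS peak_bytes,
    round(peak_bytes / memory_limit_bytes, 2) AS fraction_of_limit
FROM system.query_log
WHERE event_date >= today() - 7
  AND type = 'QueryFinish'
GROUP BY team_id
HAVING fraction_of_limit > 0.8
ORDER BY fraction_of_limit DESC
```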

Slow queries

These come in two flavors, fatal and non-fatal slow queries.

The threshold for slow queries is defined in the SQL snippet defining problem types. It would probably make sense for this to be more discriminating than just applying a single threshold to all types of queries.

Fatal queries are TIMEOUT_EXCEEDED errors (159) and TOO_SLOW errors (160). The difference is that we didn't wait for the result set to be computed, so we don't know how slow they actually would have been to run to completion. Latency distributions are going to be skewed towards lower values, since these queries end up either excluded from the distribution or included at an artificially low value. The fact that they don't return a result set to the user obviously makes the impact of a fatal slow query on user experience more significant.
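For reference, a minimal sketch of that problem-type classification. The 10s threshold is a placeholder, and as noted above it should probably vary by query type; the real definition is in the snippet defining problem types mentioned earlier.

```sql
-- Sketch: classify queries into problem types. The 10s threshold is a placeholder;
-- the real snippet defines this (and should ideally vary per query type).
SELECT
    multiIf(
        exception_code IN (159, 160), 'fatal slow (timeout/too slow)',
        exception_code != 0,          'other error',
        query_duration_ms > 10000,    'non-fatal slow',
        'ok'
    ) AS problem_type,
    count() AS queries
FROM system.query_log
WHERE event_date >= today() - 7
  AND type IN ('QueryFinish', 'ExceptionWhileProcessing')
GROUP BY problem_type
ORDER BY queries DESC
```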

Typical causes

One or more of:

These problems are not always reproducible: execution time can be significantly impacted by other cluster activity (backups, mutations, partial outages, etc.) that causes IO contention or other capacity limitation/saturation issues. This can lead to lots of false positives when scanning the query log for potential optimization targets, when something is only slow due to external circumstances. A query that is slow enough to cause an error on an overloaded cluster might, under other conditions, just be frustratingly slow without timing out or getting cancelled. This noise and variance can make it challenging to tell whether or not a change has had the intended effect without benchmarking independent of production paths.
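One rough way to separate queries that are inherently slow from ones that were only slow due to cluster conditions is to look at the spread of durations for the same normalized query shape over time; a sketch (the window and the minimum run count are arbitrary choices):

```sql
-- Sketch: duration spread per normalized query shape. A large gap between the median
-- and the max suggests external factors (cluster load, IO contention) rather than a
-- query that is inherently slow.
SELECT
    normalized_query_hash,
    count() AS runs,
    round(quantile(0.5)(query_duration_ms)) AS p50_ms,
    round(quantile(0.95)(query_duration_ms)) AS p95_ms,
    max(query_duration_ms) AS max_ms
FROM system.query_log
WHERE event_date >= today() - 7
  AND type = 'QueryFinish'
GROUP BY normalized_query_hash
HAVING runs >= 10
ORDER BY p95_ms DESC
LIMIT 20
```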

We're also not going to be able to make all queries fast: queries that must read a ton of data are just going to be slow to execute. It is difficult to use the query log data for guidance about which queries have room for improvement and are candidates for optimization, versus those that are just inherently slow.

The room for improvement for these queries seems to be mostly in:

Levers we can pull to fix problems

Ordered roughly by degree of impact:

  1. Instance type changes: Better metal means faster queries due to increased IO throughput, faster CPUs, more memory, etc. ClickHouse team responsibility, mostly.
  2. Cluster topology: Better page cache utilization/affinity, reducing network traffic between nodes; also includes workload isolation so that different types of queries with different resource demands or latency expectations don't interfere with each other. Also mostly ClickHouse team.
  3. Generalizable improvements: High-leverage changes (e.g. person overrides) that apply to many different types of queries. Involves schema trickery or query rewriting. Stuff like JSON optimizations, materialization work, etc. (see the sketch after this list).
  4. Hot spot optimization: Localized improvements targeted at specific queries. Requires a good deal of baseline understanding to be able to identify these changes and make them efficiently and safely/reliably. There's also a bit of a catch here: any optimizations we make for ourselves by tweaking queries are likely to be optimizations that would also be useful for those writing their own queries via HogQL, or even for new queries written internally for products we haven't built yet. If we can generalize them (i.e. rules-based optimization), that seems preferable (but that preference shouldn't block small improvements or low-hanging fruit from making forward progress).
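As a sketch of the materialization idea mentioned in lever 3: precompute a frequently-filtered JSON property into its own column so queries stop parsing the properties blob on every read. The table and property names here are illustrative, not our actual schema changes.

```sql
-- Sketch: materialize a JSON property into a dedicated column.
-- Table and property names are placeholders for illustration.
ALTER TABLE events
    ADD COLUMN IF NOT EXISTS mat_current_url String
    MATERIALIZED JSONExtractString(properties, '$current_url');

-- Backfill existing parts so historical data benefits too (can be expensive).
ALTER TABLE events MATERIALIZE COLUMN mat_current_url;
```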

Factors in prioritizing what should get addressed first

  1. Higher priority customers over lower priority ones (these teams are often highlighted in the dashboards), and problems that impact more teams over those that impact fewer teams
  2. Failures over slow queries (and fatal slow queries over ones that are just slow but run to completion)
  3. Queries that are associated with the online (or default) workload over the offline workload
  4. Non-HogQL queries (i.e. queries that we construct, typically product analytics ones) over HogQL ad hoc queries that are written by users

Areas for improvement

Dashboards and analysis

Things about the query log that might not be obvious to a newcomer

A lot is noted here https://posthog.com/handbook/engineering/clickhouse/performance too. (I didn't realize that existed before writing this, so there is some duplication.)

Awareness and alerting

Ideally, over the longer term, this performance work would be more push-based (driven by problem identification and alerting) than pull-based (browsing dashboards and query log data to see what might be a problem, or just waiting until customers notify us of an issue). The state of the system is continuously changing, due both to changes we are making and to changes in the distribution of ingested data, so keeping things moving along smoothly is going to take some constant level of attention.

We see so many different types of queries (queries of the same type with different parameters, or queries of the same type over different customer data sets of different shapes and sizes) that monitoring this based on production behavior has to be extremely targeted/granular to be useful, even before considering the variability of the production system. Categorization here is challenging.
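For a sense of the granularity involved, the monitoring would have to track something like per-(query type, team) latency over time; the log_comment keys here are assumptions about how queries are tagged.

```sql
-- Sketch: daily p95 latency per (query type, team), roughly the granularity at which
-- regressions would need to be tracked for push-based alerting. The log_comment
-- JSON keys 'query_type' and 'team_id' are assumptions about how queries are tagged.
SELECT
    toStartOfDay(event_time) AS day,
    JSONExtractString(log_comment, 'query_type') AS query_type,
    JSONExtractInt(log_comment, 'team_id') AS team_id,
    count() AS runs,
    round(quantile(0.95)(query_duration_ms)) AS p95_ms
FROM system.query_log
WHERE event_date >= today() - 14
  AND type = 'QueryFinish'
GROUP BY day, query_type, team_id
ORDER BY day, p95_ms DESC
```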

Potential blind spots

Perceived slowness outside of the ClickHouse request/response cycle is not visible via the query log. Using the query log to determine what is "slow" from a user point of view might be misleading if there are other intermediary layers adding meaningful latency.

Other resources I found useful

Documentation

Talks and tutorials