timgl opened 4 months ago
Quick Q: Does this list include all of https://github.com/PostHog/company-internal/issues/1379?
TODO: interface sometimes feels sluggish even on an M1 MacBook. Profile the page.
Maybe this: https://posthog.slack.com/archives/C0113360FFV/p1715330140411809
This is mostly just a dump of my notes, so it's a little scattered, but hopefully there's some useful stuff in here.
The overall query performance dashboard gives a good picture of what's going on across the cluster. It's biased towards online workloads because those queries are often more visible to users. This is a good place to start.
Many of these queries are composed of several snippets that are helpful to know about and are generally pretty self-explanatory: query type, high priority team IDs, and query features.
Query features are attributes known to sometimes slow down queries, tagged for ease of classification. The list is not comprehensive and could probably use some attention from somebody who has more context. Just because a feature appears in a slow query does not imply that the presence of that feature is the reason the query is slow, but it can be a starting point for exploration or correlation, like in this example that shows error rate by feature or this example that shows error rate by feature and query type.
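For illustration, an "error rate by query type" style question against the query log could look roughly like this (a sketch only: the `query_type` key inside `log_comment` is an assumed name, not necessarily the real structure):

```sql
-- Sketch: error rate by query type over the last 7 days.
-- Assumes query metadata is stored as JSON in log_comment with a
-- `query_type` key (hypothetical key name; adjust to the real schema).
SELECT
    JSONExtractString(log_comment, 'query_type') AS query_type,
    countIf(exception_code != 0) AS errors,
    count() AS total,
    round(errors / total, 4) AS error_rate
FROM clusterAllReplicas(posthog, system, query_log)
WHERE is_initial_query
  AND type != 'QueryStart'          -- one row per query outcome
  AND event_time > now() - INTERVAL 7 DAY
GROUP BY query_type
ORDER BY error_rate DESC
```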
There is also an annotated query log question that includes some of this supplemental data and can be a useful starting point for new questions.
Many of the questions linked to by the dashboard either group by, or can be filtered by, the type of problem encountered while running a query.
`MEMORY_LIMIT_EXCEEDED` errors (241) are pretty easy to reproduce, and it is relatively easy to tell whether they've been fixed after a change is made: queries run over the same dataset after tuning or other fixes will either start to work, or they won't. Other factors typically aren't significant enough to make these queries start (or stop) failing without the query or the data being queried changing.
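For reference, pulling recent examples of these out of the query log might look something like this (a sketch; it leans on the cluster/table names used later in these notes):

```sql
-- Sketch: recent MEMORY_LIMIT_EXCEEDED (241) failures across the cluster,
-- biggest memory consumers first, so they can be re-run and checked after a fix.
SELECT
    event_time,
    formatReadableSize(memory_usage) AS peak_memory,
    query_duration_ms,
    exception
FROM clusterAllReplicas(posthog, system, query_log)
WHERE type = 'ExceptionWhileProcessing'
  AND exception_code = 241          -- MEMORY_LIMIT_EXCEEDED
  AND is_initial_query
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY memory_usage DESC
LIMIT 100
```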
One or more of:

- `GROUP BY` clauses with high cardinality of aggregation keys
- `GROUP BY` clauses that require aggregating large values: for example `argMax(person.properties, person.version)` when only a select few properties are used post-aggregation (such as in the `HAVING` clause, or referenced by the LHS table in a join on `person`). This can be improved with rules-based optimization in some cases (I think HogQL already might do this in some of these cases? Most of what I've seen is on legacy queries, but I'm not sure for certain.) Also, the further behind we get on merges, the worse this gets for high-touch tables like `person`, etc. (There is a sketch of this pattern below.)
- Joining `person_distinct_id2` is an obvious tell (feature `pdi2-join`), in queries such as `get_breakdown_prop_values`, `user_blast_radius`, etc.
- The sessions join (`sessions-join`) is already starting to get big for some customers and causing queries to hit memory limits. The persons join can get big too (`persons-join`), but this seems to be largely (but not always) ameliorated by overrides changes.

It's difficult to know where the ceiling is for either of these. Maybe we should set up some alerting on `memory_usage` to let us know if we're starting to approach the ceiling? In particular because teams with a lot of data (and therefore typically high value ones) run into these issues first.
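To make the `argMax(person.properties, person.version)` point above concrete, here's a minimal sketch of the pattern (illustrative table and column names, not the exact production schema): aggregating the whole properties blob keeps a large string per group, while extracting the one needed property first keeps the aggregation state small.

```sql
-- Memory-heavy: the full properties JSON blob is carried through the
-- aggregation even though only one key is read afterwards.
SELECT id, argMax(properties, version) AS latest_properties
FROM person
GROUP BY id
HAVING JSONExtractString(latest_properties, 'email') != '';

-- Lighter: extract the needed property before aggregating, so per-group
-- state is a short string instead of the whole blob.
SELECT id, argMax(JSONExtractString(properties, 'email'), version) AS email
FROM person
GROUP BY id
HAVING email != '';
```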
Slow queries come in two flavors: fatal and non-fatal.
The threshold for slow queries is defined in the SQL snippet defining problem types. It would probably make sense for this to be more discriminating than just applying a single threshold to all types of queries.
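As an illustration of what "more discriminating" could mean, the snippet could key the threshold off the query type instead of using one global value (a sketch; the thresholds and the `query_type` key are assumptions, not what the real snippet does):

```sql
-- Sketch: per-type slow-query threshold instead of one global cutoff.
-- Both thresholds and the query_type key name are illustrative assumptions.
SELECT
    JSONExtractString(log_comment, 'query_type') AS query_type,
    query_duration_ms,
    multiIf(
        query_type = 'export' AND query_duration_ms > 60000, 'slow',   -- exports get more slack
        query_duration_ms > 10000, 'slow',                             -- default cutoff
        'ok'
    ) AS problem
FROM clusterAllReplicas(posthog, system, query_log)
WHERE type = 'QueryFinish'
  AND is_initial_query
  AND event_time > now() - INTERVAL 1 DAY
```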
Fatal slow queries are `TIMEOUT_EXCEEDED` errors (159) and `TOO_SLOW` errors (160). The difference from non-fatal ones is that we didn't wait for the result set to be computed, so we don't know how slow they actually would have been to run to completion. Latency distributions are going to be skewed towards lower values since these queries end up either being excluded from the distribution, or included at an artificially low value. The fact that they don't return a result set to the user obviously makes the impact of a fatal slow query on user experience more significant.
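One way to see this skew in practice is to compare the latency distribution with and without the fatal queries (a sketch; for fatal queries `query_duration_ms` only reflects time until cancellation, so both numbers still understate "true" completion time):

```sql
-- Sketch: p95 latency over finished queries vs. all terminal query records.
-- Fatal queries only report time-until-cancellation, so neither figure
-- captures how long they would actually have taken to complete.
SELECT
    quantileIf(0.95)(query_duration_ms, type = 'QueryFinish') AS p95_finished_ms,
    quantile(0.95)(query_duration_ms) AS p95_all_ms
FROM clusterAllReplicas(posthog, system, query_log)
WHERE is_initial_query
  AND type != 'QueryStart'
  AND event_time > now() - INTERVAL 1 DAY
```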
One or more of:

- Overhead from high cardinality grouping and large joins can also be a factor.

When there are multiple causes, it's challenging to tease out which one in particular is the most significant problem in aggregate.
These problems are not always reproducible: execution time can be significantly impacted by other cluster activity (backups, mutations, partial outages, etc.) that causes IO contention or other capacity limitation/saturation issues. This can lead to lots of false positives when scanning the query log for potential optimization targets, when something is only slow due to external circumstances. A query slow enough to cause an error on an overloaded cluster might just be frustratingly slow, but not slow enough to time out or get cancelled, on a cluster under other conditions. This noise and variance can make it challenging to tell whether or not a change has had the intended effect without benchmarking independent of production paths.
We're also not going to be able to make all queries fast — queries that must read a ton of data are just going to be slow to execute. It is difficult to use the query log data to tell which queries have room for improvement and are candidates for optimization versus those that are just inherently slow.
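One crude way to triage this from the query log (a sketch, not a substitute for judgment): compare how long a query shape takes against how much data it reads, since "slow but reads little" is a better optimization candidate than "slow because it scans everything".

```sql
-- Sketch: slowest query shapes over the last day, with how much they read.
-- Low bytes_per_ms plus high avg_ms suggests room for optimization rather
-- than an inherently large scan.
SELECT
    normalizedQueryHash(query) AS query_shape,
    count() AS runs,
    avg(query_duration_ms) AS avg_ms,
    formatReadableSize(avg(read_bytes)) AS avg_read,
    avg(read_bytes) / greatest(avg(query_duration_ms), 1) AS bytes_per_ms
FROM clusterAllReplicas(posthog, system, query_log)
WHERE type = 'QueryFinish'
  AND is_initial_query
  AND event_time > now() - INTERVAL 1 DAY
GROUP BY query_shape
ORDER BY avg_ms DESC
LIMIT 50
```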
The room for improvement for these queries seems to be mostly in:

- Replacing `count(distinct column)` with `GROUP BY` when it's safe to do so, etc. The stuff on Tim's list. (A sketch of this rewrite is below.)
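A minimal sketch of that kind of rewrite (illustrative table/column names; whether it actually helps depends on the query shape and settings):

```sql
-- Original: count(DISTINCT ...) keeps an exact-distinct state per group.
SELECT team_id, count(DISTINCT person_id) AS persons
FROM events
GROUP BY team_id;

-- One rewrite: dedup with DISTINCT / GROUP BY in a subquery, then count rows.
SELECT team_id, count() AS persons
FROM (SELECT DISTINCT team_id, person_id FROM events)
GROUP BY team_id;

-- Or, when an approximate answer is acceptable, use uniq():
SELECT team_id, uniq(person_id) AS approx_persons
FROM events
GROUP BY team_id;
```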
Ordered roughly by degree of impact:

`team_id` being aggregated under `0`.)
`OSBytesRead` on offline cluster are typically backups, etc.)

A lot is noted here https://posthog.com/handbook/engineering/clickhouse/performance too. (I didn't realize that existed before writing this, there is duplication.)
- `FROM clusterAllReplicas(posthog, system, query_log)` is needed to query the entire cluster; `system.query_log` otherwise is only the log for the server the query is issued to.
- Filter on `is_initial_query` in the `WHERE` clause, otherwise errors for distributed queries will be counted multiple times. (You can also use `initial_query_id` to view all queries that were executed as part of a multi-stage distributed query.)
- `tables` (among others) is not filled in if the query does not start (`type = ExceptionBeforeStart`), which can happen even in scenarios where you might not expect it to occur, like timeouts.
- Supplemental query metadata lives in the `log_comment` column. The cleanest way to get access to this data consistently is through a snippet that parses the data in that column into a tuple, which can then be reused across any query that references that snippet.
- The `ProfileEvents` mapping also has a bunch of useful stuff. The difference between `OSReadChars` (includes page cache) and `OSReadBytes` (doesn't include page cache) can be used to get a sense of page cache hit rate, though sometimes the math doesn't add up. (A sketch combining several of these tips is below.)

Ideally this performance work over the longer term would be more push-based, driven by problem identification, versus pull-based, browsing dashboards and query log data to see what might be a problem (or just waiting until we are notified by customers of an issue). The state of the system is continuously changing, both due to changes we are making and changes in the distribution of ingested data, so keeping things moving along smoothly is going to take some constant level of attention.
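Putting several of the query-log tips above together, a skeleton question might look like this (a sketch; the `query_type` key name inside `log_comment` is an assumption):

```sql
-- Sketch: cluster-wide query log rollup with log_comment parsed inline and a
-- rough page-cache hit rate derived from ProfileEvents.
SELECT
    JSONExtractString(log_comment, 'query_type') AS query_type,        -- assumed key name
    count() AS queries,
    countIf(exception_code != 0) AS errors,
    quantile(0.95)(query_duration_ms) AS p95_ms,
    sum(ProfileEvents['OSReadChars']) AS os_read_chars,                -- includes page cache
    sum(ProfileEvents['OSReadBytes']) AS os_read_bytes,                -- actual device reads
    1 - os_read_bytes / nullIf(os_read_chars, 0) AS page_cache_hit_rate -- rough; doesn't always add up
FROM clusterAllReplicas(posthog, system, query_log)
WHERE is_initial_query                  -- don't double-count distributed sub-queries
  AND type != 'QueryStart'              -- one row per query outcome
  AND event_time > now() - INTERVAL 1 DAY
GROUP BY query_type
ORDER BY errors DESC
```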
We see so many different types of queries (queries of the same type with different parameters, or queries of the same type over different customer data sets of different shapes and sizes) that monitoring this based on production behavior has to be extremely targeted/granular to be useful, even before considering the variability of the production system. Categorization here is challenging.
Perceived slowness outside of the ClickHouse request/response cycle is not visible via the query log. Using the query log to determine what is "slow" from a user point of view might be a misrepresentation if there are other intermediary layers adding meaningful latency.
(`errorCodeToName(exception_code)` can also be used)

`GROUP BY` Implementation Information and Optimizations
A dumping ground for all performance-related things I'm looking at.
TODOs are things I want to look at; checkboxes are things that are ready to be worked on by me or others.
Layers we go through when we make a query:

- UI
- API calls/async queries
- Caching
- AST
- Queries
- ClickHouse
Tools I've built to look at performance
Reasons why an individual query times out
Done
22380