PostHog / posthog

Clickhouse performance improvement #5112

Closed - EDsCODE closed this issue 1 year ago

EDsCODE commented 3 years ago

Is your feature request related to a problem? Please describe.

We want to aspire to have subsecond query times. Currently, queries might take several seconds or simply fail with a memory error.

Funnel specific improvement:

We rely on ClickHouse's experimental window functions, which means we're subject to certain constraints that have not been solved yet. In particular, as noted by Altinity in a blog post:

For distributed tables, the current implementation forces window function calculation to be performed on the initiator node. This means that window function calculation is not spread out between shards but instead each shard has to send its data to the initiator to perform calculations.

This is immediately apparent when you try to run the funnel query on a "large" amount of data. For example, this query collects 300m events and runs out of memory.
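To make the constraint concrete, here is a heavily simplified sketch of the kind of window-function funnel query involved; the table name, event names and settings flag are illustrative assumptions, not the actual PostHog query:

```sql
-- Hedged sketch (not the real query): for each 'sign up' event, check whether
-- the same person later performs 'purchase'. On a Distributed table, the
-- window calculation below runs entirely on the initiator node, so every
-- shard first has to ship its matching rows there.
SELECT
    person_id,
    event,
    timestamp,
    max(event = 'purchase') OVER (
        PARTITION BY person_id
        ORDER BY timestamp
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
    ) AS reaches_purchase
FROM events_distributed          -- assumed Distributed table name
WHERE event IN ('sign up', 'purchase')
SETTINGS allow_experimental_window_functions = 1;   -- required while experimental
```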

Possible next steps:

  1. Don't use ClickHouse window functions yet (might not be possible if we want the granularity we can achieve right now)
  2. Fork and implement distributed window functions (could be too technically complex to be done quickly)
  3. Extend windowFunnel functionality in the ClickHouse source
  4. Sampling

General improvement:

  1. We join person_ids into the events table whenever we need to do any queries that rely on uniqueness of persons. This is effectively every query besides a simple trends query that just counts totals of events. These joins are costly
  2. We deserialize JSON objects in order to filter on properties (both event and person properties); this and the join above are sketched right after this list
  3. We currently calculate action relationships at query time
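To make points 1 and 2 concrete, here is a hedged sketch of what such a "unique users doing X" query roughly looks like today; the table, column and property names are illustrative rather than the exact PostHog schema:

```sql
-- The join from point 1 and the JSON extraction from point 2 both happen at
-- query time, on every run.
SELECT count(DISTINCT pdi.person_id) AS unique_users
FROM events AS e
INNER JOIN person_distinct_id AS pdi      -- the costly join (point 1)
    ON e.distinct_id = pdi.distinct_id
WHERE e.team_id = 1
  AND e.event = '$pageview'
  -- point 2: properties is a String column holding JSON, parsed per row
  AND JSONExtractString(e.properties, '$browser') = 'Chrome';
```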

Possible next steps:

  1. Figure out how much of an improvement these solutions would provide. Would it allow us to efficiently run queries at the scale of hundreds of millions of events in a query?
  2. Figure out which improvements are most important to tackle first. Is property deserialization more costly than the person id joins?
  3. Establish benchmarks of specific queries that are currently not performant
macobo commented 3 years ago

Goals

We want to aspire to have subsecond query times.

I assume this is for large customers e.g. 572's size. I don't think this is a reasonable goal (too high) because:

  1. No analytics service really clears it without cheating (see below). E.g. Heap, Mixpanel, Auryc etc. all have queries which can take 5+ seconds even with much more modest flows.
  2. I think it's not needed during exploration if the UX around query construction is nice. Again, <5 seconds feels fast enough if you have to press a "run" button anyways.
  3. A lot of what users see is calculated for them in the background. E.g. if a dashboard item takes 20s to calculate in the background but shows instantly, that's great.

I think our goal should be two-fold:

  1. Aim for <5s for loading insights during data exploration even for large customers
  2. Build good UX around "perceived performance"

How to get there: Ideas

The above post points out a few good options; let me add a few more. In general the answer is one of "fine-tuning", "being smarter" or "cheating".

1. Being smarter: person_ids on events

We join person_ids into the events table whenever we need to do any queries that rely on uniqueness of persons. This is effectively every query besides a simple trends query that just counts totals of events. These joins are costly

This trades off ingestion speed for query gains.

This would for sure speed up queries if it was free. However, I think this is one of the hardest things to implement because of person merges:

  1. It might require UPDATE queries to update events, which is really really costly in ClickHouse terms
  2. Or you end up with a CollapsingMergeTree, SELECTing from ClickHouse + INSERTing a bunch of sign = -1 rows (sketched below). The GROUP BY queries might also end up being slower than what they replace.

It's also really really susceptible to race conditions. E.g. what if you UPDATE but new events for that person_id are sitting in the Kafka queue? Or you SELECT + INSERT and hit the same problem.

Other companies have gone down this route, here be dragons.
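For reference, a minimal sketch of what the CollapsingMergeTree variant from point 2 could look like; the table and columns are assumptions for illustration, not a proposal of the exact schema:

```sql
-- Assumed table: events denormalized with person_id, plus a sign column so
-- rows can be cancelled out when persons are merged.
CREATE TABLE events_with_person
(
    uuid      UUID,
    event     String,
    timestamp DateTime64(6),
    team_id   Int64,
    person_id UUID,
    sign      Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY (team_id, toDate(timestamp), event, uuid);

-- On a person merge you would SELECT the affected rows, re-INSERT them with
-- sign = -1, and INSERT fresh copies with the new person_id and sign = 1.
-- Until background merges collapse the pairs, queries have to account for
-- sign (e.g. sum(sign)) to stay correct - hence the race-condition worries.
```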

2. Fine-tuning: deserializing JSON objects

This trades off ingestion speed for IO gains (query speed).

I think the MATERIALIZED column version of this is a good potential optimization and not hard to implement. It does add some cost to Kafka ingestion, but we haven't been struggling there.

While fine-tuning is usually further down the list of optimizations you'd want to make, this one has the potential to speed some queries up by tens of percent.
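As a rough illustration of the MATERIALIZED column idea (the property and column names are made up):

```sql
-- Materialize a frequently-filtered property into its own typed column so
-- queries read a small column instead of parsing the JSON blob on every row.
ALTER TABLE events
    ADD COLUMN mat_browser String
    MATERIALIZED JSONExtractString(properties, '$browser');

-- New inserts populate mat_browser automatically; backfilling existing parts
-- is a separate, heavier operation. Queries can then filter on the column:
SELECT count()
FROM events
WHERE team_id = 1 AND mat_browser = 'Chrome';
```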

3. Being smarter: calculate action relationships at query time

This trades off disk space/extra ingestion for a query speedup.

I'm not familiar enough w/ current plugin server action matching implementation or how the queries would change to judge it though.

4. Cheating: Sampling

This trades off accuracy for potential speed

We could set up sampling on the events table based on SAMPLE BY hash(distinctId) or something similar.

For larger customers, we could indicate in insights that a query is sampled (with a toggle), speeding up data exploration a lot. In dashboards (since those queries are not time-sensitive) we could skip sampling by default.
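A hedged sketch of how the SAMPLE BY setup could look (the schema is illustrative; the sampling key has to be part of the sorting key):

```sql
CREATE TABLE events_sampled
(
    event       String,
    timestamp   DateTime,
    team_id     Int64,
    distinct_id String,
    properties  String
)
ENGINE = MergeTree
ORDER BY (team_id, toDate(timestamp), cityHash64(distinct_id))
SAMPLE BY cityHash64(distinct_id);

-- Exploration queries could then read ~10% of users and scale counts back up:
SELECT count() * 10 AS approx_pageviews
FROM events_sampled SAMPLE 1 / 10
WHERE team_id = 1 AND event = '$pageview';
```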

5. Cheating: don't recalculate everything (trends)

Currently every time we run a trends query, we recalculate everything, even if no data has changed. Instead, we could:

  1. Save the results (bucketed by day) + max _timestamp
  2. Next time, run the query only on data with _timestamp greater than the saved one
  3. Combine the two results

This would speed trends up significantly. Not sure if/how this could be applied to other analyses though.

Product which implements this: https://github.com/trickstercache/trickster
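A rough sketch of steps 1-3, assuming the daily buckets and the max _timestamp are cached somewhere (the cache itself and the literal timestamp below are placeholders):

```sql
-- Step 2: only scan data newer than what was already computed and cached.
SELECT toStartOfDay(timestamp) AS day, count() AS value
FROM events
WHERE team_id = 1
  AND event = '$pageview'
  AND _timestamp > '2021-07-01 00:00:00'   -- max _timestamp from the cached run (step 1)
GROUP BY day;

-- Step 3: merge these fresh buckets with the cached ones (summing any day that
-- appears in both) before returning the trend.
```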

6. Cheating: don't recalculate everything (funnels)

Currently we run a single query per funnel. What if we instead:

  1. Ran 1 query per funnel step, saving dates + persons per step
  2. Combined them later

This would allow us to skip reprocessing old data, similar to proposal (5).
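A hedged sketch of what one such per-step query might look like (the step name and schema are illustrative):

```sql
-- One query like this per funnel step: record when each person first reached
-- the step; a later pass combines the per-step results (and cached old ranges).
SELECT pdi.person_id AS person_id, min(e.timestamp) AS first_reached_at
FROM events AS e
INNER JOIN person_distinct_id AS pdi ON e.distinct_id = pdi.distinct_id
WHERE e.team_id = 1
  AND e.event = 'sign up'
GROUP BY pdi.person_id;
```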

7. Being smarter: Pre-aggregating data

Trade-off: Extra space for query speed

@marcushyett-ph has mentioned that FB analytics systems rely heavily on pre-aggregating data for analysis. I don't have a clear idea of the trade-offs they are making (e.g. are they limiting pivoting or bucketing data in ways we don't?), but this could speed up some queries a lot.
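As one possible shape of this in ClickHouse terms (purely a sketch, all names illustrative), daily uniques per event could be pre-aggregated with a materialized view into an AggregatingMergeTree:

```sql
-- Counts unique distinct_ids; unique *persons* would need person_id on events.
CREATE TABLE daily_event_uniques
(
    team_id Int64,
    day     Date,
    event   String,
    uniques AggregateFunction(uniq, String)
)
ENGINE = AggregatingMergeTree
ORDER BY (team_id, day, event);

CREATE MATERIALIZED VIEW daily_event_uniques_mv TO daily_event_uniques AS
SELECT team_id, toDate(timestamp) AS day, event, uniqState(distinct_id) AS uniques
FROM events
GROUP BY team_id, day, event;

-- Reading back is cheap, but you can only slice by what was aggregated up front
-- (this is the pivoting/bucketing limitation mentioned above):
SELECT day, uniqMerge(uniques) AS daily_uniques
FROM daily_event_uniques
WHERE team_id = 1 AND event = '$pageview'
GROUP BY day
ORDER BY day;
```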

8. UX optimization: dashboard refreshes

Instead of taking dashboard graphs away when refreshing, we could show the previous dataset until data has finished loading, one graph at a time, with a loading indicator

9. UX optimization: Fail faster

Currently our cluster is set to reject queries that would take longer than 180 seconds. In the frontend, however, we:

  1. Show a timeout warning after 15 seconds
  2. After 60 seconds the request often gets cancelled by our load balancer/nginx/whatever

There is room for improvement here:

  1. Polling rather than long-polling query results
  2. Shipping results in parts as data comes in
  3. Rejecting frontend-initiated ClickHouse queries that would take longer than X seconds (60?). Note we still want to allow these from celery/dashboard refreshes.
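On point 3, ClickHouse can enforce a per-query cap via settings; a minimal sketch, with 60 seconds only as the example number from above:

```sql
-- Frontend-initiated queries could pass a tighter limit than the cluster-wide
-- 180s, while celery/dashboard refreshes keep the default.
SELECT count()
FROM events
WHERE team_id = 1 AND event = '$pageview'
SETTINGS max_execution_time = 60, timeout_overflow_mode = 'throw';
```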

10. Leverage projections (new feature)

ClickHouse has a new feature named projections. These can be used to materialize sorted/grouped columns, which can then speed up queries - potentially useful for the persons/person_distinct_id joins, for example, or other places where we do tricky but static GROUP BY/ORDER BY queries.
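A minimal sketch of how a projection could be added, assuming (purely for illustration) we want person_distinct_id pre-sorted by distinct_id:

```sql
ALTER TABLE person_distinct_id
    ADD PROJECTION by_distinct_id
    (
        SELECT distinct_id, person_id, team_id
        ORDER BY distinct_id
    );

-- Build the projection for data that already exists:
ALTER TABLE person_distinct_id MATERIALIZE PROJECTION by_distinct_id;

-- Depending on the ClickHouse version, the optimizer may need
-- allow_experimental_projection_optimization enabled to actually use it.
```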

11. Being smarter - distinct_id on persons table

Random other idea: store distinct_ids on the person table? It would remove one join when filtering on person properties, though it might be slower for plain unique-user counts.

X: Fine-tuning the database cluster

There are other levers we can pull:

  1. Distributing data across more nodes, if subqueries properly distribute to them as well (see Eric's post above on how this might fail, though)
  2. Faster network speeds
  3. More and faster CPUs
  4. Faster disks
  5. More RAM

I doubt we're in the best possible configuration for the next 2 years; we can revisit this as measurement data emerges and we identify hotspots.

How to get there: execution

The steps Eric outlined above are correct. To prioritize, I'd leverage thinking from @marcushyett-ph to approach one product area at a time and:

  1. Make sure the measurements we have are adequate
    • i.e. is there a graph that clearly indicates a problem?
    • Can we identify which queries take long, and why?
  2. Dig into what's going on and which resources are saturated vs not during a query (there are some resources we haven't leveraged a lot yet)
  3. Make a prediction & fix it. The ideal order goes "cheats" > "fix obvious issues (e.g. missing predicates)" > "being smarter" > "fine-tuning"
  4. Measure whether the fix matches the prediction
fuziontech commented 3 years ago

Old work on putting the person on the event using an event store: https://github.com/PostHog/posthog/pull/2018

marcushyett-ph commented 3 years ago

We want to aspire to have subsecond query times.

I'd love to understand what the p90 of query times looks like today. It might be that the specific problems of one or two clients with uniquely complex setups actually fall into the top 10%, and we could reasonably achieve a sub-second query time goal for 90% of queries. This way we're focused on optimizing for "almost everyone getting an amazing experience" vs "everyone getting an okay experience". Personally I think the former will drive us to make the hard trade-offs that will increase overall customer satisfaction the most.

Make sure the measurements we have are adequate

Would love to see a comprehensive dashboard of our performance

Somewhat related to pre-aggregation

I wonder how necessary it is to have all event data up to date. Most people probably log in once per day (it's around 1.5 logins per person per day on average), and they don't need the events from the last 5 minutes to make a decision. Perhaps we could exploit this trade-off in some way (not sure how exactly).

macobo commented 3 years ago

Gave Marcus the rundown of our Grafana query dashboard + instance status metrics + app performance dashboard. One cool thing he noticed was that the /api/event endpoint is likely skewing the data a lot - unlike everything else, it polls frequently with a really cheap query.

marcushyett-ph commented 3 years ago

Indeed - any way we could have a quick graph of the query times (p50, p90, p99), excluding any "trivial" endpoints?

EDsCODE commented 3 years ago

+1 to the event query skewing run times. Going to compile data specific to certain types of insight queries to get a better look at runtimes. I shared preliminary results in a message on Slack.

neilkakkar commented 3 years ago

Just for preserving context somewhere:

I experimented with Live Views on our testk8s cluster to see if we could evade the trade-off between joining person_distinct_id and putting person_id on events.

It basically creates an up-to-date view of the events table, with the person_id and team_id appended to it.
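For context, a hedged guess at what such a Live View might look like (the column names are illustrative, and Live Views were experimental at the time, so support for JOINs in them depends on the ClickHouse version):

```sql
SET allow_experimental_live_view = 1;

CREATE LIVE VIEW events_with_person_lv AS
SELECT e.uuid, e.event, e.timestamp, e.team_id, pdi.person_id
FROM events AS e
INNER JOIN person_distinct_id AS pdi ON e.distinct_id = pdi.distinct_id;

-- Funnel queries would then read from events_with_person_lv instead of doing
-- the join themselves; WATCH is what keeps the view up to date.
```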

Comparing two funnels with this (~482k rows), I got these results:

with Live View: 2010ms, total ~482k rows in funnel
without Live View: 1436ms, total ~482k rows in funnel

I think it's slower because of the extra bells and whistles on Live Views, like WATCHing queries, but I wasn't 100% sure.

A weaker hypothesis: maybe this gets slower because the effective table size increases slightly vs joining (joining cuts over team_id much earlier). Just to confirm it's not this, I created a new view that cuts on a team_id as well: in effect, a table for just a single team. No dice here, either. (This table had ~2.5M events in total, vs 3.1M before. This isn't the most representative sample, since one team had almost all the events.)

There's some improvement, but nothing significant.

with Live View:  1900ms
without Live View:  1385.5ms

Edit: There must come a size at which pre-calculating the joins + the WATCH overhead is faster than on-the-fly joins. With the two sample cases we have, a linear extrapolation (very rough; too many assumptions about how WATCH scales) suggests that at ~10M events the performance should be comparable, and above that we'd get good gains from Live Views.

I'll leave this as an additional hypothesis to test once we get to that stage. Closing experiment for now.

EDsCODE commented 3 years ago

Here's some data analysis of trend query runtimes along with props used in trend queries.

Results:

fuziontech commented 3 years ago

This is fantastic. This just goes to show how dense queries on props really are and how long the long tail is. In the chart:

  • Dark green: 80% of query volume uses these keys from props
  • Middle green: 90% of query volume uses these keys from props
  • Light green: 95% of query volume uses these keys from props

Twixes commented 1 year ago

As far as I can tell, these concerns are solved with materialized properties and persons on events.