
Performance degradation in topN queries when SQL-compatible null handling is enabled #9321

Open sascha-coenen opened 4 years ago

sascha-coenen commented 4 years ago

Affected Version

0.16.0, 0.17.0

Description

Given a Druid v0.16.0 or v0.17.0 cluster configured with SQL-compatible null handling enabled, the initial performance we measured was unremarkable, but after a while there would be a drastic performance degradation for topN queries.

After much testing we found out that the initial performance of a freshly started Druid cluster would be consistently fast UNTIL a groupbyV2 query gets executed for the first time. After that, the performance of topN queries with a numerical sort would degrade by 70% or more. This degradation is specific to topN queries and also seems to apply only to heavy topN queries (8 aggregations, several sequential passes). Our tests are performed on a Druid cluster that exclusively holds segments produced with prior Druid versions or with SQL-compatible null handling disabled.

We looked at every operational metric we have but could not find a root cause for the degradation. The degradation would not fade out with time. Also, a forced full garbage collection would not recover any performance. Furthermore, the execution of a single groupbyV2 query (any groupbyV2 query) seems to trigger the degradation.

We have a performance test suite and a metrics dashboard. The screenshots from the perf test suite below show, in a before/after view, the degradation in topN queries after the execution of the first groupbyV2 query.

Furthermore, the dashboard shows a different test we performed to illustrate the performance degradation: we initially sent a sequential stream of topN queries to a freshly started Druid cluster for a long time. Then we issued a single groupbyV2 query while the stream of topN queries continued. One can clearly see how performance degrades immediately and is constant both before and after, just at different levels. The dashboard shows the segment-scan-time metric to illustrate that the degradation happens on the historicals, by way of increased scan times.

In an attempt to home in on root causes, we ran further tests with various Druid subsystems disabled:

As we keep sending the same query many times, we can also rule out effects caused by disk access because the segments needed for serving the query would be paged into memory.

Then we turned off the SQL-compatible null handling and the performance issue was gone.

[Screenshots: perf-suite-before-after, performance-issue-topn-groupby-sqlcompatiblenullhandling]

UPDATE: meanwhile we have updated our Druid cluster to version 0.17.0, re-ran our tests, and can confirm that the performance issue remains.

Following is an example query which would have degraded performance after the first execution of a groupbyV2 query:

```json
{
  "queryType": "topN",
  "dataSource": "supply-activities",
  "intervals": "2016-12-03T00/2016-12-04T00",
  "context": {
    "timeout": 180000,
    "priority": 0,
    "useCache": false,
    "populateCache": false,
    "useResultLevelCache": false,
    "populateResultLevelCache": false,
    "minTopNThreshold": 1000,
    "vectorize": "false"
  },
  "dimension": "adSpaceId",
  "metric": { "type": "numeric", "metric": "interactions" },
  "threshold": 50,
  "granularity": "all",
  "aggregations": [
    { "name": "auctionsWon", "type": "doubleSum", "fieldName": "auctionCount" },
    { "name": "views", "type": "longSum", "fieldName": "view" },
    { "name": "clicks", "type": "longSum", "fieldName": "click" },
    { "name": "interactions", "type": "longSum", "fieldName": "interactionCount" },
    { "name": "validInteractions", "type": "longSum", "fieldName": "validInteractionCount" },
    { "name": "erroneousInteractions", "type": "longSum", "fieldName": "erroneousInteractionCount" },
    { "name": "noAdInteractions", "type": "longSum", "fieldName": "noAdInteractionCount" },
    { "name": "accountableRevenue", "type": "doubleSum", "fieldName": "accountableRevenue" }
  ]
}
```

a2l007 commented 4 years ago

This might be related: https://github.com/apache/druid/pull/8822

sascha-coenen commented 4 years ago

I updated the ticket description. We have now tested this with druid 0.17.0 and can report that the issue is still present in this latest release.

suneet-s commented 4 years ago

Interesting finding... Can you profile the setup to see if there are any clues about what is different before and after running the first groupBy v2 query?

Some instructions on how to do that - https://support.imply.io/hc/en-us/articles/360033747953-Profiling-Druid-queries-using-flame-graphs

sascha-coenen commented 4 years ago

Hi, thanks for the pointer. sjk looks like a super nice tool, really easy to use.

I attached a zip archive with the "before and after" flamegraph reports. I'm not an expert at reading flamegraphs, but to me the biggest difference seems to be the org.apache.druid.query.aggregation.NullableNumericBufferAggregator.aggregate method.

flamegraphs.zip
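For readers not familiar with that method: when SQL-compatible null handling is enabled, Druid wraps numeric buffer aggregators in a nullable wrapper that tracks a null flag alongside the aggregated value, which is why this frame only shows up in this mode. The sketch below is a simplified, hypothetical illustration of that delegation pattern, not the actual Druid source; the interface and class names are stand-ins.

```java
import java.nio.ByteBuffer;

// Simplified illustration (NOT Druid source) of a nullable buffer-aggregator wrapper:
// every aggregate() call pays for an extra null check and a flag-byte write in the
// shared aggregation buffer before delegating to the underlying numeric aggregator.
public class NullableBufferAggregatorSketch
{
  interface BufferAggregator
  {
    void init(ByteBuffer buf, int position);
    void aggregate(ByteBuffer buf, int position);
  }

  // Stand-in for a column value selector that can report nulls.
  interface NullableSelector
  {
    boolean isNull();
  }

  static final byte IS_NULL = 1;
  static final byte NOT_NULL = 0;

  static final class NullableNumericAggregatorSketch implements BufferAggregator
  {
    private final NullableSelector selector;
    private final BufferAggregator delegate;

    NullableNumericAggregatorSketch(NullableSelector selector, BufferAggregator delegate)
    {
      this.selector = selector;
      this.delegate = delegate;
    }

    @Override
    public void init(ByteBuffer buf, int position)
    {
      // The slot starts out as "null"; the delegate's value lives one byte further in.
      buf.put(position, IS_NULL);
      delegate.init(buf, position + Byte.BYTES);
    }

    @Override
    public void aggregate(ByteBuffer buf, int position)
    {
      // Extra per-row branch and buffer write compared to the default null-handling path.
      if (!selector.isNull()) {
        buf.put(position, NOT_NULL);
        delegate.aggregate(buf, position + Byte.BYTES);
      }
    }
  }
}
```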

suneet-s commented 4 years ago

Thanks for the flamegraphs! I read the code last night, and based on the flamegraphs, the best explanation I can come up with is that something in the HotSpot-compiled code becomes less optimal after the groupBy query is run. The buffers are shared across the TopNQueryEngine and the GroupByQueryEngine, so somehow running a groupBy query is impacting topN performance.

Since it's this low level, I'd also question the reality of the scenario. It might be less visible if the performance tests were running a mixture of queries in both runs.

My guess at this point is that this is an edge case, and I don't think we'd see a huge difference if this were a mixed load scenario.

I'm also attaching screenshots from your flamegraphs that highlight what you pointed out, so others with more knowledge than me can either confirm these suspicions or suggest what might actually be happening.

When only topN queries were run, the time in NullableNumericBuffer#aggregate was mainly spent reading the bytes (LZ4...)

[Screenshot: topN-before-groupBy]

After a groupBy query was run and topN queries were profiled, NullableNumericBuffer#aggregate has a bunch of time where it isn't reading the bytes but is doing something else.

[Screenshot: topN-after-groupBy]

sascha-coenen commented 4 years ago

> Since it's this low level, I'd also question the reality of the scenario. It might be less visible if the performance tests were running a mixture of queries in both runs. My guess at this point is that this is an edge case, and I don't think we'd see a huge difference if this were a mixed load scenario.

Let me explain again; perhaps I can be clearer: from the point that a Druid historical executes a single groupbyV2 query for the first time, the execution of any later topN queries is much slower than before and remains at that deteriorated performance forever. Prior to the execution of a groupbyV2 query, the topN queries are consistently fast and remain so forever. These are steady states. Within a mixed workload, before or after, the topN queries will show the same degradation. It's not a dynamic, temporary effect.

If you look at the segment-scan-time graph, you see that scan times for topN queries are very consistent before and after the single execution of a groupBy query, just much higher afterwards. This graph would continue in this static state forever. Also, the topN performance prior to executing a first groupBy query is the same as in a Druid cluster configured without SQL-compatible null handling.

So this is not an edge case by any means: If I were to switch the SQL-compatible null handling on in our production cluster which is running mixed workloads, the moment a groupby query would get executed for the first time, query performance for topNs would plummet considerably and not recover.

gianm commented 4 years ago

It could be that something about your sequence of queries defeats de-virtualization or inlining as described on https://shipilev.net/blog/2015/black-magic-method-dispatch/. Maybe something that was monomorphic or bimorphic with only topNs in the mix becomes megamorphic once groupBys are in the mix as well (due to different usage patterns of some shared component). I don't think I can speculate about what component that might be, since there are a lot of shared components and some of them are pretty complex. It might not even have anything to do with groupBy — it's possible that your groupBy queries are using some features (specific aggregators, filters, etc) that your topNs aren't using.

But if this is what's going on, then vectorization should fix it, since it makes function call overhead less of an issue. So we should continue with our program of vectorizing everything (topN isn't yet).
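To make the call-site effect described above concrete, here is a minimal, self-contained Java sketch (purely hypothetical class names, not Druid code, and a naive timing loop rather than a proper JMH benchmark). A hot interface call site that has only ever seen one or two implementations can be devirtualized and inlined by HotSpot; once a third implementation flows through the same call site it typically becomes megamorphic and stays a virtual call, which can noticeably slow a tight per-row loop such as buffer aggregation.

```java
import java.nio.ByteBuffer;

public class MegamorphicCallSiteSketch
{
  interface BufferAggregator
  {
    void aggregate(ByteBuffer buf, int position);
  }

  static final class LongSumAggregator implements BufferAggregator
  {
    @Override public void aggregate(ByteBuffer buf, int position)
    {
      buf.putLong(position, buf.getLong(position) + 1L);
    }
  }

  static final class DoubleSumAggregator implements BufferAggregator
  {
    @Override public void aggregate(ByteBuffer buf, int position)
    {
      buf.putDouble(position, buf.getDouble(position) + 1.0);
    }
  }

  static final class CountAggregator implements BufferAggregator
  {
    @Override public void aggregate(ByteBuffer buf, int position)
    {
      buf.putInt(position, buf.getInt(position) + 1);
    }
  }

  // The shared hot loop: the single interface call site inside it is what HotSpot
  // profiles. If only one implementation ever reaches it, the call can be inlined;
  // after three distinct implementations have been seen it is treated as megamorphic.
  static void hotLoop(BufferAggregator agg, ByteBuffer buf, int iterations)
  {
    for (int i = 0; i < iterations; i++) {
      agg.aggregate(buf, 0);
    }
  }

  public static void main(String[] args)
  {
    ByteBuffer buf = ByteBuffer.allocate(64);

    // Phase 1: only LongSumAggregator is seen -> the call site stays monomorphic.
    long t0 = System.nanoTime();
    hotLoop(new LongSumAggregator(), buf, 50_000_000);
    System.out.printf("monomorphic phase: %d ms%n", (System.nanoTime() - t0) / 1_000_000);

    // Phase 2: push two more implementations through the SAME call site,
    // loosely analogous to a groupBy query exercising shared engine code.
    hotLoop(new DoubleSumAggregator(), buf, 1_000_000);
    hotLoop(new CountAggregator(), buf, 1_000_000);

    // Phase 3: the original workload again; the call site is now megamorphic and on
    // HotSpot is typically measurably slower than phase 1, and it does not recover.
    long t1 = System.nanoTime();
    hotLoop(new LongSumAggregator(), buf, 50_000_000);
    System.out.printf("megamorphic phase: %d ms%n", (System.nanoTime() - t1) / 1_000_000);
  }
}
```

If this mechanism is indeed what is happening, it would also fit the observation that the degradation never recovers: once the extra implementations have been observed at the call site, HotSpot does not go back to the monomorphic, inlined form.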

gianm commented 4 years ago

Btw, have you tried the vectorized groupBy engine? It doesn't apply in every case yet (see the docs at https://druid.apache.org/docs/latest/querying/query-context.html#vectorizable-queries) but I'm curious how well it works for you. We'd like to make this the default in a future release.

sascha-coenen commented 4 years ago

Hi. We tried the vectorization engine extensively and it works VERY well for us. Most of the time throughout the years when you guys reported significant performance gains, we could usually not reproduce them in our clusters very well. But with this one we are right in the center: we use local filter conditions a LOT, and prior to vectorization, to my understanding, Druid had to scan over those rows without applying bitmap filters, whereas the vectorized engine already works with bitmaps. I don't know whether this assumption is true, but it feels to me that this is where the speedups might be the biggest.

We see speedups of up to 3x for timeseries queries across the board and 2x for groupBy queries. The only exception is a very frequently used class of queries, namely groupBys with an ORDER BY clause on a measure, which don't profit from vectorization at all because they seem to be bound by the data volume streamed back from the historicals. To speed up this important class of queries we very much depend on the "forceLimitPushDown" feature, which unfortunately hasn't been working for SQL queries for a while (since 0.11.0). That feature in combination with the vectorization engine would be blow-away amazing.

Regarding the perf issue above: what you describe sounds reasonable. Let me add that it is very easy for us to reproduce this performance degradation. It does not matter which kind of groupbyV2 query we use; we basically saw it happen with pretty much any query. With topN, it seems that the query needs to have a certain weight, like a couple of measures and more than one sequential pass. You should be able to reproduce this on your end easily. Do you not experience any performance penalty in topNs at all when using SQL-compatible null handling?