fivetran / benchmark

Benchmark data warehouses under Fivetran-like conditions

Tune Redshift Properly #1

Closed · davidgasquez closed 3 years ago

davidgasquez commented 6 years ago

Hey there!

Although I totally get why you didn't tune Redshift, it might be interesting to have a comparison of the speedup that Redshift could get knowing the queries you're going to run.

It might not be fair, but I can see a lot of use cases where you want some specific queries to be fast, and I'd love to see how fast they can be and how that compares to BigQuery.

Would love to see this in Part 2 if it is not a lot of work!

georgewfraser commented 6 years ago

I agree, it would be nice to see how the results change with more tuning across all warehouses. We would also want to implement partitioning in Snowflake, which is sort of a combination of dist and sort keys.

joeharris76 commented 6 years ago

+1. Almost every query is filtered on year.

georgewfraser commented 6 years ago

After thinking about this more, I think what we should do is present Redshift twice: once out-of-the-box, and once with tuning.

Snowflake and BigQuery don't have the equivalent of distribution-style, so we'll basically be adding "Redshift with tuning" as a 4th warehouse. I think most people know if they're the kind of user who is going to tune their warehouse, or just use it out-of-the-box, so everyone can look at the charts and decide which type of user they are, and therefore which comparison is more relevant to them.

I don't think it's terribly important to evaluate sort keys. Sort keys / clustering / date-partitioning are present in all the warehouses, and they're extremely effective when there's a date predicate on the query. Since they don't really differentiate warehouses, I think it makes sense to just leave them out.

georgewfraser commented 6 years ago

@davidgasquez @joeharris76 We are finally publishing an update of this benchmark, and we added DISTKEY to the item and customer surrogate keys. These are the keys which participate in the fact-to-fact joins, which is where DISTKEY can really help. You can see the code here:

https://github.com/fivetran/benchmark/blob/master/201-PopulateRedshiftOptimized.sh#L45
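For readers who don't want to open the script, the shape of the change looks roughly like this (an illustrative sketch, not the benchmark's actual DDL; column lists are abbreviated):

```sql
-- Sketch: distribute the large fact tables on the item surrogate key so
-- rows that join to each other land on the same node, letting Redshift
-- execute the fact-to-fact join without redistributing data over the
-- network. Table/column names follow TPC-DS conventions but are
-- illustrative here.
CREATE TABLE store_sales (
    ss_item_sk     INTEGER NOT NULL,
    ss_customer_sk INTEGER,
    ss_sales_price DECIMAL(7,2)
    -- ... remaining columns elided ...
)
DISTSTYLE KEY
DISTKEY (ss_item_sk);

CREATE TABLE store_returns (
    sr_item_sk     INTEGER NOT NULL,
    sr_customer_sk INTEGER,
    sr_return_amt  DECIMAL(7,2)
    -- ... remaining columns elided ...
)
DISTSTYLE KEY
DISTKEY (sr_item_sk);
```

With matching DISTKEYs, a join on `ss_item_sk = sr_item_sk` is node-local; with the default distribution, one side of the join has to be broadcast or redistributed first.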

We didn't do sort keys because sort keys are available in all data warehouses, and work extremely well in all of them, so it's not an important point of comparison. Also, sort keys don't really help with TPC-DS, because the date predicates are all on the dimension tables.

DISTKEY makes little difference, and actually makes things slightly slower on 1 TB scale, possibly because of the large compilation times of many of these queries:

| Scale | 100 GB | 100 GB | 1 TB | 1 TB |
|---|---|---|---|---|
| Node type | dc2.large | dc2.large | dc2.8xlarge | dc2.8xlarge |
| Node count | 8 | 8 | 4 | 4 |
| Distkeys? | No | Yes | No | Yes |
| Median | 18.6 | 19.1 | 20.8 | 24.9 |
| Average | 24.2 | 23.0 | 29.9 | 37.5 |

If anyone following this issue has any feedback on what we've done, we'd love to hear it! We're planning to publish in a couple weeks but it's not too late to try other configurations.

joeharris76 commented 6 years ago

Please consider using my revised DDL for the "optimized" Redshift.

It specifies the data types more tightly, adds sort keys to the appropriate columns, and sets the smaller tables to be distributed on all nodes. Saves ~20% disk space and ~10-30% query time.

Regarding "sort keys are available in all data warehouses", Redshift is designed for sort keys. Not using them is a lot like not using an index in SQL Server or Oracle. It still works but you can't call it optimal. :)

https://github.com/fivetran/benchmark/pull/5
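The kinds of changes described above look roughly like this (illustrative DDL only; the PR has the actual definitions):

```sql
-- Sketch of the three tuning ideas, on a hypothetical dimension table:
-- 1. Tighter types and NOT NULL where the data allows it, so the
--    optimizer reserves less memory per row and can pick cheaper joins.
-- 2. A sort key on the column queries commonly filter on, so scans can
--    skip blocks outside the predicate's range.
-- 3. DISTSTYLE ALL for small tables, replicating them to every node so
--    joins against them never need a network redistribution.
CREATE TABLE date_dim (
    d_date_sk INTEGER  NOT NULL,
    d_date    DATE     NOT NULL,
    d_year    SMALLINT NOT NULL
    -- ... remaining columns elided ...
)
DISTSTYLE ALL
SORTKEY (d_date_sk);
```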

georgewfraser commented 6 years ago

Thanks @joeharris76, we're looking closely at your PR. I think we're going to try adding the changes one at a time.

The thing about sortkeys is that you have put them on the exact columns that happen to have predicates on them in the queries we're benchmarking. This is, in my opinion, "bad benchmarking". Real users write all sorts of queries and usually don't know in advance which columns will benefit from sort keys. Having said that, people yell at me about this all the time when I give talks about this benchmark, so I want to run the alternate version and include it in a "deep dive" version.

Shrinking the varchars is a really tough call. I have seen LOTS of real users who just set their varchars to 256 so they don't have to worry about column types changing in the source. Having said that, if you use Fivetran to sync a database or Salesforce to Redshift, you will get "tight" varchars as you have specified here. I lean towards switching to the tight varchars you've suggested, possibly in both optimized and unoptimized redshift.

georgewfraser commented 6 years ago

Hi @joeharris76 , I've re-run with:

GEOMEAN(time):

| Scale | Config | First run | 2nd run (no compilation) |
|---|---|---|---|
| 100 GB | Naive | 18.1 | 3.8 |
| 100 GB | Optimized | 16.1 | 2.3 |
| 100 GB | Ratio | 0.89 | 0.61 |
| 1 TB | Naive | 20.3 | 6.3 |
| 1 TB | Optimized | 17.0 | 3.6 |
| 1 TB | Ratio | 0.84 | 0.57 |

First, the above table shows how much the wall-clock times are being driven by compilation time, as opposed to runtime. Second, if you look at just the runtime, it's about 40% faster with these very aggressive optimizations (arguably too aggressive on the sort keys).

joeharris76 commented 6 years ago

Awesome! Query compilation is a pain for benchmarks like this. If you check an active cluster though you’ll find a very small percentage of real world queries have to be compiled. I can provide a query to calculate if needed.

georgewfraser commented 6 years ago

It is really tough to figure out how to treat query compilation fairly. Part of my thinking is that it's the interactive, ad-hoc queries where users judge the performance of the warehouse, and that's where it's hardest for the compiler to cache things. Having said that, I've played around with "interactively editing" TPC-DS queries by editing subqueries, changing the group by clause, and monitoring what gets re-compiled. Redshift seems to do a pretty good job of re-using previously-compiled parts of the query, so the "ad-hoc query" scenario isn't quite as bad as this benchmark makes it seem.

I think what we're going to do is publish this with a giant asterisk that says "we are probably overstating the effect of compilation times for Redshift", and try to improve it in the next iteration. In my mind, the ideal thing would be to take each TPC-DS query and turn it into 5 variants that make meaningful changes to the query (not just changing constant values). The idea is to simulate an analyst working on a query interactively. That would allow the compiler cache to kick in, without making it too easy, and I think it would benefit Redshift disproportionately because it has the slowest compiler.

georgewfraser commented 6 years ago

I ran a less-aggressively-optimized version:

https://github.com/fivetran/benchmark/commit/8abe0309d5484649452aaff53da945eaa34bf9ab#diff-13970533686291d621dfa0387451d11a
https://github.com/fivetran/benchmark/commit/030f3da4071265419f86370959ec471d17c22a20#diff-13970533686291d621dfa0387451d11a

and found that about half the speedup was lost:

| Scale | Config | Time | Time / Naive | 2nd run (no compilation) | 2nd run / Naive |
|---|---|---|---|---|---|
| 100 GB | Naive | 18.1 | 1.00 | 3.8 | 1.00 |
| 100 GB | Dist-keys only | 16.2 | 0.90 | 3.3 | 0.88 |
| 100 GB | Aggressive optimizations | 16.1 | 0.89 | 2.7 | 0.71 |
| 1 TB | Naive | 20.3 | 1.00 | 6.3 | 1.00 |
| 1 TB | Dist-keys only | 18.5 | 0.91 | 5.5 | 0.88 |
| 1 TB | Aggressive optimizations | 17.0 | 0.84 | 3.6 | 0.58 |

The big picture of all this is that if the goal is interactive query speeds, sort and dist keys are far less important than how well the compiler cache works. If I had to ballpark it, I'd guess that the midpoint between the first-run and 2nd-run times is probably what a user who's interactively editing the same query should expect.

joeharris76 commented 6 years ago

I don't consider these sort keys to be aggressive TBH. The "beauty" of sort keys is that data is often correlated across columns in a DW scenario. Even a suboptimal sort key choice can be fairly effective for reducing scans generally.

Removing the NOT NULL declarations is quite negative for joins. The optimizer can choose more efficient join methods and skip some checks when it knows the columns cannot contain nulls.

De-tuning the data types forces the optimizer to allocate more memory for every row used from these columns during execution. That cost is compounding because the extra memory will have to be accessed or copied by each query step.

In my experience compilation time is only an issue for synthetic benchmarks. Query compilation in Redshift is quite low level. In real life a complex "ad hoc" exploration query might have dozens of steps. Typically most steps have already been compiled and only 1 or 2 require compilation.

You can use SVL_COMPILE to review compilation stats. I'd suggest reviewing the stats from a few active Fivetran customer clusters to see what the compile % is and using that as a discount function to calculate the metric.
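A query along these lines would show the compiled fraction (a sketch against the documented SVL_COMPILE columns; the lookback window and how to turn the percentage into a discount function are up to you):

```sql
-- Sketch: fraction of query segments compiled vs. reused from the
-- compile cache over the last week. SVL_COMPILE.compile is 1 when a
-- segment was actually compiled and 0 when it came from the cache.
SELECT
    SUM(compile)::FLOAT / COUNT(*)  AS frac_segments_compiled,
    COUNT(DISTINCT CASE WHEN compile = 1 THEN query END) AS queries_with_compiles,
    COUNT(DISTINCT query)           AS total_queries
FROM svl_compile
WHERE starttime > DATEADD(day, -7, GETDATE());
```

On a warmed-up production cluster the compiled fraction is typically small, which is the point being made above about synthetic benchmarks overstating compilation cost.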

Have you considered using total runtime in the benchmark instead of (or in addition to…) avg query time? Seemingly small advantages in query time accumulate over the 99 queries and a single outlier tends to skew the average.

georgewfraser commented 6 years ago

> I don't consider these sort keys to be aggressive TBH

Your sort keys literally pick out the exact columns that have predicates on them in the TPC-DS benchmark https://github.com/fivetran/benchmark/commit/030f3da4071265419f86370959ec471d17c22a20#diff-13970533686291d621dfa0387451d11a

> Removing the NOT NULL declarations is quite negative for joins

Sure, but we used nullable types on all the other warehouses.

> De-tuning the data types forces the optimizer to allocate more memory for every row used

Again, same configuration for all the other warehouses.

> In my experience compilation time is only an issue for synthetic benchmarks

I've played around with interactively editing queries and studied the SVL_COMPILE output, and it seems that Redshift is able to reuse about half of the previous compilation result if you make significant changes to a subquery. I'm not super confident in this number, though; we're planning to publish this iteration with a big asterisk saying "we are probably overstating the impact of compilation time" and revisit this in a few months. This is definitely the weakest part of the whole story: we just don't have a good assessment of how important the (huge) difference in compilation times is.

> Have you considered using total runtime in the benchmark

The numbers I've reported are geomeans, which I think accomplishes what you're getting at: short queries and long queries are weighted equally in the outcome.
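For reference, the geometric mean (and the total runtime suggested above) can be computed directly in SQL over per-query timings (a sketch; `benchmark_results` and `elapsed_seconds` are hypothetical names, one row per query run):

```sql
-- Sketch: compare aggregations over per-query wall-clock times. The
-- geomean (EXP of the mean of logs) weights a 2x speedup on a 1-second
-- query the same as a 2x speedup on a 100-second query, whereas AVG and
-- SUM are dominated by the longest-running outliers.
SELECT
    EXP(AVG(LN(elapsed_seconds))) AS geomean_seconds,
    AVG(elapsed_seconds)          AS avg_seconds,
    SUM(elapsed_seconds)          AS total_seconds
FROM benchmark_results;
```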