delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.41k stars 1.66k forks source link

[BUG] Performance downgrade using OPTIMIZE (compaction) with Trino #1151

Closed Dearkano closed 1 year ago

Dearkano commented 2 years ago

Bug

Describe the problem

We were happy to see that OSS delta lake supported compaction, so we upgraded and tried to optimize our current tables. However, we observed the performance downgrade after compaction.

We use Starburst Enterprise (Trino) to query the delta lake on the top of S3, and here's our performance test using TPC-DS, and the chart below shows the size and object numbers of each table. image

After compaction, each table's files were merged into one file, one example below. Screen Shot 2022-05-24 at 5 53 28 PM

Then we did the benchmark test and the comparsion, the below is the result of the query elapsed time, the left (blue) is before the compaction, and the right (red) is after. Screen Shot 2022-05-24 at 5 41 30 PM

The samiliar downgraded performance was also observed on our data (we have large partitioned tables, and test the benchmarks on select with different filters and complex join queries)

Environment information

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

vkorukanti commented 2 years ago

@Dearkano Thanks for reporting the issue. Is it possible to share the Trino query execution profile before optimize and after optimize for one of the TPCDS queries (preferably Q21 which seems to have the worst degradation)?

Dearkano commented 2 years ago

@vkorukanti sure, I will do the test and upload them later

scottsand-db commented 2 years ago

@Dearkano - thanks! Looking forward to getting those query execution profiles so we can help debug this issue!

Dearkano commented 2 years ago

@vkorukanti @scottsand-db Hi, after we dive into the queries, we found that the root cause is that: After the compaction, the parallelism of Trino is significantly reduced, this is expected since Trino has only one file (in the TPC-DS case) to deal with in each table after compaction. So CPU time after compaction is only 1/2 or even 1/10 compared to before. But in the end, the queries take a longer elapsed time, with consuming fewer resources.

Below is the comparison of the elapsed time, it should be similar to my original post, but this time, we ingested two copies of TPC-DS and then optimized one of them. Screen Shot 2022-05-27 at 6 09 05 PM

And this one is the comparison of the CPU time, we can see that the compaction saves the CPU significantly. Screen Shot 2022-05-27 at 6 09 59 PM

Take one query as an example, the query text is:

select count(*) 
from ((select distinct c_last_name, c_first_name, d_date
       from store_sales, date_dim, customer
       where store_sales.ss_sold_date_sk = date_dim.d_date_sk
         and store_sales.ss_customer_sk = customer.c_customer_sk
         and d_month_seq between 1197 and 1197+11)
       except
      (select distinct c_last_name, c_first_name, d_date
       from catalog_sales, date_dim, customer
       where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
         and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
         and d_month_seq between 1197 and 1197+11)
       except
      (select distinct c_last_name, c_first_name, d_date
       from web_sales, date_dim, customer
       where web_sales.ws_sold_date_sk = date_dim.d_date_sk
         and web_sales.ws_bill_customer_sk = customer.c_customer_sk
         and d_month_seq between 1197 and 1197+11)
) cool_cust

This is the execution details of the query before compaction. Screen Shot 2022-05-27 at 6 12 41 PM Screen Shot 2022-05-27 at 6 19 23 PM

This is the execution details of the query after compaction. Screen Shot 2022-05-27 at 6 12 53 PM Screen Shot 2022-05-27 at 6 19 43 PM

Since we do not have cases where in each partition, there are tens of GBs of data (so there will be tens of files after compaction), we do not have benchmarks on cases like this.

In conclusion, the compaction will help Trino save resources while this may sacrifice the parallelism of Trino and make queries slower if the query resource is sufficient.

Dearkano commented 2 years ago

@vkorukanti Hi, is there any update on this issue? Thank you!

vkorukanti commented 2 years ago

@Dearkano do you have the same problem on your production data? Wondering have you tried changing the max split size in Trino?

vkorukanti commented 2 years ago

@Dearkano From the screenshots you posted, it looks like the number of stages is different? Is the query plan different when running on a compacted data? Wondering if any join optimization are triggered due to less size?

Dearkano commented 2 years ago

@vkorukanti Thank you for the reply! I have not tried max-split-size, do you mean that we can use this config to increase the parallelism? I attach the query plans here. From my side, the plan is indeed different just like the stage shows. There are 10 fragments in the plan with compaction, and 16 fragments in the plan without compaction. I'm not very familiar with the query plan so I'm not sure if there is any join optimization you mentioned.

Query Plan - Before Compaction.txt Query Plan - After Compaction.txt

vkorukanti commented 2 years ago

@Dearkano, looking at the before and after query plans, it looks like the after compaction query has broadcast joins vs the hash partition join in before compaction query plans.

You could try setting the session property join_distribution_type to PARTITIONED to keep the plans same.

Regarding why the broadcast join is slow, need to repro to debug this.

zsxwing commented 1 year ago

Closing the stale issue. Feel free to reopen it if the above suggestion doesn't work

ryadav-uptycs commented 4 months ago

@Dearkano is this fixed for you? I am also facing same issue , that on compacted files queries performance is slow or same vs querying uncompacted data .