secretazianman commented 2 years ago

Tuning the split sizes can greatly improve performance for reading S3. Local HDFS will get some benefit too.

ORC Specific Issue

PrestoDB/Trino, used by Athena, has a setting where if the ORC stripe size is <8MB, it results in entire data file scan.
The Orc Stripe default size is 64 MB. Post Compression this can in a result a <8MB stripe. One can verify in Athena by observing the Data Scanned value and notice that these orc files always result in entire data scan.
By increasing Stripe size Athena will only scan the necesssary columns in ORC, reducing costs and improving speed.

ORC+Parquet+(Any splittable file type)+S3

For very large Tables we can see a substantial improvement in query response on s3 backed tables.
Let's take for example a table containing 10TB of ORC+ZLIB compressed files. Let's assume the compressed stripe size is 6MB and they're perfectly distributed in the files, this would result in (10TB x 1024 x 1024)/6MB= 1,310,720 ORC Stripes. If this was a Spark job, the driver would be spending minutes loading all of these splits into memory during the plan, increasing costs and runtime. This can be verified by observing the Driver time spent on s3 file threads with jstack.
Relevant Settings orc.stripe.size parquet.block.size

mattliemAWS commented 2 years ago

Thanks. Is your recommendation to increase the stripe/block size to reduce driver time spent loading splits in memory?

Looks like default orc.stripe.size=~200mb and default parquet.block.size=256mb. Have you found better performance with increasing these values?

secretazianman commented 2 years ago

Where did you find those default values? Is that something EMR overrides? I wasn't able to find them in the EMR hive/yarn site yamls.

Here's the source code defaults Orc Stripe 64MB https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/OrcConf.java#L29

Parquet block 128MB https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L42

secretazianman commented 2 years ago

Also, I did find much better performance by increasing the split sizes for ORC.

I used ~4GB ORC files with 256MB stripes in a benchmark. The 256MB stripes resulted in about 28MB compressed stripe size using ZLIB. This reduced the Tez Mapper count by over 20x, the driver was about 90% faster, and a simple count(*) query was 30% faster(post driver startup). I'm not sure if s3 api costs include the get requests per individual file split, but if it does this should have a tremendous savings on s3 api usage as well.

I tried with 512MB stripes but found it may start impacting the performance of producing the source data due to larger tasks. Still, it also yielded faster query response so a tradeoff should be decided by the user.

secretazianman commented 2 years ago

Also for optimal ORC write performance I found it was best to write the files to local HDFS, concatenate, and then upload to s3.

I found that having the task size equal to the ORC split size resulted in the fastest local HDFS write, due to the small task sizes. This way each file only has a single large orc stripe in it.

Next use the Hive alter table concatenate command to merge the small files at the stripe level, this runs super fast on local HDFS. Then use the s3distcp command to upload.

I haven't explored an optimal Parquet write strategy yet.

aws / aws-emr-best-practices

Create entry in BP 5.1.4 - Optimal Split Size #9

Tuning the split sizes can greatly improve performance for reading S3. Local HDFS will get some benefit too.

ORC Specific Issue

ORC+Parquet+(Any splittable file type)+S3