aws / aws-emr-best-practices

A best practices guide for using AWS EMR. The guide covers best practices for cost, performance, security, operational excellence, and reliability, plus application-specific best practices for Spark, Hive, Hudi, HBase, and more.

Create entry in BP 5.1.4 - Optimal Split Size #9

secretazianman opened this issue 2 years ago

secretazianman commented 2 years ago

Tuning the split sizes can greatly improve performance when reading from S3. Local HDFS reads will see some benefit too.
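As a sketch of what "tuning the split sizes" looks like in practice, the session below sets the standard Hadoop/Tez split-grouping properties before running a query. The table name and the values are illustrative assumptions, not tested recommendations from this thread.

```shell
# Hypothetical Hive-on-Tez session; property names are the standard
# Hadoop/Tez ones, values (256 MB min, 1 GB max) are illustrative.
hive -e "
SET mapreduce.input.fileinputformat.split.minsize=268435456;
SET mapreduce.input.fileinputformat.split.maxsize=268435456;
SET tez.grouping.min-size=268435456;
SET tez.grouping.max-size=1073741824;
SELECT count(*) FROM my_table;
"
```

Larger grouped splits mean fewer Tez mappers and fewer S3 GET requests per query, at the cost of coarser parallelism.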

ORC-specific issue

ORC + Parquet + (any splittable file type) + S3

mattliemAWS commented 2 years ago

Thanks. Is your recommendation to increase the stripe/block size to reduce driver time spent loading splits in memory?

Looks like default orc.stripe.size=~200MB and default parquet.block.size=256MB. Have you found better performance by increasing these values?

secretazianman commented 2 years ago

Where did you find those default values? Is that something EMR overrides? I wasn't able to find them in the EMR hive/yarn site yamls.

Here are the source code defaults:

ORC stripe 64MB: https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/OrcConf.java#L29

Parquet block 128MB: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L42
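For reference, these writer-side sizes can be overridden at write time. The snippet below is a hypothetical sketch: the 256 MB values match the benchmark discussed later in this thread, and the table names are placeholders.

```shell
# Override the source-code defaults (64 MB ORC stripe, 128 MB Parquet
# block) when writing; 268435456 bytes = 256 MB. Illustrative only.
hive -e "
SET orc.stripe.size=268435456;
SET parquet.block.size=268435456;
INSERT OVERWRITE TABLE my_orc_table SELECT * FROM source_table;
"
```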

secretazianman commented 2 years ago

Also, I did find much better performance by increasing the split sizes for ORC.

I benchmarked with ~4GB ORC files written with 256MB stripes. With ZLIB compression, each 256MB stripe compressed to about 28MB. This reduced the Tez mapper count by over 20x, made the driver about 90% faster, and made a simple count(*) query 30% faster (after driver startup). I'm not sure whether S3 API costs include a GET request per individual file split, but if they do, this should yield tremendous savings on S3 API usage as well.
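The mapper reduction follows directly from the stripe math. A back-of-envelope sketch, using the default 64MB stripe size from the links above against the 256MB stripes in this benchmark (the 4GB file size is from the comment; per-file split counts are an approximation, since actual mapper counts also depend on split grouping):

```shell
# Approximate ORC splits per file: one split per stripe.
file_mb=4096
echo "splits per file at 64 MB stripes:  $((file_mb / 64))"
echo "splits per file at 256 MB stripes: $((file_mb / 256))"
```

That is a 4x reduction per file from the stripe size alone; across many files, combined with split grouping, the overall mapper count can drop much further, consistent with the 20x observed.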

I also tried 512MB stripes, but found they may start to hurt the performance of producing the source data because of the larger tasks. They still yielded faster query responses, so that tradeoff is one the user should weigh.

secretazianman commented 2 years ago

Also, for optimal ORC write performance, I found it best to write the files to local HDFS, concatenate them, and then upload to S3.

I found that setting the task size equal to the ORC split size gave the fastest local HDFS writes, since the tasks stay small. That way each file contains only a single large ORC stripe.

Next, use Hive's ALTER TABLE ... CONCATENATE command to merge the small files at the stripe level; this runs very fast on local HDFS. Then use the S3DistCp command to upload.
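The three-step flow described above could look roughly like this. Table names, HDFS paths, and the S3 bucket are all placeholders, and the exact warehouse path will differ per cluster:

```shell
# 1. Write ORC files to local HDFS (each task emits one stripe-sized file).
hive -e "INSERT OVERWRITE TABLE staging_orc SELECT * FROM source_table;"

# 2. Merge the small files at the stripe level on HDFS.
hive -e "ALTER TABLE staging_orc CONCATENATE;"

# 3. Copy the merged files up to S3 with S3DistCp.
s3-dist-cp --src hdfs:///user/hive/warehouse/staging_orc \
           --dest s3://my-bucket/warehouse/staging_orc
```

CONCATENATE merges ORC files without decompressing and rewriting the stripes, which is why it runs so fast on HDFS.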

I haven't explored an optimal Parquet write strategy yet.