aws / aws-emr-best-practices

A best practices guide for using AWS EMR. The guide covers best practices on the topics of cost, performance, security, operational excellence, reliability, and application-specific best practices across Spark, Hive, Hudi, HBase, and more.

Create Section on Maximizing HDFS Read/Write Throughput Cost Performance #10

Open secretazianman opened 2 years ago

secretazianman commented 2 years ago

I couldn't find a good guide anywhere on maximizing HDFS read/write throughput.

Example: a 10 TB dataset that is copied to local HDFS for local processing

Optimizing instance selection for cost

Optimizing EBS volume count per instance type

The gp2 volume documentation has the following note: "The throughput limit is between 128 MiB/s and 250 MiB/s, depending on the volume size. Volumes smaller than or equal to 170 GiB deliver a maximum throughput of 128 MiB/s. Volumes larger than 170 GiB but smaller than 334 GiB deliver a maximum throughput of 250 MiB/s if burst credits are available. Volumes larger than or equal to 334 GiB deliver 250 MiB/s regardless of burst credits."
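As a rough illustration of the volume-count angle, here is a minimal boto3 sketch of a core instance group that requests four 400 GiB gp2 volumes per node, so each volume sits above the 334 GiB threshold and delivers 250 MiB/s without relying on burst credits. The instance type, counts, and sizes are illustrative placeholders, not recommendations:

```python
# Minimal sketch (not a recommendation): a CORE instance group where each gp2
# volume is >= 334 GiB and therefore delivers 250 MiB/s without burst credits.
# Instance type, counts, and sizes are illustrative placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

core_group = {
    "Name": "Core - HDFS",
    "InstanceRole": "CORE",
    "InstanceType": "r5.4xlarge",   # pick for its dedicated EBS + network bandwidth
    "InstanceCount": 10,
    "EbsConfiguration": {
        "EbsOptimized": True,
        "EbsBlockDeviceConfigs": [
            {
                "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 400},
                # 4 x 250 MiB/s per node, as long as the instance's dedicated
                # EBS bandwidth is high enough to actually drive all 4 volumes.
                "VolumesPerInstance": 4,
            }
        ],
    },
}

# core_group would be passed in Instances["InstanceGroups"] of emr.run_job_flow(...)
```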

mattliemAWS commented 2 years ago

Now that EMR supports gp3, which lets you size throughput/IOPS independently of volume size, do you think this is still needed?

secretazianman commented 2 years ago

Hey Matt, the throughput configuration wasn't available 3 months ago :). Looks like we need to update our code set to use gp3! (https://github.com/aws/aws-sdk/issues/29#issuecomment-1172671844)
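For reference, a gp3 variant of the same block device config, assuming the Iops/Throughput fields EMR exposes for gp3 volumes (the values are illustrative):

```python
# Hypothetical gp3 block device config: throughput and IOPS are requested
# explicitly instead of being derived from volume size. Values are examples only.
gp3_ebs_configuration = {
    "EbsOptimized": True,
    "EbsBlockDeviceConfigs": [
        {
            "VolumeSpecification": {
                "VolumeType": "gp3",
                "SizeInGB": 200,     # size no longer dictates throughput
                "Iops": 6000,        # gp3 baseline is 3000
                "Throughput": 500,   # MiB/s per volume, set independently of size
            },
            "VolumesPerInstance": 2,
        }
    ],
}
```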

I think it's still a good idea to have a section covering network bandwidth, gp3 settings, and instance selection. The Ganglia section of the AWS EMR whitepaper briefly mentions I/O wait, but that's about the only reference to network/disk bandwidth optimization I've seen.

Maximizing network/EBS/disk throughput may be the obvious first thing to look at for people with a sysadmin or Hadoop admin background, but it isn't for our end reporting/analytics users.
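For the 10 TB example above, a back-of-envelope sizing sketch might look like the following; the replication factor, node count, and per-node limits are all assumptions to replace with real instance specs:

```python
# Rough sizing for "copy 10 TB into local HDFS". All inputs are assumptions;
# substitute the real instance specs and cluster size. Ignores the extra
# network hops of the HDFS replication pipeline.
DATASET_TIB = 10
REPLICATION = 3          # HDFS default: every block is written three times
CORE_NODES = 10

# Per-node write limits in MiB/s. Effective rate is the slowest of the NIC,
# the instance's dedicated EBS bandwidth, and the sum of its volume limits.
network_mib_s = 1150             # e.g. ~10 Gbps NIC
ebs_bandwidth_mib_s = 590        # dedicated EBS bandwidth of the instance
volume_mib_s = 4 * 250           # four volumes at 250 MiB/s each

per_node_write = min(network_mib_s, ebs_bandwidth_mib_s, volume_mib_s)
cluster_write = per_node_write * CORE_NODES

mib_to_write = DATASET_TIB * 1024 * 1024 * REPLICATION  # includes replication
hours = mib_to_write / cluster_write / 3600

print(f"Per-node bottleneck: {per_node_write} MiB/s")
print(f"Estimated copy time: {hours:.1f} hours")
```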

mattliemAWS commented 2 years ago

Makes sense, appreciate the recommendation!! Will add a section on this. Many customers use HDFS for intermediate data, shuffle, etc., so being aware of disk/throughput tuning is a worthwhile optimization.
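For anyone following this issue, one way to sanity-check what a running cluster actually got is to list its instance groups; a sketch with a placeholder cluster id:

```python
# Hypothetical check: print the EBS layout EMR actually attached to each
# instance group of a running cluster (the cluster id is a placeholder).
import boto3

emr = boto3.client("emr", region_name="us-east-1")

resp = emr.list_instance_groups(ClusterId="j-XXXXXXXXXXXXX")
for ig in resp["InstanceGroups"]:
    print(ig["InstanceGroupType"], ig["InstanceType"], ig["RequestedInstanceCount"])
    for dev in ig.get("EbsBlockDevices", []):
        spec = dev["VolumeSpecification"]
        print("  ", dev.get("Device"), spec["VolumeType"],
              spec["SizeInGB"], "GiB,", "throughput:", spec.get("Throughput"))
```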