aws / aws-emr-best-practices

A best practices guide for using AWS EMR. The guide covers best practices on the topics of cost, performance, security, operational excellence, reliability, and application-specific best practices across Spark, Hive, Hudi, HBase, and more.

Create Section on Maximizing HDFS Read/Write Throughput Cost Performance #10

Open secretazianman opened 2 years ago

secretazianman commented 2 years ago

I couldn't find a good guide anywhere on maximizing HDFS read/write throughput.

Example: a 10 TB dataset that is copied to local HDFS for local processing

Optimizing instance selection for cost

Optimizing EBS volume count per instance type

The gp2 volume documentation has the following note: "The throughput limit is between 128 MiB/s and 250 MiB/s, depending on the volume size. Volumes smaller than or equal to 170 GiB deliver a maximum throughput of 128 MiB/s. Volumes larger than 170 GiB but smaller than 334 GiB deliver a maximum throughput of 250 MiB/s if burst credits are available. Volumes larger than or equal to 334 GiB deliver 250 MiB/s regardless of burst credits."
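As a rough illustration of the volume-count angle, here is a minimal boto3 sketch of a core instance group that requests four 400 GiB gp2 volumes per node, so each volume sits above the 334 GiB threshold and delivers 250 MiB/s without relying on burst credits. The instance type, counts, and sizes are illustrative placeholders, not recommendations:

```python
# Minimal sketch (not a recommendation): a CORE instance group where each gp2
# volume is >= 334 GiB and therefore delivers 250 MiB/s without burst credits.
# Instance type, counts, and sizes are illustrative placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

core_group = {
    "Name": "Core - HDFS",
    "InstanceRole": "CORE",
    "InstanceType": "r5.4xlarge",   # pick for its dedicated EBS + network bandwidth
    "InstanceCount": 10,
    "EbsConfiguration": {
        "EbsOptimized": True,
        "EbsBlockDeviceConfigs": [
            {
                "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 400},
                # 4 x 250 MiB/s per node, as long as the instance's dedicated
                # EBS bandwidth is high enough to actually drive all 4 volumes.
                "VolumesPerInstance": 4,
            }
        ],
    },
}

# core_group would be passed in Instances["InstanceGroups"] of emr.run_job_flow(...)
```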

mattliemAWS commented 2 years ago

Now that EMR supports gp3, which lets you size throughput/IOPS independently of volume size, do you think this is still needed?

secretazianman commented 2 years ago

Hey Matt, the throughput configuration wasn't available 3 months ago :). Looks like we need to update our code set to use gp3! (https://github.com/aws/aws-sdk/issues/29#issuecomment-1172671844)
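For reference, a gp3 variant of the same block device config, assuming the Iops/Throughput fields EMR exposes for gp3 volumes (the values are illustrative):

```python
# Hypothetical gp3 block device config: throughput and IOPS are requested
# explicitly instead of being derived from volume size. Values are examples only.
gp3_ebs_configuration = {
    "EbsOptimized": True,
    "EbsBlockDeviceConfigs": [
        {
            "VolumeSpecification": {
                "VolumeType": "gp3",
                "SizeInGB": 200,     # size no longer dictates throughput
                "Iops": 6000,        # gp3 baseline is 3000
                "Throughput": 500,   # MiB/s per volume, set independently of size
            },
            "VolumesPerInstance": 2,
        }
    ],
}
```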

I think it's still a good idea to have a section covering network bandwidth, gp3 settings, and instance selection. The Ganglia section of the AWS EMR whitepaper briefly mentions I/O wait, but that's about the only reference to network/disk bandwidth optimization I've seen.

Maximizing network/EBS/disk throughput may be the obvious first thing to look at for people with a sysadmin or Hadoop admin background, but it isn't for our end reporting/analytics users.
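For the 10 TB example above, a back-of-envelope sizing sketch might look like the following; the replication factor, node count, and per-node limits are all assumptions to replace with real instance specs:

```python
# Rough sizing for "copy 10 TB into local HDFS". All inputs are assumptions;
# substitute the real instance specs and cluster size. Ignores the extra
# network hops of the HDFS replication pipeline.
DATASET_TIB = 10
REPLICATION = 3          # HDFS default: every block is written three times
CORE_NODES = 10

# Per-node write limits in MiB/s. Effective rate is the slowest of the NIC,
# the instance's dedicated EBS bandwidth, and the sum of its volume limits.
network_mib_s = 1150             # e.g. ~10 Gbps NIC
ebs_bandwidth_mib_s = 590        # dedicated EBS bandwidth of the instance
volume_mib_s = 4 * 250           # four volumes at 250 MiB/s each

per_node_write = min(network_mib_s, ebs_bandwidth_mib_s, volume_mib_s)
cluster_write = per_node_write * CORE_NODES

mib_to_write = DATASET_TIB * 1024 * 1024 * REPLICATION  # includes replication
hours = mib_to_write / cluster_write / 3600

print(f"Per-node bottleneck: {per_node_write} MiB/s")
print(f"Estimated copy time: {hours:.1f} hours")
```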

mattliemAWS commented 2 years ago

Makes sense, appreciate the recommendation!! Will add a section on this. Many customers use HDFS for intermediate data, shuffle, etc., so being aware of disk/throughput tuning is a worthwhile optimization.
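For anyone following this issue, one way to sanity-check what a running cluster actually got is to list its instance groups; a sketch with a placeholder cluster id:

```python
# Hypothetical check: print the EBS layout EMR actually attached to each
# instance group of a running cluster (the cluster id is a placeholder).
import boto3

emr = boto3.client("emr", region_name="us-east-1")

resp = emr.list_instance_groups(ClusterId="j-XXXXXXXXXXXXX")
for ig in resp["InstanceGroups"]:
    print(ig["InstanceGroupType"], ig["InstanceType"], ig["RequestedInstanceCount"])
    for dev in ig.get("EbsBlockDevices", []):
        spec = dev["VolumeSpecification"]
        print("  ", dev.get("Device"), spec["VolumeType"],
              spec["SizeInGB"], "GiB,", "throughput:", spec.get("Throughput"))
```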