The recent deploy has seen a considerable delay - @dpb587 writes:
I think the slowness is due to the rebalancing. The disks currently have higher usage rates and one of them is being heavily used.
I've paused rebalancing so when the current rebalancing shards finish, I think things will pick back up.
I think it's slightly overdue to get on top of our IOPS footprint, so I have taken a closer look this time around. It's well known that disks queuing up I/O requests can slow a system down to a perceived standstill (no amount of CPU/memory will help at that point), so the key metric for the scenario at hand is VolumeQueueLength as reported by AWS CloudWatch (the only source of truth in this virtualized environment). And indeed, the high disk usage and perceived slowness correlate with the respective CloudWatch EBS metrics:
VolumeQueueLength
VolumeQueueLength should obviously be as low as possible (ideally below 1), yet it is clearly averaging around 20 here over an extended time frame, which in turn drives up the CPU I/O wait percentage on the instance:
I/O Wait
We are not aggregating the CPUs yet (see #228), which blurs the impression a bit, but suffice it to say that there's a whole lot of I/O waiting going on ...
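For reproducibility's sake, here is a minimal sketch of how the VolumeQueueLength averages above can be pulled from CloudWatch via boto3 (volume ID, region and time window are placeholders, adjust to the actual volumes):

```python
# Minimal sketch: pull 5-minute averages of VolumeQueueLength for a single EBS
# volume from CloudWatch. Volume ID, region and time window are placeholders.
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="VolumeQueueLength",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-xxxxxxxx"}],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,                 # 5-minute buckets, matching the graphs above
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))
```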
I/O Operations / Second (IOPS)
Looking at VolumeReadOps and (most significant here) VolumeWriteOps reveals that we are indeed operating at the maximum IOPS level for standard EBS volumes:
The baseline for standard EBS volumes is about 100 IOPS on average (see e.g. Fast Forward - Provisioned IOPS for EBS Volumes):
As a point of reference, a standard EBS volume will generally provide about 100 IOPS on average, with the ability to burst to hundreds of IOPS on a best-effort basis. [...]
The math is as follows (see Monitoring Volumes with CloudWatch):
To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number of seconds in that period.
This yields (~0 + ~24000) / 300 = 80 at the beginning of the spike here, with a later maximum towards the right of the graph of about (~1000 + ~38500) / 300 ≈ 131, so clearly within the documented corridor, i.e. right at the ceiling of what a standard volume delivers.
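Spelled out as a quick back-of-the-envelope script (the operation counts are the approximate per-period sums read off the graphs above, so purely illustrative):

```python
# Approximate per-period operation counts read off the CloudWatch graphs above;
# CloudWatch reports VolumeReadOps/VolumeWriteOps as sums over a 300-second period.
PERIOD_SECONDS = 300

spike_start = (0 + 24000) / PERIOD_SECONDS         # ~80 IOPS
later_maximum = (1000 + 38500) / PERIOD_SECONDS    # ~131 IOPS

print(int(spike_start), int(later_maximum))        # 80 131
```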
:information_source: If I read things and timing correctly, the short lull between the high-load periods in both graphs correlates with @dpb587's pausing and reactivating of the shard rebalancing.
Options
AWS features a dedicated section on Amazon EBS Volume Performance - the two most important aspects right now are probably the following:
Volume Provisioning
To reduce the duration of deployments based on new EBS volumes, and depending on how our deployment process evolves, we might consider Pre-Warming Amazon EBS Volumes (which might have been done already though, see below):
There is a 5 to 50 percent reduction in IOPS when you first access each block of data on a newly created or restored EBS volume. You can avoid this performance hit by accessing each block in advance. For more information, see Pre-Warming Amazon EBS Volumes.
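For reference, explicitly "accessing each block in advance" boils down to a full pass over the attached device; a minimal sketch follows (the device path is an assumption, and the pre-warming docs above describe the exact recommended procedure per volume type):

```python
# Minimal sketch of touching every block of an attached volume by reading it
# sequentially (device path is an assumption; needs root and should only be
# run before the volume is put to real use).
CHUNK_SIZE = 1024 * 1024  # 1 MiB per read

with open("/dev/xvdf", "rb", buffering=0) as device:
    while device.read(CHUNK_SIZE):
        pass
```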
:exclamation: The observed metrics seem to suggest that this hasn't been an issue here, so it might have been done already one way or another, i.e. explicitly or implicitly; after all, the volumes in question are supposed to be reused currently and I do not know how/when they came to life this time around (I'm still in doubt about that solution btw., but that's a different topic for another day).
Either way, while pre-warming matters for smooth deployments down the road, the more crucial aspect is obviously ongoing operation (including deployments), which can probably be improved by using Provisioned IOPS:
Provisioned IOPS
A reasonable assumption (also backed by the respective 'red' hints from our ElasticHQ monitoring) is that a comparatively I/O-heavy solution like LogSearch would indeed greatly benefit from Provisioned IOPS Volumes, which are designed to meet the needs of I/O-intensive workloads, particularly database workloads, that are sensitive to storage performance and consistency in random access I/O throughput.
Given we have based things on CloudFormation, this can be easily added to the stack.
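To illustrate what that amounts to (purely a sketch, with placeholder size/IOPS/availability zone, not a proposal for concrete values): a Provisioned IOPS volume is just VolumeType io1 plus an Iops value, which maps to the corresponding VolumeType/Iops properties of the AWS::EC2::Volume resource in our CloudFormation template. Via the plain EC2 API it would look like this:

```python
# Illustration only: a Provisioned IOPS volume created via the plain EC2 API.
# In the CloudFormation stack this corresponds to the VolumeType and Iops
# properties on the AWS::EC2::Volume resource. Size, IOPS level and
# availability zone are placeholders, not a recommendation.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=200,           # GiB
    VolumeType="io1",   # Provisioned IOPS
    Iops=1000,          # to be derived from the observed metrics, see below
)
print(volume["VolumeId"])
```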
:exclamation: Obviously we need to properly test and subsequently monitor EBS performance in an ongoing fashion, so that the respective PIOPS levels and any later adjustments are based on observed metrics.
Disclaimer / Feedback
As you can easily tell, I'm not exactly an expert in this subject matter and might even be entirely off with my analysis and/or deductions, so please draw your own conclusions and correct me as needed; any feedback would be highly appreciated.