dubiety / terraform-aws-elasticsearch-cloudwatch-sns-alarms

terraform module that configures important alarms for AWS elasticsearch and send them to SNS topic
Apache License 2.0
30 stars 45 forks source link

Incorrect statistic being used for free space alarm #16

Closed ericjsilva closed 2 years ago

ericjsilva commented 2 years ago

Per AWS documentation, the Minimum and Maximum metrics show the free space for a single node and Sum shows the free space across all nodes in the cluster.

The module should accept a variable to handle multi-node clusters and set the statistic accordingly.

https://github.com/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms/blob/eb23546f6ba48dae4b59e6487a87e1bc62a55d1f/alarms.tf#L64

ericjsilva commented 2 years ago

@dubiety any chance this enhancement could get reviewed and accepted?

dubiety commented 2 years ago

Hi @ericjsilva,

Thank you for the great PR and sorry for the late reply. I'll review it today and merge it soon.

dubiety commented 2 years ago

PR merged. The PR is also included in v2.2.0 release.

AndrewFarley commented 2 years ago

@dubiety and @ericjsilva So, I believe this issue and this PR merged is actually an error folks. I'd recommend reverting this PR immediately. Per the AWS documentation.

See: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/cloudwatch-alarms.html

FreeStorageSpace minimum is <= 20480 for 1 minute, 1 consecutive time
A node in your cluster is down to 20 GiB of free storage space. See Lack of available storage space. This value is in MiB, so rather than 20480, we recommend setting it to 25% of the storage space for each node.

The way they describe this "best practice" alarm is if ANY node goes below a threshold. You will never get this from an SUM() of the total free disk space across your nodes. Each individual node can (and will) itself lock writes based on its INDIVIDUAL low disk. This described metric and alarm is NOT intended to be an aggregate-based metric that triggers errors.

I highly recommend you immediately revert this PR.

Separately, if you'd like to ALSO have an SUM-based alert that's fine, please go ahead and do that. But, the MINIMUM-based alarm is CRITICAL for all deployments and should never be removed or overriden because it's an multi-cluster node, because data doesn't magically go evenly to all your ES nodes. I'm using this module on about 30 different clusters I manage over the last few years, and I can tell you from experience, this SUM based alert would not have caught issues that I've had to deal with.

ericjsilva commented 2 years ago

@AndrewFarley I agree with your analysis, and I think this could be improved with additional information in the module about how the Minimum metric actually works in a multi-node cluster. It reports on the lowest disk size of any node in your cluster and not the overall combined space of your cluster.

For example, we have a 6 data node and 3 master node cluster. If each data node had 1TB of disk storage, you would expect your cluster to have 6TB of space and setting a free space alarm of 25% of 6TB is incorrect. You must set the alarm to 25% of 1TB or 250GB, otherwise you will get false positive alarms.

@dubiety I think the README and variable description should be updated to add clarity.