Closed ericjsilva closed 2 years ago
@dubiety any chance this enhancement could get reviewed and accepted?
Hi @ericjsilva,
Thank you for the great PR and sorry for the late reply. I'll review it today and merge it soon.
PR merged. The PR is also included in v2.2.0 release.
@dubiety and @ericjsilva So, I believe this issue and this PR merged is actually an error folks. I'd recommend reverting this PR immediately. Per the AWS documentation.
See: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/cloudwatch-alarms.html
FreeStorageSpace minimum is <= 20480 for 1 minute, 1 consecutive time
A node in your cluster is down to 20 GiB of free storage space. See Lack of available storage space. This value is in MiB, so rather than 20480, we recommend setting it to 25% of the storage space for each node.
The way they describe this "best practice" alarm is if ANY node goes below a threshold. You will never get this from an SUM() of the total free disk space across your nodes. Each individual node can (and will) itself lock writes based on its INDIVIDUAL low disk. This described metric and alarm is NOT intended to be an aggregate-based metric that triggers errors.
I highly recommend you immediately revert this PR.
Separately, if you'd like to ALSO have an SUM-based alert that's fine, please go ahead and do that. But, the MINIMUM-based alarm is CRITICAL for all deployments and should never be removed or overriden because it's an multi-cluster node, because data doesn't magically go evenly to all your ES nodes. I'm using this module on about 30 different clusters I manage over the last few years, and I can tell you from experience, this SUM based alert would not have caught issues that I've had to deal with.
@AndrewFarley I agree with your analysis, and I think this could be improved with additional information in the module about how the Minimum metric actually works in a multi-node cluster. It reports on the lowest disk size of any node in your cluster and not the overall combined space of your cluster.
For example, we have a 6 data node and 3 master node cluster. If each data node had 1TB of disk storage, you would expect your cluster to have 6TB of space and setting a free space alarm of 25% of 6TB is incorrect. You must set the alarm to 25% of 1TB or 250GB, otherwise you will get false positive alarms.
@dubiety I think the README and variable description should be updated to add clarity.
Per AWS documentation, the
Minimum
andMaximum
metrics show the free space for a single node andSum
shows the free space across all nodes in the cluster.The module should accept a variable to handle multi-node clusters and set the statistic accordingly.
https://github.com/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms/blob/eb23546f6ba48dae4b59e6487a87e1bc62a55d1f/alarms.tf#L64