Amazon CloudWatch Alarms is natively integrated with Amazon CloudWatch metrics. Many AWS services send metrics to CloudWatch, and AWS also offers several ways to emit your applications’ metrics as custom metrics. CloudWatch Alarms let you watch a metric and detect when it crosses a static threshold or falls outside an anomaly detection band. Composite alarms additionally let you monitor the combined result of multiple alarms. An alarm automatically initiates actions when its state changes between OK, ALARM, and INSUFFICIENT_DATA.
The most commonly used alarm action is to notify the responsible people or trigger downstream automation by sending a message to an Amazon Simple Notification Service (SNS) topic. CloudWatch Alarms are designed to invoke alarm actions only when a state change happens. The one exception is Auto Scaling actions, where the scaling action keeps being invoked periodically while the alarm remains in the state that was configured for the action.
There are scenarios where repeated notifications on certain critical alarms are useful, so that the responsible team is alerted and acts promptly. In this post, I will show you how to use Amazon EventBridge, AWS Step Functions, and AWS Lambda to enable repeated alarm notification on selected CloudWatch Alarms. I will also discuss other customization use cases that can be built on alarm state change events using the same solution model.
Since 2019, Amazon EventBridge has been integrated with Amazon CloudWatch, so that when a CloudWatch alarm’s state changes, a corresponding CloudWatch alarm state change event is sent to EventBridge. You can create an EventBridge rule with a custom event pattern to capture
When an event matches, the rule invokes downstream automation to process the alarm’s state change event. This solution uses an AWS Step Functions state machine to orchestrate the repeated alarm notification workflow.
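To make the rule pattern concrete, here is a minimal sketch of a pattern that matches only transitions into the ALARM state, checked against a trimmed sample event. The source and detail-type values and the detail.state.value field follow the CloudWatch alarm state change event format. The matches helper is a hypothetical, simplified stand-in for EventBridge matching (exact values only), the alarm name and account ID are invented, and the pattern that the solution actually deploys may differ.

```python
# Event pattern: each leaf is a list of allowed values, nested dicts
# descend into the event structure.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern, event):
    """Simplified matcher (assumption): exact-value matching only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: the event must have a nested object too.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def sample_event(state):
    """Trimmed alarm state change event with invented identifiers."""
    return {
        "source": "aws.cloudwatch",
        "detail-type": "CloudWatch Alarm State Change",
        "region": "us-east-1",
        "resources": [
            "arn:aws:cloudwatch:us-east-1:123456789012:alarm:demo-alarm"
        ],
        "detail": {
            "alarmName": "demo-alarm",
            "previousState": {"value": "OK"},
            "state": {"value": state, "reason": "test"},
        },
    }


if __name__ == "__main__":
    print(matches(ALARM_PATTERN, sample_event("ALARM")))
    print(matches(ALARM_PATTERN, sample_event("OK")))
```

Restricting the pattern to the ALARM state keeps the state machine from starting on transitions that never need a repeated notification.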
In this solution, we will enable repeated alarm notification by applying a specific tag to the CloudWatch alarm resources. Within the state machine, a Lambda function queries the tags of the triggered alarm and only processes it further when the specific tag is present.
This solution is deployed as an AWS Cloud Development Kit (CDK) application that deploys the resources highlighted within the blue rectangle in the following diagram to your AWS account. These resources are:
This solution works as follows:
When an event matches, the EventBridge rule invokes the Step Functions state machine target.
Once the state machine starts execution, it first enters a Wait state (“Wait X Seconds” as shown in the following figure). The wait period can be configured in the CDK application and is passed to the state machine definition.
Then, it enters the Lambda Invocation task (“Check alarm tag and status” in the following figure).
Then, the Choice state (“Is alarm still in ALARM state?” in the following figure) checks the alarm state returned by the Lambda function and directs the workflow as follows:
The repeated notification for an alarm within the workflow above stops when:
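The wait, check, and choice loop described above can be sketched in plain Python. This is an illustration under stated assumptions, not the solution's state machine or Lambda code: the function names, the "STOP" result, and the in-memory alarm dictionary are all hypothetical, and the sketch assumes that a notification is re-sent whenever the alarm is still tagged and still in the ALARM state.

```python
from typing import Callable, Dict, List


def check_alarm_tag_and_status(alarm: Dict, tag_key: str, tag_value: str,
                               notify: Callable[[Dict], None]) -> str:
    """Stand-in for the "Check alarm tag and status" task (assumption)."""
    tagged = alarm.get("tags", {}).get(tag_key) == tag_value
    if tagged and alarm.get("state") == "ALARM":
        notify(alarm)  # assumed: re-send the notification
        return "ALARM"
    return "STOP"


def run_workflow(get_alarm: Callable[[], Dict],
                 notify: Callable[[Dict], None],
                 wait: Callable[[], None],
                 tag_key: str = "RepeatedAlarm",
                 tag_value: str = "true",
                 max_rounds: int = 1000) -> int:
    """Loop: wait, check the alarm, then choose to repeat or stop."""
    sent = 0
    for _ in range(max_rounds):
        wait()  # "Wait X Seconds"
        result = check_alarm_tag_and_status(
            get_alarm(), tag_key, tag_value, notify)
        if result != "ALARM":  # "Is alarm still in ALARM state?"
            break
        sent += 1
    return sent


def simulate(states: List[str], tagged: bool = True) -> int:
    """Replay a sequence of alarm states and count notifications sent."""
    remaining = iter(states)
    tags = {"RepeatedAlarm": "true"} if tagged else {}
    sent: List[Dict] = []

    def get_alarm() -> Dict:
        # Once the scripted states run out, the alarm is treated as OK.
        return {"name": "demo-alarm",
                "state": next(remaining, "OK"),
                "tags": tags}

    run_workflow(get_alarm, notify=sent.append, wait=lambda: None)
    return len(sent)


if __name__ == "__main__":
    print(simulate(["ALARM", "ALARM", "OK"]))  # 2
```

In the sketch, an alarm that stays in ALARM for two checks and then recovers produces two repeated notifications, and an untagged alarm produces none.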
Now, let’s deploy the solution and see how it works.
Before you can deploy a CDK application, make sure that you have the AWS CDK CLI installed and your AWS account bootstrapped as described here. Then run the following commands from your terminal to download the solution code and deploy it.
git clone https://github.com/aws-samples/amazon-cloudwatch-alarms-repeated-notification-cdk.git
cd amazon-cloudwatch-alarms-repeated-notification-cdk
npm install
npm run build
cdk bootstrap  # Required for first-time CDK deployment
cdk deploy --parameters RepeatedNotificationPeriod=300 --parameters TagForRepeatedNotification=RepeatedAlarm:true --parameters RequireResourceGroup=false
With the “cdk deploy” command, you can also configure the following parameters:
RepeatedNotificationPeriod: The time in seconds between two consecutive notifications from an alarm. The default is set to 300 in the CDK code.
TagForRepeatedNotification: The tag used to enable repeated notification on an alarm. It must be a key:value pair. The default for this parameter is RepeatedAlarm:true.
RequireResourceGroup: Whether to create a tag-based resource group to monitor all CloudWatch Alarms with repeated notification enabled. Allowed values: true/false.
Because this is a new deployment, you will see a summary of the IAM resources to be created in the target account. These IAM resources are used by the components in the solution. No changes are made to any existing IAM resources in your account. Review the changes and enter “y” to continue the deployment.
You will then see the progress of the deployment in your terminal. Wait for it to finish. You can also follow the progress of the deployment in the AWS CloudFormation console.
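The key:value format expected by TagForRepeatedNotification can be illustrated with a short sketch that splits the parameter and checks it against an alarm's tags. The helper names are hypothetical and the solution's own parsing may differ. The tag list uses the Key/Value shape that the CloudWatch tagging commands work with.

```python
from typing import Dict, List, Tuple


def parse_tag_parameter(param: str) -> Tuple[str, str]:
    """Split a 'key:value' string on the first colon."""
    key, sep, value = param.partition(":")
    if not sep or not key or not value:
        raise ValueError(f"expected key:value, got {param!r}")
    return key, value


def alarm_has_tag(tags: List[Dict[str, str]], param: str) -> bool:
    """Return True if the tag list contains the configured key and value."""
    key, value = parse_tag_parameter(param)
    return any(
        tag.get("Key") == key and tag.get("Value") == value
        for tag in tags
    )


if __name__ == "__main__":
    alarm_tags = [{"Key": "RepeatedAlarm", "Value": "true"}]
    print(parse_tag_parameter("RepeatedAlarm:true"))
    print(alarm_has_tag(alarm_tags, "RepeatedAlarm:true"))
```

Splitting on the first colon only means a tag value may itself contain colons, while a parameter with no colon at all is rejected early.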
Once the deployment completes, you can test the solution on an alarm by applying the tag that you used.
aws cloudwatch tag-resource --resource-arn arn:aws:cloudwatch:<region>:<account_id>:alarm:<alarm_name> --tags Key=RepeatedAlarm,Value=true
aws cloudwatch set-alarm-state --alarm-name <alarm_name> --state-value OK --state-reason "test"
AWS Resource Groups lets you search and group AWS resources based on tags. In this post, I will show you how to use a resource group to get a centralized view of all of the alarms with repeated notification enabled.
Run the following CLI command to untag the CloudWatch alarm. You should see the alarm disappear from the resource group created in the previous step as well:
aws cloudwatch untag-resource --resource-arn arn:aws:cloudwatch:<region>:<account_id>:alarm:<alarm_name> --tag-keys RepeatedAlarm
Since April 2021, Amazon EventBridge has supported cross-Region event routing. With this feature, you only need to deploy this solution in one of the supported destination Regions, as listed here, to process the repeated notification workflow for alarms in any commercial AWS Region. The solution is shown in the following diagram.
This framework lets you centralize alarm state change events from any commercial Region in a single supported Region. This significantly reduces the operational overhead of resource management and troubleshooting.
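When events from several Regions arrive in one place, the processing code needs to know which Region an alarm lives in. The sketch below recovers that from the alarm ARN, using the same ARN layout as the tag-resource command shown earlier. The function name and the sample ARN are hypothetical, and whether the solution derives the Region this way is an assumption.

```python
from typing import Dict


def parse_alarm_arn(arn: str) -> Dict[str, str]:
    """Split arn:<partition>:cloudwatch:<region>:<account>:alarm:<name>."""
    # maxsplit=6 keeps any colons inside the alarm name intact.
    parts = arn.split(":", 6)
    if (len(parts) != 7 or parts[0] != "arn"
            or parts[2] != "cloudwatch" or parts[5] != "alarm"):
        raise ValueError(f"not a CloudWatch alarm ARN: {arn!r}")
    return {
        "partition": parts[1],
        "region": parts[3],
        "account_id": parts[4],
        "alarm_name": parts[6],
    }


if __name__ == "__main__":
    info = parse_alarm_arn(
        "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:demo-alarm"
    )
    print(info["region"], info["alarm_name"])
```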
This post mainly shows you how to use the native alarm state change event with Amazon EventBridge and AWS Step Functions to enable repeated notification. However, using Amazon EventBridge to capture these events and orchestrate downstream workflows also lets you perform more advanced alarm processing tasks through the various targets that EventBridge supports. For example, you can enrich, format, or pretty-print the alarm message, or execute playbooks, with a Lambda function target or SSM Automation.
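As a small illustration of the formatting use case, the following sketch turns an alarm state change event into a short, readable message. The function name format_alarm_message and the sample values are hypothetical. The field names follow the CloudWatch alarm state change event, but this is only one possible layout, not the solution's output.

```python
from typing import Dict


def format_alarm_message(event: Dict) -> str:
    """Build a plain-text summary from an alarm state change event."""
    detail = event["detail"]
    state = detail["state"]
    previous = detail.get("previousState", {}).get("value", "UNKNOWN")
    lines = [
        f"Alarm: {detail['alarmName']}",
        f"Region: {event.get('region', 'unknown')}",
        f"State: {previous} -> {state['value']}",
        f"Reason: {state.get('reason', 'n/a')}",
        # Prefer the state timestamp, fall back to the event time.
        f"Time: {state.get('timestamp', event.get('time', 'n/a'))}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {
        "region": "us-east-1",
        "time": "2021-06-01T12:00:00Z",
        "detail": {
            "alarmName": "demo-alarm",
            "previousState": {"value": "OK"},
            "state": {"value": "ALARM", "reason": "Threshold crossed"},
        },
    }
    print(format_alarm_message(sample))
```

A function like this could run as a Lambda target and publish the resulting text to a notification channel in place of the raw JSON event.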
To avoid additional infrastructure costs from the examples described in this post, make sure that you delete all of the resources created. You can clean up the resources by running the following commands:
cd amazon-cloudwatch-alarms-repeated-notification-cdk
cdk destroy
In addition, the Lambda function created by this solution writes to a CloudWatch Logs log group with the prefix “/aws/lambda/RepeatedCloudWatchAlarm”. Make sure to delete the log group to avoid CloudWatch Logs storage charges.
In this post, we’ve focused on enabling repeated notification on CloudWatch Alarms by using the alarm state change event with Amazon EventBridge and AWS Step Functions. With this solution, you are less likely to miss mission-critical alarms and can improve your incident response time. The same framework can also be extended to handle more advanced alarm processing tasks. Please share your feedback about the solution.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.