Closed by dsotirho-ucsc 2 months ago
@hannes-ucsc: "Assignee to investigate if a metric math query or some other aggregation mechanism can be used to only trigger the alarm if the number of blocked requests exceeds twice the request rate limit for any given client IP. This would only work if the client IP is tracked as a dimension in the datapoints published by WAF."
@hannes-ucsc: "Based on @dsotirho-ucsc's research, it appears highly unlikely that WAF includes the specific IP address of a client as a dimension in any of the CloudWatch metrics that it publishes. https://docs.aws.amazon.com/waf/latest/developerguide/waf-metrics.html Assignee to explore the possibility of having two different rate limiting rules: one for returning a 429 response with Retry-After at a lower limit, and another one to trigger the alarm, at a higher limit."
Tested with two separate WAF rate-based rules active at the same time: one with a limit of 1000/5min (per IP) that blocks requests, and one with a limit of 2000/5min (per IP) that blocks and also trips an alarm. For this to work, the rule with the larger limit must have the lower numeric priority value. WAF evaluates rules in ascending order of priority, so that rule is evaluated first.
The lower limit rate rule successfully blocked requests when the request rate exceeded the threshold, and the higher limit rate rule successfully tripped the alarm when its threshold was exceeded.
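To illustrate why the ordering matters, here is a minimal sketch of the evaluation order in plain Python. It is a simplified model, not WAF itself: it assumes a rate rule matches once a client's request count in the current window exceeds its limit, and that the first matching blocking rule ends evaluation. The function `first_blocking_rule` and the constant `RATE_LIMIT` are hypothetical names used only for this example.

```python
from typing import Optional

# Hypothetical stand-in for config.waf_rate_rule_limit (requests per 5 min per IP)
RATE_LIMIT = 1000


def first_blocking_rule(request_count: int,
                        rules: list[tuple[str, int]]) -> Optional[str]:
    """
    Return the name of the first rule (in evaluation order) whose limit is
    exceeded by the given per-IP request count, or None if no rule matches.
    A blocking rule is terminating, so later rules are never consulted.
    """
    for name, limit in rules:
        if request_count > limit:
            return name
    return None


# Ordering used in the patch: the alarm rule (larger limit) comes first
alarm_first = [('RateAlarmRule', RATE_LIMIT * 2), ('RateRule', RATE_LIMIT)]
# Reversed ordering, shown for comparison
block_first = list(reversed(alarm_first))

for count in (500, 1500, 2500):
    print(count,
          first_blocking_rule(count, alarm_first),
          first_blocking_rule(count, block_first))
```

In this model, a count of 2500 is attributed to `RateAlarmRule` only when that rule is evaluated first. With the reversed ordering, `RateRule` matches every count above 1000 and the alarm rule is never reached.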
Index: src/azul/__init__.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/__init__.py b/src/azul/__init__.py
--- a/src/azul/__init__.py (revision 6a6ea97331c43b0f249455b7d315804fa16d1c11)
+++ b/src/azul/__init__.py (date 1718214034766)
@@ -1566,8 +1566,14 @@
allowed_v4_ips_term = 'allowed_v4_ips'
+ # We use two rate rules, one that blocks requests when a lower threshold is
+ # exceeded, and one that triggers an alarm when a higher threshold is
+ # exceeded.
+
waf_rate_rule_name = 'RateRule'
+ waf_rate_alarm_rule_name = 'RateAlarmRule'
+
waf_rate_rule_period = 300 # seconds; this value is fixed by AWS
waf_rate_rule_retry_after = 30 # seconds
Index: terraform/cloudwatch.tf.json.template.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/terraform/cloudwatch.tf.json.template.py b/terraform/cloudwatch.tf.json.template.py
--- a/terraform/cloudwatch.tf.json.template.py (revision 6a6ea97331c43b0f249455b7d315804fa16d1c11)
+++ b/terraform/cloudwatch.tf.json.template.py (date 1718152007910)
@@ -316,7 +316,7 @@
'dimensions': {
'WebACL': '${aws_wafv2_web_acl.api_gateway.name}',
'Region': config.region,
- 'Rule': config.waf_rate_rule_name
+ 'Rule': config.waf_rate_alarm_rule_name
},
'alarm_actions': ['${data.aws_sns_topic.monitoring.arn}'],
'ok_actions': ['${data.aws_sns_topic.monitoring.arn}'],
Index: terraform/api_gateway.tf.json.template.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/terraform/api_gateway.tf.json.template.py b/terraform/api_gateway.tf.json.template.py
--- a/terraform/api_gateway.tf.json.template.py (revision 6a6ea97331c43b0f249455b7d315804fa16d1c11)
+++ b/terraform/api_gateway.tf.json.template.py (date 1718226011468)
@@ -242,33 +242,44 @@
('AllowedIPs', 'allow', config.allowed_v4_ips_term)
]
],
- {
- 'name': config.waf_rate_rule_name,
- 'action': {
- 'block': {
- 'custom_response': {
- 'response_code': 429,
- 'response_header': [
- {
- 'name': 'Retry-After',
- 'value': str(config.waf_rate_rule_retry_after)
- }
- ]
- }
- }
- },
- 'statement': {
- 'rate_based_statement': {
- 'limit': config.waf_rate_rule_limit,
- 'aggregate_key_type': 'IP'
- }
- },
- 'visibility_config': {
- 'metric_name': config.waf_rate_rule_name,
- 'sampled_requests_enabled': True,
- 'cloudwatch_metrics_enabled': True
+ *[
+ {
+ 'name': name,
+ 'action': {
+ 'block': {
+ 'custom_response': {
+ 'response_code': 429,
+ 'response_header': [
+ {
+ 'name': 'Retry-After',
+ 'value': str(config.waf_rate_rule_retry_after)
+ }
+ ]
+ }
+ }
+ },
+ 'statement': {
+ 'rate_based_statement': {
+ 'limit': limit,
+ 'aggregate_key_type': 'IP'
+ }
+ },
+ 'visibility_config': {
+ 'metric_name': name,
+ 'sampled_requests_enabled': True,
+ 'cloudwatch_metrics_enabled': True
+ }
}
- },
+ for name, limit in [
+ # The rate rule with the larger limit needs
+ # to be defined first, otherwise the rule
+ # with the smaller limit would trigger first
+ # and prevent evaluation of any following
+ # rules.
+ (config.waf_rate_alarm_rule_name, config.waf_rate_rule_limit * 2),
+ (config.waf_rate_rule_name, config.waf_rate_rule_limit),
+ ]
+ ],
{
'name': 'AWS-CommonRuleSet',
'override_action': {
Assignee to prepare PR.
For demo, run the flooder script with three different rates: below, at, and above the limit. Show that an alarm is only triggered for the last of these rates.
We had to modify the script to successfully complete the demo. I'll be committing these changes at a later time in one of my PRs.
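For planning the three demo rates, a small helper along these lines may be useful. This is not the flooder script or its interface. It is a hypothetical sketch that assumes the rates are expressed as requests per 300-second window (the WAF rate rule period) and that a 20% margin is enough to land clearly below and above the limit. The names `demo_rates` and `request_interval` are made up for this example.

```python
PERIOD = 300  # seconds; same value as waf_rate_rule_period in the patch


def demo_rates(limit: int, margin_percent: int = 20) -> dict[str, int]:
    """
    Pick three request rates (requests per PERIOD) relative to a limit:
    one below it, one equal to it, and one above it.
    """
    return {
        'below': limit * (100 - margin_percent) // 100,
        'at': limit,
        'above': limit * (100 + margin_percent) // 100,
    }


def request_interval(rate: int, period: int = PERIOD) -> float:
    """
    Seconds to wait between requests so that `rate` requests are spread
    evenly over one period.
    """
    return period / rate


# Example: rates around a limit of 2000 requests per 5 minutes
for label, rate in demo_rates(2000).items():
    print(f'{label}: {rate} requests per {PERIOD}s, '
          f'one request every {request_interval(rate):.4f}s')
```

The limit passed in is whichever threshold the demo is meant to straddle. The margin is an arbitrary choice and may need widening if request timing is noisy.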
Note that in this example, there was one 10-minute period on 2024-06-02 where the user's request rate exceeded 1000/5min. This is a valid situation for the waf_rate_blocked alarm to fire.
Later, on 2024-06-04, however, there are multiple occurrences where the user's request rate peaked at or just above 1000/5min and then dropped back below the threshold. These cases should be considered non-abusive and should not fire the alarm.
CloudWatch logs (prod):