DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0
5 stars 2 forks source link

Increase threshold for "unauthorized" metric alarm #6220

Open hannes-ucsc opened 2 months ago

hannes-ucsc commented 2 months ago

Factored out of #6203

We originally planned to drop metric and alarm but I think it might still be useful to keep the alarm and just increase the threshold so that it doesn't get tripped by SSM or AWS Config. I'm thinking 10 / day.

hannes-ucsc commented 2 months ago

Assign next.

hannes-ucsc commented 2 months ago

We'll inevitably notice the drop in alarms to be triaged during stand-up. No demo.

achave11-ucsc commented 1 month ago

@hannes-ucsc: "This change appears to have more-or-less silenced the unauthorized alarm. I think this is because average is used instead of sum. So now what we'll do is manually modify the alarm in prod and anvilprod and a follow up PR to address this issue permanently. Assignee to also check other alarms for this problem."

achave11-ucsc commented 1 month ago

The following alarms are also configured using 'Average' as apposed to 'Sum',

diff --git a/terraform/shared/shared.tf.json.template.py b/terraform/shared/shared.tf.json.template.py
index 119130d9..0a64e1a8 100644
--- a/terraform/shared/shared.tf.json.template.py
+++ b/terraform/shared/shared.tf.json.template.py
@@ -59,7 +59,7 @@ cis_alarms = [
                                    '!="AwsServiceEvent"}'),
     # https://docs.aws.amazon.com/securityhub/latest/userguide/securityhub-cis-controls.html#securityhub-cis-controls-3.4
     CloudTrailAlarm(name='iam_policy_change',
-                    statistic='Average',
+                    statistic='Sum',
                     filter_pattern='{($.eventName=DeleteGroupPolicy) || ($.eventName=DeleteRolePolicy) || '
                                    '($.eventName=DeleteUserPolicy) || ($.eventName=PutGroupPolicy) || '
                                    '($.eventName=PutRolePolicy) || ($.eventName=PutUserPolicy) || '
@@ -76,7 +76,7 @@ cis_alarms = [
                                    '($.eventName=StopLogging)}'),
     # https://docs.aws.amazon.com/securityhub/latest/userguide/securityhub-cis-controls.html#securityhub-cis-controls-3.8
     CloudTrailAlarm(name='s3_policy_change',
-                    statistic='Average',
+                    statistic='Sum',
                     filter_pattern='{($.eventSource=s3.amazonaws.com) && (($.eventName=PutBucketAcl) || '
                                    '($.eventName=PutBucketPolicy) || ($.eventName=PutBucketCors) || '
                                    '($.eventName=PutBucketLifecycle) || ($.eventName=PutBucketReplication) || '
@@ -90,14 +90,14 @@ cis_alarms = [
                                    '($.eventName=DeleteInternetGateway) || ($.eventName=DetachInternetGateway)}'),
     # https://docs.aws.amazon.com/securityhub/latest/userguide/securityhub-cis-controls.html#securityhub-cis-controls-3.13
     CloudTrailAlarm(name='route_table_change',
-                    statistic='Average',
+                    statistic='Sum',
                     filter_pattern='{($.eventName=CreateRoute) || ($.eventName=CreateRouteTable) || '
                                    '($.eventName=ReplaceRoute) || ($.eventName=ReplaceRouteTableAssociation) || '
                                    '($.eventName=DeleteRouteTable) || ($.eventName=DeleteRoute) || '
                                    '($.eventName=DisassociateRouteTable)}'),
     # https://docs.aws.amazon.com/securityhub/latest/userguide/securityhub-cis-controls.html#securityhub-cis-controls-3.14
     CloudTrailAlarm(name='vpc_change',
-                    statistic='Average',
+                    statistic='Sum',
                     filter_pattern='{($.eventName=CreateVpc) || ($.eventName=DeleteVpc) || '
                                    '($.eventName=ModifyVpcAttribute) || ($.eventName=AcceptVpcPeeringConnection) || '
                                    '($.eventName=CreateVpcPeeringConnection) || '
@@ -107,7 +107,7 @@ cis_alarms = [
                                    '($.eventName=EnableVpcClassicLink)}'),
     # https://docs.aws.amazon.com/securityhub/latest/userguide/securityhub-cis-controls.html#securityhub-cis-controls-1.1
     CloudTrailAlarm(name='root_user',
-                    statistic='Average',
+                    statistic='Sum',
                     filter_pattern='{$.userIdentity.type="Root" && $.userIdentity.invokedBy NOT EXISTS && '
                                    '$.eventType !="AwsServiceEvent"}')
 ]
achave11-ucsc commented 1 month ago

The following is an exhaustive list of AWS Metric alarms configured in the dev AWS account, for the dev deployment. All of the alarms were accounted for in the corresponding files in which they are configured. A note in the Terraform resource docs states that under some conditions this parameter (statistic) must be specified. However, omitting it altogether in the resource configuration makes the field null, when verified in the plan. While it succeeds initial TF validation, I suspect this may fail during the application step (terraform apply). This prevents the likelihood of the field being unintentionally set to 'Average' (or any other statistic). Not explicitly setting the resource's statistics parameter results in it being set to null by Terraform.

This list was composed using the aws cloudwatch describe-alarms and jq commands.

# gitlab.tf.json.template.py, configures the following aws_cloudwatch_metric_alarm
#
{"AlarmName": "azul-gitlab_data_disk_use", "Statistic": "Maximum",}
{"AlarmName": "azul-gitlab_root_disk_use", "Statistic": "Maximum",}
{"AlarmName": "azul-gitlab_cpu_use", "Statistic": "Average",}

# shared.tf.json.template.py, configures the following cloudtrail alarms …
#
{"AlarmName": "azul-api_unauthorized-dev.alarm", "Statistic": "Average",}  # => Sum
{"AlarmName": "azul-iam_policy_change-dev.alarm", "Statistic": "Average",}
{"AlarmName": "azul-s3_policy_change-dev.alarm", "Statistic": "Average",}
{"AlarmName": "azul-route_table_change-dev.alarm", "Statistic": "Average",}
{"AlarmName": "azul-vpc_change-dev.alarm", "Statistic": "Average",}
{"AlarmName": "azul-root_user-dev.alarm", "Statistic": "Average",}
# as well as these.
#
{"AlarmName": "azul-clam_fail-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-clamscan-dev.alarm", "Stat": "Sum"}
{"AlarmName": "azul-cloudtrail_config_change-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-console_no_mfa-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-freshclam-dev.alarm", "Stat": "Sum"}
{"AlarmName": "azul-network_gateway_change-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-root_usage-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-trail_logs-dev.alarm",  "Stat": "Sum"}

# finally, cloudwatch.tf.json.template.py configures these alarms…
#
{"AlarmName": "azul-indexer_5xx-dev", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_aggregate_errors-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_aggregate_retry_errors-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_aggregate_retry_throttles-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_aggregate_throttles-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_contribute_errors-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_contribute_retry_errors-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_contribute_retry_throttles-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_contribute_throttles-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_errors-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_forward_alb_logs_errors-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_forward_alb_logs_throttles-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_forward_s3_logs_errors-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_forward_s3_logs_throttles-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_indexercachehealth_errors-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_indexercachehealth_throttles-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexer_throttles-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-indexercachehealth-dev.alarm", "Stat": "Sum"}
{"AlarmName": "azul-internet_egress-dev", "m1": {"Stat": "Sum"}, "m2": {"Stat": "Sum"}}
{"AlarmName": "azul-internet_ingress-dev", "m1": {"Stat": "Sum"}, "m2": {"Stat": "Sum"}}
{"AlarmName": "azul-service_5xx-dev", "Statistic": "Sum",}
{"AlarmName": "azul-service_errors-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-service_manifest_errors-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-service_manifest_throttles-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-service_servicecachehealth_errors-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-service_servicecachehealth_throttles-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-service_throttles-dev.alarm", "Statistic": "Sum",}
{"AlarmName": "azul-servicecachehealth-dev.alarm", "Stat": "Sum"}
{"AlarmName": "azul-waf_blocked-dev", "m1": {"Stat": "Sum"}, "m2": {"Stat": "Sum"}}
{"AlarmName": "azul-waf_rate_blocked-dev", "Statistic": "Sum",}
{"AlarmName": "azul-vpn_egress-dev", "Stat": "Sum"}
{"AlarmName": "azul-vpn_ingress-dev", "Stat": "Sum"}
achave11-ucsc commented 1 month ago

{"AlarmName": "azul-api_unauthorized-dev.alarm", "Statistic": "Average",} {"AlarmName": "azul-iam_policy_change-dev.alarm", "Statistic": "Average",} {"AlarmName": "azul-s3_policy_change-dev.alarm", "Statistic": "Average",} {"AlarmName": "azul-route_table_change-dev.alarm", "Statistic": "Average",} {"AlarmName": "azul-vpc_change-dev.alarm", "Statistic": "Average",} {"AlarmName": "azul-root_user-dev.alarm", "Statistic": "Average",}

@hannes-ucsc: "These should all be using sum for statistic. The only legitimate use of average is the gitlab_cpu_use alarm."

hannes-ucsc commented 1 week ago

There is an off-by-one error in the implementation. The alarm is triggered with 12 or more events. It should be triggered with 13 or more.

hannes-ucsc commented 1 week ago

This remains, no demo. We will continue to triage the alarms and count the daily number of matching events in order to determine if the off-by-one is fixed.