cds-snc / forms-terraform

Infrastructure as Code for the GC Forms environment
MIT License

fix: add HealthyHostCount alarms to App, IdP, API #818

Closed patheard closed 2 months ago

patheard commented 2 months ago

Summary

Update the App, IdP and API unhealthy host alarms to only trigger warnings that post to Slack.

Add healthy host alarms that trigger SEV1 OpsGenie responses when a service has no healthy hosts. This will currently only trigger an OpsGenie page for the App load balancer target groups.
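As a rough illustration of the second point, a "no healthy hosts" alarm for one load balancer target group might look like the sketch below. This is not the code from this PR. The resource name, alarm name, variables, period, and missing-data handling are all hypothetical placeholders. Only the `aws_cloudwatch_metric_alarm` resource type and the `HealthyHostCount` metric are taken from the plan output.

```terraform
# Hypothetical sketch: every name and value here is an assumption, not the PR's code.
variable "lb_arn_suffix" {
  description = "ARN suffix of the load balancer (assumed input)"
  type        = string
}

variable "target_group_arn_suffix" {
  description = "ARN suffix of the target group to watch (assumed input)"
  type        = string
}

variable "sns_topic_alert_critical_arn" {
  description = "SNS topic assumed to forward to OpsGenie as a SEV1 page"
  type        = string
}

resource "aws_cloudwatch_metric_alarm" "example_healthy_host_count" {
  alarm_name        = "Example-HealthyHostCount-SEV1"
  alarm_description = "Example: target group has no healthy hosts"

  namespace   = "AWS/ApplicationELB"
  metric_name = "HealthyHostCount"
  statistic   = "Maximum"

  # Alarm when the highest healthy host count seen in the period is below 1,
  # which means no host was healthy at any point in that period.
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  period              = 60
  evaluation_periods  = 1

  # Assumed choice: treat a missing metric the same as zero healthy hosts.
  treat_missing_data = "breaching"

  dimensions = {
    LoadBalancer = var.lb_arn_suffix
    TargetGroup  = var.target_group_arn_suffix
  }

  alarm_actions = [var.sns_topic_alert_critical_arn]
  ok_actions    = [var.sns_topic_alert_critical_arn]
}
```

Under the same assumptions, the downgraded unhealthy host alarms would keep an `UnHealthyHostCount` metric but point `alarm_actions` at a warning topic that posts to Slack instead of the critical topic.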

Related

github-actions[bot] commented 2 months ago

⚠ Terraform update available

Terraform: 1.9.5 (using 1.9.2)
Terragrunt: 0.67.4 (using 0.63.2)

github-actions[bot] commented 2 months ago

Staging: alarms

✅   Terraform Init: success
✅   Terraform Validate: success
✅   Terraform Format: success
✅   Terraform Plan: success
✅   Conftest: success

⚠️   Warning: resources will be destroyed by this change!

Plan: 6 to add, 3 to change, 2 to destroy
Show summary

| CHANGE   | NAME                                                               |
|----------|--------------------------------------------------------------------|
| add      | `aws_cloudwatch_metric_alarm.ELB_healthy_hosts`                    |
|          | `aws_cloudwatch_metric_alarm.api_lb_healthy_host_count[0]`         |
|          | `aws_cloudwatch_metric_alarm.idb_lb_healthy_host_count["HTTP1"]`   |
|          | `aws_cloudwatch_metric_alarm.idb_lb_healthy_host_count["HTTP2"]`   |
| update   | `aws_cloudwatch_dashboard.forms_service_health`                    |
|          | `aws_cloudwatch_metric_alarm.idb_lb_unhealthy_host_count["HTTP1"]` |
|          | `aws_cloudwatch_metric_alarm.idb_lb_unhealthy_host_count["HTTP2"]` |
| recreate | `aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup1`      |
|          | `aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup2`      |

✂   Warning: plan has been truncated! See the full plan in the logs.

Show plan ```terraform Resource actions are indicated with the following symbols: + create ~ update in-place -/+ destroy and then create replacement Terraform will perform the following actions: # aws_cloudwatch_dashboard.forms_service_health will be updated in-place ~ resource "aws_cloudwatch_dashboard" "forms_service_health" { ~ dashboard_body = jsonencode( { - widgets = [ - { - height = 8 - properties = { - metrics = [ - [ - "AWS/RDS", - "CPUUtilization", - "DBClusterIdentifier", - "forms-staging-db-cluster", - { - color = "#17becf" - region = "ca-central-1" }, ], ] - period = 60 - region = "ca-central-1" - sparkline = true - stacked = false - stat = "Average" - title = "DB: CPU use" - view = "timeSeries" } - type = "metric" - width = 6 - x = 0 - y = 111 }, - { - height = 8 - properties = { - metrics = [ - [ - "AWS/RDS", - "FreeableMemory", - "DBClusterIdentifier", - "forms-staging-db-cluster", - { - color = "#9467bd" }, ], ] - period = 60 - region = "ca-central-1" - sparkline = true - stacked = false - stat = "Average" - title = "DB: freeable memory" - view = "timeSeries" } - type = "metric" - width = 6 - x = 6 - y = 111 }, - { - height = 8 - properties = { - metrics = [ - [ - "AWS/RDS", - "ReadLatency", - "DBClusterIdentifier", - "forms-staging-db-cluster", - { - color = "#c5b0d5" }, ], ] - period = 60 - region = "ca-central-1" - sparkline = true - stacked = false - stat = "Average" - title = "DB: read latency" - view = "timeSeries" } - type = "metric" - width = 6 - x = 12 - y = 111 }, - { - height = 8 - properties = { - metrics = [ - [ - "AWS/RDS", - "WriteLatency", - "DBClusterIdentifier", - "forms-staging-db-cluster", - { - color = "#7f7f7f" }, ], ] - period = 60 - region = "ca-central-1" - sparkline = true - stacked = false - stat = "Average" - title = "DB: write latency" - view = "timeSeries" } - type = "metric" - width = 6 - x = 18 - y = 111 }, - { - height = 2 - properties = { - background = "transparent" - markdown = <<-EOT # Form submissions Tracking 
form submissions flow through the system. EOT } - type = "text" - width = 24 - x = 0 - y = 0 }, - { - height = 6 - properties = { - metrics = [ - [ - "AWS/SQS", - "NumberOfMessagesReceived", - "QueueName", - "submission_processing.fifo", - { - color = "#8c564b" }, ], ] - period = 300 - region = "ca-central-1" - stacked = false - stat = "Sum" - title = "Queue: submission messages" - view = "timeSeries" } - type = "metric" - width = 8 - x = 0 - y = 14 }, - { - height = 6 - properties = { - metrics = [ - [ - "AWS/SQS", - "ApproximateAgeOfOldestMessage", - "QueueName", - "submission_processing.fifo", - { - color = "#7f7f7f" - label = "Oldest message age" - region = "ca-central-1" }, ], ] - period = 300 - region = "ca-central-1" - sparkline = true - stat = "Average" - title = "Queue: submission message age" - view = "singleValue" } - type = "metric" - width = 4 - x = 8 - y = 14 }, - { - height = 3 - properties = { - background = "transparent" - markdown = <<-EOT # Form responses Tracking form response list, retrieval and confirm. EOT } - type = "text" - width = 24 - x = 0 - y = 20 }, - { - height = 3 - properties = { - background = "transparent" - markdown = <<-EOT ## Lambdas Performance metrics for the Submission and Reliability functions. 
EOT } - type = "text" - width = 24 - x = 0 - y = 83 }, - { - height = 7 - properties = { - metrics = [ - [ - "AWS/ECS", - "CPUUtilization", - "ServiceName", - "form-viewer", - "ClusterName", - "Forms", - { - region = "ca-central-1" - stat = "Minimum" }, ], - [ - "...", - { - region = "ca-central-1" - stat = "Maximum" }, ], - [ - "...", - { - region = "ca-central-1" - stat = "Average" }, ], ] - period = 300 - region = "ca-central-1" - stacked = false - title = "App: CPU use" - view = "timeSeries" } - type = "metric" - width = 8 - x = 0 - y = 76 }, - { - height = 7 - properties = { - metrics = [ - [ - "AWS/ECS", - "MemoryUtilization", - "ServiceName", - "form-viewer", - "ClusterName", - "Forms", - { - stat = "Minimum" }, ], - [ - "...", - { - stat = "Maximum" }, ], - [ - "...", - { - stat = "Average" }, ], ] - period = 300 - region = "ca-central-1" - stacked = false - title = "App: memory use" - view = "timeSeries" } - type = "metric" - width = 8 - x = 8 - y = 76 }, - { - height = 3 - properties = { - background = "transparent" - markdown = <<-EOT ## Load balancer Requests, errors and response time for the app's load balancer. 
EOT } - type = "text" - width = 24 - x = 0 - y = 98 }, - { - height = 6 - properties = { - metrics = [ - [ - "AWS/Lambda", - "Invocations", - "FunctionName", - "Submission", - { - region = "ca-central-1" }, ], - [ - ".", - "Throttles", - ".", - ".", - { - color = "#ffbb78" - region = "ca-central-1" }, ], - [ - ".", - "Errors", - ".", - ".", - { - color = "#d62728" - region = "ca-central-1" }, ], ] - period = 300 - region = "ca-central-1" - stacked = false - stat = "Sum" - title = "Lambda: submission" - view = "timeSeries" } - type = "metric" - width = 18 - x = 0 - y = 86 }, - { - height = 6 - properties = { - metrics = [ - [ - "AWS/Lambda", - "Duration", - "FunctionName", - "Submission", - "Resource", - "Submission", - { - color = "#555555" - region = "ca-central-1" }, ], ] - period = 300 - region = "ca-central-1" - sparkline = true - stacked = false - stat = "Average" - title = "Lambda: submission duration" - view = "singleValue" } - type = "metric" - width = 6 - x = 18 - y = 86 }, - { - height = 6 - properties = { - metrics = [ - [ - "AWS/Lambda", - "Invocations", - "FunctionName", - "reliability", - "Resource", - "reliability", - { - region = "ca-central-1" }, ], - [ - ".", - "Throttles", - ".", - ".", - ".", - ".", - { - color = "#ffbb78" - region = "ca-central-1" }, ], - [ - ".", - "Errors", - ".", - ".", - ".", - ".", - { - color = "#d62728" - region = "ca-central-1" }, ], ] - period = 300 - region = "ca-central-1" - stacked = false - stat = "Sum" - title = "Lambda: reliability" - view = "timeSeries" } - type = "metric" - width = 18 - x = 0 - y = 92 }, - { - height = 6 - properties = { - metrics = [ - [ - "AWS/Lambda", - "Duration", - "FunctionName", - "reliability", - "Resource", - "reliability", - { - color = "#555" }, ], ] - period = 300 - region = "ca-central-1" - sparkline = true - stacked = false - stat = "Average" - title = "Lambda: reliabiity duration" - view = "singleValue" } - type = "metric" - width = 6 - x = 18 - y = 92 }, - { - height = 7 - 
properties = { - metrics = [ - [ - "ECS/ContainerInsights", - "NetworkRxBytes", - "ClusterName", - "Forms", - { - color = "#1f77b4" - region = "ca-central-1" }, ], ] - period = 300 - region = "ca-central-1" - stacked = false - stat = "Sum" - title = "App: network bytes" - view = "timeSeries" } - type = "metric" - width = 8 - x = 16 - y = 76 }, - { - height = 7 - properties = { - metrics = [ - [ - "AWS/ApplicationELB", - "RequestCount", - "LoadBalancer", - "app/form-viewer/5e6bc2d9ab810b68", - { - color = "#2ca02c" - label = "Request count" - region = "ca-central-1" }, ], - [ - ".", - "HTTPCode_ELB_4XX_Count", - ".", - ".", - { - color = "#ffbb78" - label = "4XX response count" - region = "ca-central-1" }, ], - [ - ".", - "HTTPCode_ELB_5XX_Count", - ".", - ".", - { - color = "#d62728" - label = "5XX response count" - region = "ca-central-1" }, ], ] - period = 300 - region = "ca-central-1" - stacked = false - stat = "Sum" - title = "LB: requests" - view = "timeSeries" } - type = "metric" - width = 9 - x = 0 - y = 101 }, - { - height = 7 - properties = { - metrics = [ - [ - "AWS/ApplicationELB", - "TargetResponseTime", - "LoadBalancer", - "app/form-viewer/5e6bc2d9ab810b68", - { - color = "#8c564b" - region = "ca-central-1" }, ], ] - period = 300 - region = "ca-central-1" - sparkline = true - stacked = false - stat = "Average" - title = "LB: response time" - view = "singleValue" } - type = "metric" - width = 6 - x = 18 - y = 101 }, - { - height = 3 - properties = { - background = "transparent" - markdown = <<-EOT ## Database Performance metrics for the database cluster. 
EOT } - type = "text" - width = 24 - x = 0 - y = 108 }, - { - height = 7 - properties = { - metrics = [ - [ - "AWS/ApplicationELB", - "ActiveConnectionCount", - "LoadBalancer", - "app/form-viewer/5e6bc2d9ab810b68", - { - color = "#e377c2" }, ], ] - period = 300 - region = "ca-central-1" - stacked = false - stat = "Average" - title = "LB: connections" - view = "timeSeries" } - type = "metric" - width = 9 - x = 9 - y = 101 }, - { - height = 8 - properties = { - query = <<-EOT SOURCE 'Forms' | SOURCE '/aws/lambda/Reliability' | SOURCE '/aws/lambda/Submission' | SOURCE '/aws/lambda/Nagware' | SOURCE '/aws/lambda/Response_Archiver' | SOURCE '/aws/lambda/Vault_Data_Integrity_Check' | fields @timestamp, @message, @logStream, @log | filter level = 'error' or level = 'warn' or status = 'failed' | filter @message not like /days since submission/ | sort @timestamp desc | limit 1000 EOT - region = "ca-central-1" - stacked = false - title = "Errors: app and lambdas" - view = "table" } - type = "log" - width = 20 - x = 0 - y = 64 }, - { - height = 8 - properties = { - alarms = [ - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:CpuUtilizationWarn", - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:MemoryUtilizationWarn", - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:HTTPCode_ELB_5XX_Count", - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:ResponseTimeWarn", - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:UnHealthyHostCount-TargetGroup1-SEV1", - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:UnHealthyHostCount-TargetGroup2-SEV1", - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:ReliabilityDeadLetterQueueWarn", ] - title = "Alarms" } - type = "alarm" - width = 4 - x = 20 - y = 64 }, - { - height = 2 - properties = { - background = "transparent" - markdown = <<-EOT # Performance EOT } - type = "text" - width = 24 - x = 0 - y = 72 }, - { - height = 7 - properties = { - query = <<-EOT SOURCE 'Forms' | fields @message | filter @message =~ 
/HealthCheck: cognito sign-up/ | parse @message "success" as @successCount | parse @message "failure" as @failureCount | stats count(@successCount) as Success, count(@failureCount) as Failed by bin(5m) EOT - region = "ca-central-1" -... ```
Show Conftest results

```sh
WARN - plan.json - main - Missing Common Tags: ["aws_athena_data_catalog.dynamodb"]
WARN - plan.json - main - Missing Common Tags: ["aws_athena_data_catalog.rds_data_catalog"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_event_rule.codedeploy_sns"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_log_group.notify_slack"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ELB_5xx_error_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ELB_healthy_hosts"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup1"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup2"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.alb_ddos"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_cpu_utilization_high_warn[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_lb_healthy_host_count[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_lb_unhealthy_host_count[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_memory_utilization_high_warn[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_response_time_warn[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.audit_log_dead_letter_queue_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.cognito_login_outside_canada_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.cognito_signin_exceeded"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ddos_detected_forms_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ddos_detected_route53_warn[0]"]
WARN - plan.json - main - Missing Common Tags:...
```