Azure / azure-monitor-baseline-alerts

Azure Monitor Baseline Alerts
MIT License

[Bug]: AGW Compute Units Alert and AGW Unhealthy Host Count Alert remain non-compliant after successful remediation #280

Closed: Greg-Court closed this issue 1 day ago

Greg-Court commented 1 month ago

Description

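In the initiative Deploy Azure Monitor Baseline Alerts for Landing Zone, the following two policies remain non-compliant after remediation: Deploy AGW Compute Units Alert and Deploy AGW Unhealthy Host Count Alert.

The remediation tasks complete successfully, but the resources remain non-compliant.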

The landing zone itself was deployed via the Terraform CAF Enterprise Scale module, and the Azure Monitor Baseline Alerts were deployed via the Azure CLI using the following command:

az deployment mg create --name "amba-GeneralDeployment" --template-uri https://raw.githubusercontent.com/Greg-Court/azure-monitor-baseline-alerts/main/patterns/alz/alzArm.json --location "uksouth" --management-group-id "_redacted_" --parameters ./alzArm.param.json

A fork of the latest version of the AMBA repo was used.

Please don't hesitate to ask more questions, happy to provide any information that might help resolve the issue.

Deploy AGW Compute Units Alert

Compliance details

Compliance state: Non-compliant
Last evaluated: 15/07/2024, 15:01:23 BST
Definition version (preview): 1.0.0
Initiative version (preview): 1.0.0
Non-compliance message: Alerting must be deployed to Azure services.
Reason for non-compliance: No related resources match the effect details in the policy definition.

Existence condition

Type: Microsoft.Insights/metricAlerts
Last evaluated resource (out of 10): /subscriptions/redacted/resourcegroups/redacted/providers/Microsoft.Insights/metricAlerts/_redacted_agFailedRequests
Reason for non-compliance: Current value must be equal to the target value.
Field: Microsoft.Insights/metricAlerts/criteria.Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria.allOf[*].metricName
Path: properties.criteria.allOf[*].metricName
Current value: "FailedRequests"
Target value: "UnhealthyHostCount"

Deploy AGW Unhealthy Host Count Alert

Compliance details

Compliance state: Non-compliant
Last evaluated: 15/07/2024, 15:01:23 BST
Definition version (preview): 1.0.0
Initiative version (preview): 1.0.0
Reason for non-compliance: Current value must be equal to the target value.
Field: Microsoft.Insights/metricAlerts/criteria.Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria.allOf[*].metricName
Path: properties.criteria.allOf[*].metricName
Current value: "FailedRequests"
Target value: "ComputeUnits"
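
For readers trying to interpret compliance details like these, here is a minimal Python sketch of how a deployIfNotExists existence condition is checked against related resources. It is a simplified model, not the Azure Policy engine: the `MetricAlert` class, the `evaluate_existence` helper, and the `agComputeUnits` alert name are hypothetical, and only the `metricName` clause is modeled. The portal reports the mismatch for the last evaluated resource only, so a "FailedRequests" versus target mismatch does not by itself identify which clause of the condition is at fault.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MetricAlert:
    """Simplified stand-in for a Microsoft.Insights/metricAlerts resource."""
    name: str
    # Values found at properties.criteria.allOf[*].metricName
    metric_names: List[str]


def matches(alert: MetricAlert, target_metric: str) -> bool:
    # An `equals` condition on an [*] alias has to hold for every array member.
    # The comparison is done case-insensitively here.
    return all(m.lower() == target_metric.lower() for m in alert.metric_names)


def evaluate_existence(
    related: List[MetricAlert], target_metric: str
) -> Tuple[bool, Optional[MetricAlert]]:
    """Return (compliant, last_evaluated_alert).

    The resource counts as compliant as soon as one related alert matches.
    If none match, the last alert that was checked is returned, similar to
    the "Last evaluated resource" entry in the compliance details.
    """
    last: Optional[MetricAlert] = None
    for alert in related:
        last = alert
        if matches(alert, target_metric):
            return True, alert
    return False, last


if __name__ == "__main__":
    # Hypothetical alerts in the same resource group as the gateway.
    alerts = [
        MetricAlert("agComputeUnits", ["ComputeUnits"]),
        MetricAlert("agFailedRequests", ["FailedRequests"]),
    ]
    compliant, last = evaluate_existence(alerts, "UnhealthyHostCount")
    print(compliant, last.name if last else None)
```

In this model, a resource group that has no alert with the target metric name is always reported against whichever alert happened to be checked last, which is consistent with both policies above pointing at the same agFailedRequests alert.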

arjenhuitema commented 1 month ago

Hi @Greg-Court,

Thanks for reporting the issue. I will review the issue together with issue #278 that you reported and get back to you.

Greg-Court commented 1 month ago

Here is the Terraform code for the app gateway in question (this might help reproduce the bug), with only the names changed:

resource "azurerm_public_ip" "appgw" {
  name                = "pip-appgw-shared-uks-01"
  resource_group_name = azurerm_resource_group.myapp.name
  location            = azurerm_resource_group.myapp.location
  sku                 = "Standard"
  allocation_method   = "Static"
  provider            = azurerm.myapp
  zones               = ["1", "2", "3"]
}

locals {
  backend_address_pool_name      = "${azurerm_virtual_network.myapp.name}-beap"
  frontend_port_name             = "${azurerm_virtual_network.myapp.name}-feport"
  frontend_ip_configuration_name = "${azurerm_virtual_network.myapp.name}-feip"
  http_setting_name              = "${azurerm_virtual_network.myapp.name}-be-htst"
  listener_name                  = "${azurerm_virtual_network.myapp.name}-httplstn"
  request_routing_rule_name      = "${azurerm_virtual_network.myapp.name}-rqrt"
  redirect_configuration_name    = "${azurerm_virtual_network.myapp.name}-rdrcfg"
}

resource "azurerm_application_gateway" "network" {
  name                = "el-appgw-shared-uks-01"
  resource_group_name = azurerm_resource_group.myapp.name
  location            = azurerm_resource_group.myapp.location
  provider            = azurerm.myapp

  zones = ["1", "2", "3"]

  sku {
    name = "WAF_v2"
    tier = "WAF_v2"
  }

  autoscale_configuration {
    min_capacity = 1
    max_capacity = 10
  }

  gateway_ip_configuration {
    name      = "appGatewayIpConfig"
    subnet_id = azurerm_subnet.myapp_appgw.id
  }

  frontend_port {
    name = local.frontend_port_name
    port = 80
  }

  frontend_ip_configuration {
    name                 = local.frontend_ip_configuration_name
    public_ip_address_id = azurerm_public_ip.appgw.id
  }

  backend_address_pool {
    name = local.backend_address_pool_name
  }

  backend_http_settings {
    name                  = local.http_setting_name
    cookie_based_affinity = "Disabled"
    path                  = "/path1/"
    port                  = 80
    protocol              = "Http"
    request_timeout       = 60
  }

  http_listener {
    name                           = local.listener_name
    frontend_ip_configuration_name = local.frontend_ip_configuration_name
    frontend_port_name             = local.frontend_port_name
    protocol                       = "Http"
  }

  request_routing_rule {
    name                       = local.request_routing_rule_name
    priority                   = 9
    rule_type                  = "Basic"
    http_listener_name         = local.listener_name
    backend_address_pool_name  = local.backend_address_pool_name
    backend_http_settings_name = local.http_setting_name
  }

  waf_configuration {
    enabled          = true
    firewall_mode    = "Prevention"
    rule_set_version = "3.2"
  }

  tags = var.default_tags
}

arjenhuitema commented 1 month ago

Hi @Greg-Court,

Spotted the problems and we've got the solutions up in our dev branch. Check it out:

For both instances, the error was within the resource provider path in the existence condition.

I'll post an update when we plan to merge these updates into the main branch.
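
The thread does not show the actual fix, so the following is only an illustrative Python sketch of how a wrong resource provider path in an existence condition could produce this symptom. It assumes the condition compares an alert's scopes with a resource ID built from its parts. The `resource_id` and `scope_matches` helpers, the subscription ID, the resource group name, and the deliberately wrong `Microsoft.Network/wrongType` path are all hypothetical. Only the gateway name and the `Microsoft.Network/applicationGateways` resource type are real.

```python
from typing import List


def resource_id(subscription_id: str, resource_group: str,
                provider_path: str, name: str) -> str:
    """Build an ARM-style resource ID from its parts."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/{provider_path}/{name}"
    )


def scope_matches(alert_scopes: List[str], expected_id: str) -> bool:
    # Every scope has to equal the expected ID, compared case-insensitively.
    return all(s.lower() == expected_id.lower() for s in alert_scopes)


if __name__ == "__main__":
    sub = "00000000-0000-0000-0000-000000000000"  # placeholder subscription
    rg = "rg-example"                              # placeholder resource group
    gw = "el-appgw-shared-uks-01"                  # gateway name from the thread

    # Scope as it would appear on a deployed alert for the gateway.
    actual_scope = resource_id(sub, rg, "Microsoft.Network/applicationGateways", gw)

    # Expected ID built with a hypothetical wrong provider path.
    wrong_expected = resource_id(sub, rg, "Microsoft.Network/wrongType", gw)
    right_expected = resource_id(sub, rg, "Microsoft.Network/applicationGateways", gw)

    print(scope_matches([actual_scope], wrong_expected))
    print(scope_matches([actual_scope], right_expected))
```

Under these assumptions, the alert can be deployed correctly by remediation and still never satisfy the condition, because the expected ID never equals the alert's real scope.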