Azure / azure-monitor-baseline-alerts

Azure Monitor Baseline Alerts
MIT License

[Question]: How to override the default alert threshold values for specific resource names? #272

Closed chaoscreater closed 6 hours ago

chaoscreater commented 1 month ago

Description

Let's say I want to change the memory threshold of a specific App Service Plan from the default 85% defined in the policy definition, to e.g. 95%. This threshold should only apply to just this one ASP resource. The other ASP resource would just use the 85%.

The problem is that the policy definition that configures this is part of a policy set (initiative), and the policy assignment is for this policy set. This means that I can't just go and edit the policy definition and create an assignment just for it. I need to modify the policy definition, the policy set and the assignment for the policy set. This becomes rather messy and cumbersome to manage.

What's the best practice for situations like this?

Brunoga-MS commented 1 month ago

Hello @chaoscreater, thanks for your question. Unfortunately, the concept of an override does not exist in Azure Monitor. The only possible workaround is to have resource-dedicated alerts, which conflicts with the 'at scale' approach. You can change the threshold in the parameter file and redeploy; because the threshold is part of the existence condition, the new value will be applied. But again, it will be applied to the whole alert scope (meaning all the resources, not specific ones).

Modifying the policy definition is possible, but it would require you to change the query so that it dynamically identifies the threshold based on resource name(s), provided that you have a naming convention or tagging in place. You can see an example of the dynamic threshold logic in the post Azure Monitor: Use Dynamic Thresholds in Log Alerts. Querying Azure Resource Graph in the same query will allow you to leverage tagging to identify the resource.
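
As a rough illustration of the per-resource threshold idea (this is not the KQL from the referenced post), the Python sketch below evaluates each metric record against a threshold chosen by resource-name prefix. The prefixes, values and record shape are hypothetical; only the 85% default comes from this thread.

```python
# Hypothetical naming convention: name prefix -> memory threshold (%).
THRESHOLD_BY_PREFIX = {
    "asp-dev-": 95.0,
    "asp-test-": 95.0,
}
DEFAULT_THRESHOLD = 85.0  # default mentioned in this thread


def threshold_for(resource_name: str) -> float:
    """Return the threshold for a resource, falling back to the default."""
    for prefix, threshold in THRESHOLD_BY_PREFIX.items():
        if resource_name.lower().startswith(prefix):
            return threshold
    return DEFAULT_THRESHOLD


def breaches(records):
    """Return names of resources whose value exceeds their own threshold."""
    return [r["name"] for r in records if r["memory_pct"] > threshold_for(r["name"])]


# Hypothetical sample data: same metric value, different outcome per resource.
records = [
    {"name": "asp-prod-web", "memory_pct": 90.0},
    {"name": "asp-dev-web", "memory_pct": 90.0},
    {"name": "asp-dev-api", "memory_pct": 97.0},
]
print(breaches(records))
```

The point of the sketch is that the threshold is looked up per record rather than fixed once for the whole alert, which is what lets a single query treat resources differently.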

Hope that helps.

Thanks, Bruno.

chaoscreater commented 1 month ago

@Brunoga-MS - thanks for your quick response.

Problem with updating the policy definition is that if MS publishes new updates, then we'll have to sync them to our repo and manually compare each file for individual changes, which isn't really ideal.

And yes, updating the parameter will update it for all resources in the scope, which we do not want.

In an ideal world, resources of the same type (e.g. App Service Plans) would all have the same alert threshold values, which would make baseline alerts that are deployed at scale seem like a good solution. But in reality, some resources hold less importance or priority compared to another resource of the same type. Problem is, we can't just exclude the policy assignment for resource XYZ, because that would mean all the other policy definitions relevant to that resource type would be excluded as well, since they belong to the same assignment.

I think it would have been better for MS to create policy assignment per policy definition, rather than having policy assignment created for a policy set. It would be much easier to exclude a resource from a policy assignment that is tied to one policy definition. We can then create a custom policy definition and assignment with settings unique to that one resource. I think this would be a clean approach, as the resource is excluded from just that one policy definition, while it still applies all the other policy definitions.

Brunoga-MS commented 1 month ago

@chaoscreater: Got your point. Exclusion is possible by tagging the resources to be excluded from remediation, as documented at Disabling Policies. If remediation has already completed, you have to tag the resource and delete the existing alert so it won't be recreated. At that point you can just assign the current policy definition using your custom values.

I would also ask @arjenhuitema to comment on your suggestion.

Hope that helps.

Thanks, Bruno.

chaoscreater commented 1 month ago

@Brunoga-MS - I think the tag you're referring to is MonitorDisable? But that would exclude remediation for multiple policy definitions for said resource. Let's say I want most of the AMBA baseline alerts for App Service Plan to apply to resource A, but I want a specific alert (e.g. App Service Plan memory) to be excluded. Well, applying the tag will work, but this would also exclude the CPU alert from being applied to the App Service Plan, since the policy definition for ASP CPU also references the same MonitorDisable tag. I would have to change multiple policy definitions to use a different tag for exclusion.
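
To make the side effect concrete, here is a minimal Python sketch of the exclusion check, assuming each policy applies unless the tag named by its monitorDisable parameter is set to "true". The per-alert tag names are hypothetical and not part of AMBA.

```python
def policy_applies(resource_tags: dict, disable_tag: str) -> bool:
    """Simplified model of the policy 'if' block: skip the resource when
    the disable tag is set to 'true' (compared case-insensitively)."""
    return resource_tags.get(disable_tag, "").lower() != "true"


def applied_alerts(resource_tags: dict, disable_tags: dict) -> list:
    """Return the alerts that would still be deployed for the resource."""
    return sorted(
        alert for alert, tag in disable_tags.items()
        if policy_applies(resource_tags, tag)
    )


# By default both ASP policies reference the same tag name.
shared = {"memory": "MonitorDisable", "cpu": "MonitorDisable"}
# Hypothetical per-alert tag names, one per policy definition.
per_alert = {"memory": "MonitorDisable_Memory", "cpu": "MonitorDisable_CPU"}

print(applied_alerts({"MonitorDisable": "true"}, shared))
print(applied_alerts({"MonitorDisable_Memory": "true"}, per_alert))
```

With the shared tag, excluding the memory alert also drops the CPU alert; with one tag per definition, only the memory alert is skipped, at the cost of editing every definition.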

Sometimes a resource needs a slightly different threshold for just one specific metric (e.g. CPU or memory) while keeping the rest of the metrics on the default AMBA baseline thresholds. I think there needs to be enough flexibility here, as the baseline recommended by MS is not always applicable to all customers, environments and situations. While I can modify the policy definitions, keeping the deltas up to date with changes pushed by Microsoft is going to be rather difficult and time-consuming to manage.

chaoscreater commented 1 month ago

For now, I got it working by using tags as you've suggested.

I modified the policy definition for the resource type in question, which in my case is the App Service Plan. I added logic to check whether the tag "Custom_Alert_Memory_Threshold" exists on the resource. If it does, the policy uses the value of that tag; otherwise it uses the default threshold value already defined in the policy definition.
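
The tag check described above can be sketched in Python as follows. This only mirrors the logic of the policy expression; the plan names are hypothetical, and the threshold is kept as a string because the policy parameter is of type String.

```python
OVERRIDE_TAG = "Custom_Alert_Memory_Threshold"  # tag name used in this thread


def effective_threshold(tags: dict, default: str = "85") -> str:
    """Use the tag value when the tag exists, otherwise the default threshold."""
    return tags[OVERRIDE_TAG] if OVERRIDE_TAG in tags else default


# Hypothetical App Service Plans: one tagged, one untagged.
plans = {
    "asp-a": {OVERRIDE_TAG: "95"},
    "asp-b": {},
}
for name, tags in plans.items():
    print(name, effective_threshold(tags))
```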

I didn't need to add any parameters in the assignment and didn't touch the policy set definition. All I needed to do was update the policy definition with a few simple lines, then add the tag to an existing resource with the value I want. The benefit of this approach is that if I run my script to delete all AMBA related resources (based on the deployed_by_amba tag), then I can always re-deploy them easily and my custom alert thresholds will still apply.

The other benefit is that if an alert was already deployed for a resource, then after I add the tag to that resource, the remediation task will update the threshold of the existing alert to the tag value. This means I don't need to go and delete the existing alert anymore!
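
A simplified Python model of why this works, assuming the existence condition compares the alert's threshold with the tag-or-default value and that remediation redeploys an alert with the same name. Only the threshold is modelled here; the real condition checks several other fields.

```python
from typing import Optional

OVERRIDE_TAG = "Custom_Alert_Memory_Threshold"  # tag name used in this thread


def expected_threshold(tags: dict, default: str = "85") -> str:
    """Threshold the existence condition expects: tag value or default."""
    return tags.get(OVERRIDE_TAG, default)


def needs_remediation(existing_alert: Optional[dict], tags: dict) -> bool:
    """True when no alert exists or its threshold differs from the expected one."""
    if existing_alert is None:
        return True
    return str(existing_alert["threshold"]) != expected_threshold(tags)


def remediate(tags: dict) -> dict:
    """Model of the redeployment: the alert name is unchanged, so the
    existing alert is updated in place with the expected threshold."""
    return {"threshold": expected_threshold(tags)}


alert = {"threshold": "85"}      # alert deployed earlier with the default
tags = {OVERRIDE_TAG: "95"}      # tag added to the resource afterwards
if needs_remediation(alert, tags):
    alert = remediate(tags)
print(alert)
```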

In case anyone has a similar issue, here's my policy definition for reference: https://pastebin.com/9CvWEDx8

arjenhuitema commented 1 month ago

Hi @chaoscreater It's good to hear you've resolved your issue and appreciate you posting it here. Could you please include the code snippets here on this issue?

I think it would have been better for MS to create policy assignment per policy definition, rather than having policy assignment created for a policy set. It would be much easier to exclude a resource from a policy assignment that is tied to one policy definition. We can then create a custom policy definition and assignment with settings unique to that one resource. I think this would be a clean approach, as the resource is excluded from just that one policy definition, while it still applies all the other policy definitions.

That approach has several disadvantages, and in nearly all cases it isn't feasible due to a 200 assignment limit per scope, hence the necessity to group policies into initiatives. Please see Azure Policy Limits

SteveBurkettNZ commented 1 month ago

Yeah, our problem is that we'd like to set our non-prod resource thresholds differently to our production resource thresholds, as for instance, we're happy for our dev/test instances to use >85% of the max available RAM if it means we don't have to bump up a SKU which would double the cost.

They may go just over 90% usage every now and then and maybe run a bit slower/longer because of it, which is fine. But we'd still like to be alerted when that dev/test instance is reeeeally getting low on RAM so don't want to disable it completely.

Brunoga-MS commented 1 month ago

Hello @chaoscreater, any news about adding the code snippet here instead of the URL that contains it?

Thanks, Bruno.

chaoscreater commented 1 month ago

Hi @Brunoga-MS, here it is below:

{
  "$schema": "https://raw.githubusercontent.com/Azure/enterprise-azure-policy-as-code/main/Schemas/policy-definition-schema.json",
  "name": "Deploy_WSF_MemoryPercentage_Alert",
  "properties": {
    "displayName": "AMBA - Deploy App Service Plan Memory Percentage Alert",
    "description": "Policy to audit/deploy App Service Plan Memory Percentage  Alert",
    "mode": "All",
    "metadata": {
      "version": "1.1.0",
      "source": "https://github.com/Azure/azure-monitor-baseline-alerts/",
      "Category": "Web Services",
      "_deployed_by_amba": "True"
    },
    "parameters": {
      "evaluationFrequency": {
        "allowedValues": [
          "PT1M",
          "PT5M",
          "PT15M",
          "PT30M",
          "PT1H"
        ],
        "defaultValue": "PT5M",
        "metadata": {
          "description": "Evaluation frequency for the alert",
          "displayName": "Evaluation Frequency"
        },
        "type": "String"
      },
      "autoMitigate": {
        "allowedValues": [
          "true",
          "false"
        ],
        "defaultValue": "true",
        "metadata": {
          "description": "Auto Mitigate for the alert",
          "displayName": "Auto Mitigate"
        },
        "type": "String"
      },
      "windowSize": {
        "allowedValues": [
          "PT1M",
          "PT5M",
          "PT15M",
          "PT30M",
          "PT1H",
          "PT6H",
          "PT12H",
          "P1D"
        ],
        "defaultValue": "PT5M",
        "metadata": {
          "description": "Window size for the alert",
          "displayName": "Window Size"
        },
        "type": "String"
      },
      "enabled": {
        "allowedValues": [
          "true",
          "false"
        ],
        "defaultValue": "true",
        "metadata": {
          "description": "Alert state for the alert",
          "displayName": "Alert State"
        },
        "type": "String"
      },
      "severity": {
        "allowedValues": [
          "0",
          "1",
          "2",
          "3",
          "4"
        ],
        "defaultValue": "2",
        "metadata": {
          "description": "Severity of the Alert",
          "displayName": "Severity"
        },
        "type": "String"
      },
      "threshold": {
        "defaultValue": "85",
        "metadata": {
          "description": "Threshold for the alert",
          "displayName": "Threshold"
        },
        "type": "String"
      },
      "monitorDisable": {
        "defaultValue": "MonitorDisable",
        "metadata": {
          "description": "Tag name to disable monitoring resource. Set to true if monitoring should be disabled",
          "displayName": "Effect"
        },
        "type": "String"
      },
      "effect": {
        "allowedValues": [
          "deployIfNotExists",
          "disabled"
        ],
        "defaultValue": "deployIfNotExists",
        "metadata": {
          "description": "Effect of the policy",
          "displayName": "Effect"
        },
        "type": "String"
      }
    },
    "policyRule": {
      "if": {
        "allOf": [
          {
            "equals": "Microsoft.Web/serverfarms",
            "field": "type"
          },
          {
            "field": "[concat('tags[', parameters('MonitorDisable'), ']')]",
            "notEquals": "true"
          }
        ]
      },
      "then": {
        "effect": "[parameters('effect')]",
        "details": {
          "type": "Microsoft.Insights/metricAlerts",
          "roleDefinitionIds": [
            "/providers/Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c"
          ],
          "existenceCondition": {
            "allOf": [
              {
                "equals": "Microsoft.Web/serverfarms",
                "field": "Microsoft.Insights/metricAlerts/criteria.Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria.allOf[*].metricNamespace"
              },
              {
                "equals": "MemoryPercentage",
                "field": "Microsoft.Insights/metricAlerts/criteria.Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria.allOf[*].metricName"
              },
              {
                "equals": "[concat(subscription().id, '/resourceGroups/', resourceGroup().name, '/providers/Microsoft.Web/serverfarms/', field('fullName'))]",
                "field": "Microsoft.Insights/metricalerts/scopes[*]"
              },
              {
                "equals": "[parameters('enabled')]",
                "field": "Microsoft.Insights/metricAlerts/enabled"
              },
              {
                "equals": "[parameters('evaluationFrequency')]",
                "field": "Microsoft.Insights/metricAlerts/evaluationFrequency"
              },
              {
                "equals": "[parameters('windowSize')]",
                "field": "Microsoft.Insights/metricAlerts/windowSize"
              },
              {
                "equals": "[parameters('severity')]",
                "field": "Microsoft.Insights/metricalerts/severity"
              },
              {
                "equals": "[parameters('autoMitigate')]",
                "field": "Microsoft.Insights/metricAlerts/autoMitigate"
              },
              {
                "equals": "Average",
                "field": "Microsoft.Insights/metricAlerts/criteria.Microsoft-Azure-Monitor-SingleResourceMultipleMetricCriteria.allOf[*].timeAggregation"
              },
              {
                "equals": "GreaterThan",
                "field": "Microsoft.Insights/metricAlerts/criteria.Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria.allOf[*].StaticThresholdCriterion.operator"
              },
//              {
//                "equals": "[parameters('threshold')]",
//                "field": "Microsoft.Insights/metricAlerts/criteria.Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria.allOf[*].StaticThresholdCriterion.threshold"
//              },
              {
                "equals": "[if(contains(field('tags'), 'Custom_Alert_Memory_Threshold'), field('tags.Custom_Alert_Memory_Threshold'), parameters('threshold'))]",
                "field": "Microsoft.Insights/metricAlerts/criteria.Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria.allOf[*].StaticThresholdCriterion.threshold"
              }
            ]
          },
          "deployment": {
            "properties": {
              "parameters": {
                "resourceId": {
                  "value": "[field('id')]"
                },
                "evaluationFrequency": {
                  "value": "[parameters('evaluationFrequency')]"
                },
                "autoMitigate": {
                  "value": "[parameters('autoMitigate')]"
                },
                "windowSize": {
                  "value": "[parameters('windowSize')]"
                },
                "enabled": {
                  "value": "[parameters('enabled')]"
                },
                "severity": {
                  "value": "[parameters('severity')]"
                },
                "threshold": {
                  "value": "[if(contains(field('tags'), 'Custom_Alert_Memory_Threshold'), field('tags.Custom_Alert_Memory_Threshold'), parameters('threshold'))]"
                },
                "resourceName": {
                  "value": "[field('name')]"
                }
              },
              "template": {
                "parameters": {
                  "resourceId": {
                    "metadata": {
                      "description": "Resource ID of the resource emitting the metric that will be used for the comparison",
                      "displayName": "resourceId"
                    },
                    "type": "String"
                  },
                  "evaluationFrequency": {
                    "type": "String"
                  },
                  "autoMitigate": {
                    "type": "String"
                  },
                  "windowSize": {
                    "type": "String"
                  },
                  "enabled": {
                    "type": "String"
                  },
                  "severity": {
                    "type": "String"
                  },
                  "threshold": {
                    "type": "String"
                  },
                  "resourceName": {
                    "metadata": {
                      "description": "Name of the resource",
                      "displayName": "resourceName"
                    },
                    "type": "String"
                  }
                },
                "contentVersion": "1.0.0.0",
                "resources": [
                  {
                    "type": "Microsoft.Insights/metricAlerts",
                    "properties": {
                      "description": "Metric Alert for App Service Plan Memory Percentage",
                      "evaluationFrequency": "[parameters('evaluationFrequency')]",
                      "autoMitigate": "[parameters('autoMitigate')]",
                      "parameters": {
                        "evaluationFrequency": {
                          "value": "[parameters('evaluationFrequency')]"
                        },
                        "autoMitigate": {
                          "value": "[parameters('autoMitigate')]"
                        },
                        "windowSize": {
                          "value": "[parameters('windowSize')]"
                        },
                        "enabled": {
                          "value": "[parameters('enabled')]"
                        },
                        "severity": {
                          "value": "[parameters('severity')]"
                        },
                        "threshold": {
                          "value": "[parameters('threshold')]"
                        }
                      },
                      "windowSize": "[parameters('windowSize')]",
                      "enabled": "[parameters('enabled')]",
                      "severity": "[parameters('severity')]",
                      "criteria": {
                        "allOf": [
                          {
                            "threshold": "[parameters('threshold')]",
                            "timeAggregation": "Average",
                            "name": "MemoryPercentage",
                            "operator": "GreaterThan",
                            "metricNamespace": "Microsoft.Web/serverfarms",
                            "criterionType": "StaticThresholdCriterion",
                            "metricName": "MemoryPercentage"
                          }
                        ],
                        "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria"
                      },
                      "scopes": [
                        "[parameters('resourceId')]"
                      ]
                    },
                    "apiVersion": "2018-03-01",
                    "location": "global",
                    "name": "[concat('AMBA-', parameters('resourceName'), '-MemoryPercentage')]",
                    "tags": {
                      "_deployed_by_amba": true
                    }
                  }
                ],
                "variables": {},
                "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#"
              },
              "mode": "incremental"
            }
          }
        }
      }
    }
  }
}

Brunoga-MS commented 1 month ago

Hi @chaoscreater, thanks for sharing the code snippet. We were thinking about using the same approach, but we stopped because it is only applicable to those metric and log alerts that are created one per resource. In the case of VM alerts, for instance, where one alert covers more than one VM (or hybrid VM), the override would be applied to all resources, making it not an override but a new default value. However, we will investigate further to see if we can find something that works in those scenarios.

Thanks, Bruno.

chaoscreater commented 1 month ago

@Brunoga-MS - sorry, I'm not sure I follow. Could you give an example of a VM alert?

In our case, the metric alerts are what we're trying to set a custom threshold for. If I have 3 App Service Plans and I want custom memory thresholds of e.g. 75%, 80% and 95% for them, I can just set the same tag but with different values. And I'll just need to use one policy definition and policy assignment, no need to pass in parameters. The definition just reads the tag and the value and applies it.

For VM alerts, we might sometimes see e.g. VM deallocated, VM unavailable, or User Initiated Restart of VM. We can suppress these by creating a custom policy definition that uses an Alert Suppression Rule and specifies the names of the resources. But I think the tagging method could also work here, as the policy can check whether a tag with a specific key-value pair exists and then suppress the alert (or do whatever is needed).

Brunoga-MS commented 1 month ago

@chaoscreater, in AMBA-ALZ, metric alerts are created one per resource, so it is easier to override the threshold for a single resource using your tag method. Log-based alerts (or some of them, like the VM or HybridVM alerts) are created only once and apply to all resources in scope. If you look at the heartbeat alert, you will see only one alert, which looks at Heartbeat table records in a Log Analytics workspace. In this case a single alert configuration is applied to many resources, hence the override would be applied to many resources as well. Does this make sense?
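
The difference can be sketched in Python as below. This is a simplified model with hypothetical VM names, tag name and threshold values, not the actual AMBA alert logic.

```python
OVERRIDE_TAG = "Custom_Alert_Threshold"  # hypothetical tag name

# Hypothetical resources in scope; only vm-2 carries an override tag.
resources = {
    "vm-1": {},
    "vm-2": {OVERRIDE_TAG: "95"},
    "vm-3": {},
}


def per_resource_alerts(resources: dict, default: str = "85") -> dict:
    """One alert per resource: each alert carries its own threshold."""
    return {name: tags.get(OVERRIDE_TAG, default) for name, tags in resources.items()}


def shared_alert(resources: dict, threshold: str) -> dict:
    """One alert for the whole scope: a single threshold covers every resource."""
    return {name: threshold for name in resources}


print(per_resource_alerts(resources))
print(shared_alert(resources, "95"))
```

In the per-resource case only vm-2 gets the custom value. In the shared case, changing the one threshold changes it for every VM, which is the "new default value" effect described above.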

Thanks, Bruno.

chaoscreater commented 1 month ago

@Brunoga-MS - Ah yes, I get what you mean now.

I think in this case we may need to modify the policy definition to exclude the resources that we don't want it to apply to. Perhaps we could change the tag "MonitorDisable" to "VM_Heartbeat_Alert_Disable" or something like that, so that this tag is applicable only to this policy definition. Then we can apply this tag to the VM resources that we want to exclude. Finally, we create a custom policy definition for VM heartbeat and apply it to the excluded VMs.

The other policy definitions that are part of the initiative (policy set) will still apply to this VM. But this approach would mean we may potentially have lots of tags to manage :(