Azure / bicep-types-az

Bicep type definitions for ARM resources
MIT License

[Microsoft.Insights/scheduledQueryRules] Simultaneous deployment of 66+ resources fails #1590

Open cedricbraekevelt opened 2 years ago

cedricbraekevelt commented 2 years ago

Bicep version Bicep CLI version 0.4.1124 (66c84c8ee5)

Describe the bug I'm creating Log Analytics scheduled query rules (Log Alerts V2) in a for loop. I've made a template for this and am providing the necessary parameters through a JSON data file, where all query rules are defined. There are 72 items in there. When I run the for loop, nothing happens. No deployment starts in Azure, I only get these errors:

To Reproduce Create 'Microsoft.Insights/scheduledQueryRules@2021-08-01' in a loop with more than 66 items and the deployment will fail. There are no parent loops or anything like that; this code is only being run ONCE.

Additional context

Log Alert V2 definition:

resource logAlerts 'Microsoft.Insights/scheduledQueryRules@2021-08-01' = [for logAlert in logAlertsArray: {
  name: logAlert.alertName
  location: resourceGroup().location
  kind: 'LogAlert'
  properties:{
    displayName: logAlert.alertName
    description: logAlert.alertDescription
    severity: logAlert.alertSeverity
    enabled: logAlert.isEnabled
    evaluationFrequency: logAlert.frequencyInMinutes
    windowSize: logAlert.timeWindowInMinutes
    autoMitigate: logAlert.autoMitigate
    criteria:{
      allOf:[
        {
          metricName: logAlert.alertName
          operator: logAlert.operator
          threshold: logAlert.threshold
          timeAggregation: logAlert.timeAggregation
          metricMeasureColumn: logAlert.metricMeasureColumn
          dimensions: [
            {
              name: logAlert.dimensionsName
              operator: logAlert.dimensionsOperator
              values: logAlert.dimensionsValues
            }
          ]
          query: logAlert.query
          failingPeriods:{
            numberOfEvaluationPeriods: logAlert.numberOfEvaluationPeriods
            minFailingPeriodsToAlert: logAlert.minFailingPeriodsToAlert
          }
        }
      ]
    }
    scopes:[
      loganalyticsworkspace.id
    ]
    actions:{
      actionGroups:[
        actiongroup.id
      ]
    }
  }
}]

One of my data items (of which I have 72)

        {
            "alertName": "IIS_Server_Services",
            "alertDescription": "IIS_Server_Service_Stopped",
            "query": "let Windows_Service_Names = dynamic(['IISADMIN','W3svc']); ConfigurationData | where SvcName in (Windows_Service_Names) and SvcState == \"Stopped\" and SvcStartupType  == \"Auto\"",
            "frequencyInMinutes": "PT5M",
            "timeWindowInMinutes": "PT5M",
            "operator": "GreaterThan",
            "threshold": 0,
            "alertSeverity": 2,
            "autoMitigate": true,
            "numberOfEvaluationPeriods": 1,
            "minFailingPeriodsToAlert": 1,
            "timeAggregation": "Count",
            "metricMeasureColumn": "",                    
            "dimensionsName": "Computer",
            "dimensionsOperator": "Include",
            "dimensionsValues" : ["*"],
            "isEnabled": false
        },
alex-frankel commented 2 years ago

This looks to be an issue with the resource provider not being able to handle the simultaneous request. I'd recommend opening a support case to have them look into it, but in the meantime, you can add a @batchSize() decorator to the resource to not do everything all at once like so:

@batchSize(20)
resource logAlerts 'Microsoft.Insights/scheduledQueryRules@2021-08-01' = [for logAlert in logAlertsArray: { ... } ]

Would be curious to see if that solves it.

slavizh commented 2 years ago

I have seen resources where they cannot handle parallel resource creation/updating beyond a certain number. In most cases these limits are not documented, and it is unlikely they will be fixed unless they are a blocking issue for quite a large number of customers. There are even resources that do not allow more than one resource of the same type to be created/updated at the same time. As always, the fix is what Alex recommends, and ideally those teams should document these limits either in the Azure limits document for all services or in the ARM/API docs for the RP.

Kaloszer commented 1 year ago

Having a similar issue while deploying 100+ analytic rules with a loop. Tried @batchSize(1) just to see if this helps, but unfortunately it didn't :(. @alex-frankel

Leaving a note as I'm investigating whether this is my doing somewhere in the code during preprocessing of the input data, but for now I wish I had more info in the error message.

EDIT: By shrinking the input to only 10 records I was able to get a deployment to go through (well, fail with an actual error message), so I guess batchSize is not going to help here. One workaround would be to split the deployment into smaller parts, which is not optimal for IaC.

anthony-c-martin commented 1 year ago

Updated the title to make this more discoverable

brwilkinson commented 1 year ago

@Kaloszer had shared the document below, which states a limit of 50 per deployment.

49 (due to AR limitation of max 50 at a time! NB!: https://learn.microsoft.com/en-us/azure/sentinel/import-export-analytics-rules#:~:text=You%20can%20import%20up%20to%2050%20analytics%20rules%20from%20a%20single%20ARM%20template%20file.)

See at the bottom:


There may be a way to use a module to work around the limit: a module is equal to a deployment each time it is called, so you can pass in < 50 items on each iteration of the module.

brwilkinson commented 1 year ago

maybe defer to @anthony-c-martin on a better way to slice() this array into smaller chunks or to create a better lambda?

I think my math works on this one, as below.


param maxSubListSize int = 3
var list = [
  {
    name: 'Evie'
    age: 5
    interests: ['Ball', 'Frisbee']
  }
  {
    name: 'Casper'
    age: 3
    interests: ['Other dogs']
  }
  {
    name: 'Indy'
    age: 2
    interests: ['Butter']
  }
  {
    name: 'Kira'
    age: 8
    interests: ['Rubs']
  }
  {
    name: 'IndyDad'
    age: 10
    interests: ['Butter']
  }
  {
    name: 'KiraMum'
    age: 12
    interests: ['Rubs']
  }
]

var listLength = length(list)
var startIndexes = filter(range(0,listLength), item => item % maxSubListSize == 0)
var chunks = [for item in startIndexes: listLength >= item + maxSubListSize ? range(item, maxSubListSize) : range(item, listLength % maxSubListSize)]

module group 'foo2.bicep' = [for (items, index) in chunks: {
  name: 'group-${index}'
  params: {
    myArray: [for (item, index) in items: list[item] ]
  }
}]

output startIndexes array = startIndexes
output chunks array = chunks
output mychunks array = [for (items, index) in chunks: group[index].outputs.myArray]

foo2.bicep

param myArray array
output myArray array = myArray
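The chunking expressions above can be sanity-checked outside Bicep. Here is a Python sketch (a hypothetical helper, not part of the deployment) that mirrors `startIndexes` and `chunks`; note that Bicep's `range(start, count)` takes a count, while Python's `range(start, stop)` takes an end index:

```python
# Python mirror of the Bicep chunking expressions, used to
# sanity-check the modulo math (not part of the deployment itself).
def chunk_indexes(list_length: int, max_sub_list_size: int) -> list[list[int]]:
    # startIndexes: every index divisible by the chunk size
    start_indexes = [i for i in range(list_length) if i % max_sub_list_size == 0]
    # chunks: a full-size slice when one fits, otherwise the remainder.
    # Bicep's range(start, count) takes a count, so Python needs start + count.
    return [
        list(range(s, s + max_sub_list_size))
        if list_length >= s + max_sub_list_size
        else list(range(s, s + list_length % max_sub_list_size))
        for s in start_indexes
    ]

print(chunk_indexes(6, 3))         # [[0, 1, 2], [3, 4, 5]] -- the 6-item list above
seven = chunk_indexes(320, 50)     # 320 rules in batches of 50
print(len(seven), len(seven[-1]))  # 7 20 -- six full chunks plus a final chunk of 20
```

For the 6-item list with `maxSubListSize 3` this yields two even chunks, so the math does work out; the 320/50 case matches the chunk output shown further down in the thread.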
brwilkinson commented 1 year ago

I tried this out with up to 320 rules at once and saw some 429s, aka throttling; however, they were retried within the deployment, so I couldn't get it to fail.

Adding sample code anyway to split the deployments into batches of 50, which is the documented limit.

main.bicep

param maxSubListSize int = 50
var ruleCount = 320
var ruleNameBase = 'testRule'

var ruleDefaultsTest = {
  location: 'eastus'
  alertDescription: 'New alert created via template'
  alertSeverity: 3
  isEnabled: true
  resourceId: resourceGroup().id
  query: 'AzureActivity | where OperationName == "Validate Deployment" | where Level == "Error"'
  metricMeasureColumn: 'AggregatedValue'
  operator: 'GreaterThan'
  threshold: '25'
  timeAggregation: 'Count'
}
var rules = [for (item, index) in range(1, ruleCount): union({alertName: '${ruleNameBase}${item}'},ruleDefaultsTest)]

var listLength = length(rules)
var startIndexes = filter(range(0,listLength), item => item % maxSubListSize == 0)
var chunks = [for item in startIndexes: listLength >= item + maxSubListSize ? range(item, maxSubListSize) : range(item, listLength % maxSubListSize)]

module group 'scheduledQuery.bicep' = [for (items, index) in chunks: {
  name: 'group-${index}'
  params: {
    alerts: [for (item, index) in items: rules[item] ]
  }
}]

output startIndexes array = startIndexes
output chunks array = chunks
// output mychunks array = [for (items, index) in chunks: group[index].outputs.alerts]

// output TestRules array = rules
output TestRulesLength int = length(rules)

scheduledQuery.bicep

param alerts array

// defaults
param autoMitigate bool = false
param checkWorkspaceAlertsStorageConfigured bool = false
param resourceIdColumn string = 'id'
param numberOfEvaluationPeriods int = 1
param minFailingPeriodsToAlert int = 1
param windowSize string = 'PT1H'
param evaluationFrequency string = 'PT5M'
param muteActionsDuration string = 'PT5M'

resource queryRule 'Microsoft.Insights/scheduledQueryRules@2021-08-01' = [for alert in alerts : {
  name: alert.alertName
  location: alert.location
  tags: {}
  properties: {
    description: alert.alertDescription
    severity: alert.alertSeverity
    enabled: alert.isEnabled
    scopes: [
      alert.resourceId
    ]
    evaluationFrequency: evaluationFrequency
    windowSize: windowSize
    criteria: {
      allOf: [
        {
          query: alert.query
          // metricMeasureColumn: alert.metricMeasureColumn
          // resourceIdColumn: resourceIdColumn
          dimensions: []
          operator: alert.operator
          threshold: alert.threshold
          timeAggregation: alert.timeAggregation
          failingPeriods: {
            numberOfEvaluationPeriods: numberOfEvaluationPeriods
            minFailingPeriodsToAlert: minFailingPeriodsToAlert
          }
        }
      ]
    }
    muteActionsDuration: muteActionsDuration
    autoMitigate: autoMitigate
    checkWorkspaceAlertsStorageConfigured: checkWorkspaceAlertsStorageConfigured
    actions: {
      actionGroups: [
        // actionGroupId
      ]
      customProperties: {
        key1: 'value1'
        key2: 'value2'
      }
    }
  }
}]

output alerts array = alerts

Example of deployment.


deployment chunks with 50 each..

[
  [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
    21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
    40, 41, 42, 43, 44, 45, 46, 47, 48, 49
  ],
  [
    50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
    69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87,
    88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99
  ],
  [
    100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114,
    115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
    130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
    145, 146, 147, 148, 149
  ],
  [
    150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
    165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
    180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
    195, 196, 197, 198, 199
  ],
  [
    200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214,
    215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229,
    230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244,
    245, 246, 247, 248, 249
  ],
  [
    250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264,
    265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279,
    280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294,
    295, 296, 297, 298, 299
  ],
  [
    300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314,
    315, 316, 317, 318, 319
  ]
]

@Bnetworx let me know if you are still interested in trying this, I know it was from January. Given the docs mention a max of 50, I am not sure if this is a bug; that said, it seems like they may have improved the SLA for this. Either way I would just add the batching of deployments, so you can easily adjust the batch size in the future or just leave it at 50 per deployment.

Kaloszer commented 1 year ago

@brwilkinson I've now checked again and was able to deploy 54 ARs without chunking; however, going higher than that, validation would time out and I'd be presented with:

New-AzResourceGroupDeployment: 13:57:37 - Error: Code=; Message=The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing.
New-AzResourceGroupDeployment: 13:57:37 - Error: Code=; Message=A task was canceled.
New-AzResourceGroupDeployment: 13:57:37 - Error: Code=; Message=A task was canceled.

After implementing the chunking you've provided I was able to run the deployment fully without cutting out any ARs, however I still got occasional rate limit errors:

Status Message: Rate limit of 200 per 30 seconds is exceeded (Code:BadRequest)
Status Message: Rate limit of 200 per 30 seconds is exceeded (Code:BadRequest)
Status Message: Rate limit of 200 per 30 seconds is exceeded (Code:BadRequest)
{
    "status": "Failed",
    "error": {
        "code": "BadRequest",
        "message": "Rate limit of 200 per 30 seconds is exceeded"
    }
}

These deployments did not attempt to retry and just died. Is there something I'm missing that I should include in my Bicep deployment file to make these retry?

To resolve this issue I added additional batching on top of the chunking over the group deployment, and this seems to have fixed it. 😂

@batchSize(1)
module group 'scheduledQuery.bicep' = [for (items, index) in chunks: {
  name: 'group-${index}'
  params: {
    alerts: [for (item, index) in items: rules[item] ]
  }
}]
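A rough way to see why serializing the module batches sidesteps the rate limit is a back-of-the-envelope check using the numbers quoted in this thread (not an exact model of how ARM counts requests):

```python
# Back-of-the-envelope check, using numbers quoted in this thread
# (not an exact model of ARM's throttling behavior).
RATE_LIMIT = 200   # "Rate limit of 200 per 30 seconds" from the error message
CHUNK_SIZE = 50    # per-template limit used to size the module batches
RULES = 320        # size of the test deployment above

deployments = -(-RULES // CHUNK_SIZE)  # ceiling division: 7 module deployments

# All modules in parallel: every rule PUT can land in the same 30 s window.
parallel_burst = RULES
# With @batchSize(1) the modules run one at a time, so at most one
# chunk's worth of PUTs is in flight per window.
serial_burst = CHUNK_SIZE

print(deployments)                  # 7
print(parallel_burst > RATE_LIMIT)  # True  -> can trip the limit
print(serial_burst > RATE_LIMIT)    # False -> stays under it
```

With those numbers, 320 near-simultaneous PUTs can exceed 200 in a 30-second window, while serialized 50-item batches stay under it, which matches what was observed here.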
brwilkinson commented 1 year ago

@Kaloszer glad it's working.

Not sure if this is a common requirement, to need an array slice(), or if there is a simpler syntax to cover this need?

Kaloszer commented 1 year ago

This is kind of a workaround for the @batchSize(1) handling not working for this particular case (I suppose there might be more, but I haven't found any that would be similar). In the end, more features never hurt, as long as they don't introduce more bugs :D.

On the other hand, shouldn't the root cause (rate limiting) here be fixed by the provider? Not sure whether that's the problem here.

brwilkinson commented 1 year ago

@Kaloszer hopefully when this feature moves out of preview they are able to scale to handle more throughput and remove the documented limit.


cedricbraekevelt commented 1 year ago

Hi @brwilkinson ,

Thank you for coming back to this issue; however, we are (sadly) no longer using Azure Monitor and as such are currently not running into such issues, I'm afraid. Other resources typically don't need to be deployed that many times (in our customer base, anyway).

brwilkinson commented 1 year ago

Thank you @Bnetworx and @Kaloszer for the follow up and feedback.