Add "wait" and "retry" deployment options

rshariy commented 3 years ago

ARM template deployment often fails with errors like:

"Another operation is in progress on the selected item. If there is an in-progress operation, please retry after it has finished."

"BMSUserErrorObjectLocked","message":"Another operation is in progress on the selected item."

Just to clarify: this is not a dependency issue. An ARM deployment may fail if, for example, you try to add a VM to an RSV while another VM is being added at the same time. For a few seconds the RSV will not accept new clients, and as a result your deployment fails.

I would like to have an option to pause a deployment and/or retry it. Maybe introduce "wait" and "retry" deployment conditions, e.g.:

resource blob 'Microsoft.Storage/storageAccounts/blobServices/containers@2019-06-01' = {
    wait: 30
    retry: 5
    name: '${stg.name}/default/logs'
}

alex-frankel commented 3 years ago

Understood. This is something we have been considering, but haven't scheduled the work yet. If you (or others) have other examples that you have run into, it would be great to capture those here.

I know RBAC replication (and replication delays in general) are another place where something like this would be helpful.

anthony-c-martin commented 3 years ago

I know RBAC replication (and replication delays in general) are another place where something like this would be helpful.

@alex-frankel I'm assuming this is something we're planning on also addressing in the underlying platform? This feels like a leaky abstraction, not something that the end-user should have to deal with by adding delays.

alex-frankel commented 3 years ago

This feels like a leaky abstraction, not something that the end-user should have to deal with by adding delays.

Agreed. @bmoore-msft and I were also discussing this yesterday. Ideally, ARM will co-locate all the calls end-to-end so a user never has to think about this. Not sure if/when that will be possible, and this may be a necessary evil in the meantime.

bmoore-msft commented 3 years ago

The OP doesn't sound like replication (it feels like concurrency), though I could see that you could potentially address both with something like retry. The problem in this case (or really either case) is indefinite postponement. This feels like a problem with the RP: common operations returning frequent 400s instead of maybe 429.

The challenge with this workaround is that not only does the user have to fail and then implement a non-deterministic workaround (that's expensive on the service), it will also mask problems across ARM, RPs and user code.

@rshariy - have you raised this issue with the RSV team? It doesn't appear to be an uncommon problem and seems like it should be addressed by the RSV... either it shouldn't happen or we're not helping customers figure out how to effectively use RSV.

rshariy commented 3 years ago

@bmoore-msft I raised a similar issue with the Azure Firewall product team about a year ago. The only solution we found is to use a PowerShell function to check the Azure FW status (making sure it is not "updating") before kicking off a new ARM deployment to the FW.

Just logged ticket 120120226003381 about the RSV issue - let's see what MS support will come up with.

alex-frankel commented 3 years ago

it will mask problems across ARM, RPs and user code.

This point is what gives us caution about implementing something like this. We have some potential solutions to deal with the replication delay in particular that we will explore before introducing a wait.

@rshariy - please let us know the resolution of the case.

Agazoth commented 3 years ago

I have a main template that looks like this:

module kv 'keyvault.bicep' = {
  name: 'kvSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    enableSoftDelete: false
  }
}

module kvaccpol 'keyvaultaccesspolicy.bicep' = {
  name: 'kvAccPolSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    action: 'add'
    objectId: objectId
    access: keyVaultAccessPolicyAccess
  }
}

When that runs, the deployment breaks with:

{
   "error": {
     "code": "ParentResourceNotFound",
     "message": "Can not perform requested operation on nested resource. Parent resource 'kv-kvaccpoltest' not found."
   }
} (Code:NotFound)

Running the deployment again deploys the policy.
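
One possible reading of this failure, offered as an assumption rather than a diagnosis: nothing in the main template links the two modules, so they may be deployed in parallel and the access policy can reach the vault before it exists. A minimal sketch that makes the ordering explicit with dependsOn, reusing the module and parameter names from the template above (the targetScope, the parameter declarations and the rg resource are not shown in the post and are added here only so the file stands alone):

```bicep
// Sketch only. Everything declared before the modules is an assumption added
// so the file stands alone; the original post does not show these lines.
targetScope = 'subscription'

param resourceGroupName string // hypothetical parameter
param keyVaultName string
param objectId string
param keyVaultAccessPolicyAccess object // type assumed; not shown in the post

resource rg 'Microsoft.Resources/resourceGroups@2021-04-01' existing = {
  name: resourceGroupName
}

module kv 'keyvault.bicep' = {
  name: 'kvSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    enableSoftDelete: false
  }
}

module kvaccpol 'keyvaultaccesspolicy.bicep' = {
  name: 'kvAccPolSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    action: 'add'
    objectId: objectId
    access: keyVaultAccessPolicyAccess
  }
  // Explicit ordering: do not start the access policy module until the
  // Key Vault module has completed.
  dependsOn: [
    kv
  ]
}
```

If keyvault.bicep exposed the vault name as an output, passing that output into kvaccpol instead of the raw parameter would create the same ordering implicitly. If the error persists with the dependency in place, that would point to something other than module ordering.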

eja-git commented 3 years ago

I ran into a scenario where I'd like a wait, not much code to show, basically deploying a FunctionApp, then want to output the default key for use in Api Management. The problem is the function app takes some time to spin up before the app keys are present...

resource functionApp 'Microsoft.Web/sites@2020-06-01' = {
  name: functionAppName
  location: location
  kind: 'functionapp'
...

output functionappdefaultkey string = listKeys('${functionApp.id}/host/default', functionApp.apiVersion).functionKeys.default

Workaround is to run the initial deployment of the function app twice.

bmoore-msft commented 3 years ago

@eja-git this isn't a "wait" scenario, it's a bug in the deployment engine job scheduling... the listKeys job is scheduled too early... so that's the fix for your particular scenario.
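
Until the scheduling is fixed, one pattern that may help is to move the listKeys call into its own module, so that it is only evaluated once the module itself is scheduled. This is a sketch under that assumption and is not confirmed anywhere in this thread; the file name functionKey.bicep is hypothetical.

```bicep
// functionKey.bicep (hypothetical file name)
// Looks up the default host key of a function app that is expected to exist
// by the time this module is scheduled.
param functionAppName string

resource functionApp 'Microsoft.Web/sites@2020-06-01' existing = {
  name: functionAppName
}

// Same expression as in the post above. Returning a key as an output exposes
// it in the deployment history, so treat this as illustrative only.
output functionappdefaultkey string = listKeys('${functionApp.id}/host/default', functionApp.apiVersion).functionKeys.default
```

In the main template, the module would be declared with functionAppName set to functionApp.name. The symbolic reference gives the module an implicit dependency on the function app resource, so no extra dependsOn should be needed.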

Pietervanhove commented 3 years ago

Hi,

I've logged the following issue https://github.com/projectkudu/kudu/issues/3312#issuecomment-870741730 that could also benefit from the wait option during a deployment.

Best Regards Pieter

azMantas commented 3 years ago

I am trying to simplify firewall rule collection deployment by using loadTextContent and then looping over each entry. Each workload-x.json file contains all the properties for one rule collection.

var workloads = [
  json(loadTextContent('./workload-1.json'))
  json(loadTextContent('./workload-2.json'))
  json(loadTextContent('./workload-3.json'))
]

resource afwPolicy 'Microsoft.Network/firewallPolicies@2021-02-01' existing = {
  name: 'bicepRules'
}

resource collectionGroups 'Microsoft.Network/firewallPolicies/ruleCollectionGroups@2021-02-01' = [for workload in workloads: {
  name: workload.name
  parent: afwPolicy
  properties: workload.properties
}]

here is the error I get

Rule Collection Group workload-2 can not be updated because Parent Firewall Policy bicepRules is in Updating state from previous operation

I am sure that a short delay between deployments would let us loop through the whole array.

SenthuranSivananthan commented 3 years ago

Only one Rule Collection Group can be updated at a time with Azure Firewall Policy. Since the update refreshes all of the connected Azure Firewall instances, the amount of time it takes to update is non-deterministic. Therefore you will need to serialize the deployment using the batchSize decorator.

Can you try:

@batchSize(1)
resource collectionGroups 'Microsoft.Network/firewallPolicies/ruleCollectionGroups@2021-02-01' = [for workload in workloads: {
  name: workload.name
  parent: afwPolicy
  properties: workload.properties
}]

SQLDBAWithABeard commented 3 years ago

I have two scenarios that come to mind from recent experience.

An overarching enterprise management-level policy is applied to a newly created resource that I reference in the next resource/module, which causes the "Another operation" error. A retry would be useful here as I have no control or influence over the policies.

I have also faced situations where a newly created resource is not available when referenced immediately afterwards which I assume is a replication/caching issue as the next run works flawlessly.

wsucoug69 commented 3 years ago

My scenario includes creating a Cosmos Account, this typically takes a few minutes and sometimes up to 10 minutes. In this case I am unable to use the resource output to set the connection string for use in subsequent modules e.g. passing into keyVault and functionAppSettings

alex-frankel commented 3 years ago

My scenario includes creating a Cosmos Account, this typically takes a few minutes and sometimes up to 10 minutes.

@markjbrown - do you mind taking a look at this one? I'd expect the Cosmos Account not to report complete until it is fully provisioned. @zapadoody -- do you happen to have the code sample of the repro and a correlation ID when the error occurred?

markjbrown commented 3 years ago

For run-time deployment errors you should raise a support ticket as they are best equipped to diagnose specific errors with an activity id.

However, I am happy to look at an existing Bicep file to see if there are any issues.

I do have a sample on how to output the endpoint and key from a Cosmos account and input into appSettings for an App Service here if that helps.

https://github.com/Azure/azure-quickstart-templates/blob/master/quickstarts/microsoft.documentdb/cosmosdb-webapp/main.bicep

wsucoug69 commented 3 years ago

here's my cosmosAccount.bicep

param location string
param cosmosAccountName string
param cosmosDefaultConsistencyPolicy string 
param cosmosPrimaryRegion string
param cosmosSecondaryRegion string

var lowerCosmosAcctName = toLower(cosmosAccountName)
var locations = [
  {
    locationName: cosmosPrimaryRegion
    failoverPriority: 0
    isZoneRedundant: false
  }
  {
    locationName: cosmosSecondaryRegion
    failoverPriority: 1
    isZoneRedundant: false
  }
]

resource cosmosAccountResource 'Microsoft.DocumentDB/databaseAccounts@2021-06-15' = {
  name: lowerCosmosAcctName
  kind: 'GlobalDocumentDB'
  location: location
  properties: {
    locations: locations
    databaseAccountOfferType: 'Standard'
    enableAutomaticFailover: true
    consistencyPolicy: {
      defaultConsistencyLevel: cosmosDefaultConsistencyPolicy
    }
  }
}

output cosmosAccountResourceName string = cosmosAccountResource.name

here's the KeyVault.bicep

param location string 
param keyVaultName string
param productionPrincipalId string
param productionTenantId string
param stagingPrincipalId string
param stagingTenantId string

@secure()
param cosmosPrimaryConnectionString string

@secure()
param cosmosSecondaryConnectionString string

@secure()
param serviceStorageConnectionString string

@secure()
param appStorageConnectionString string

resource keyVault 'Microsoft.KeyVault/vaults@2019-09-01' = {
  name: keyVaultName
  location: location
  properties: {
    enabledForDeployment: true
    enabledForTemplateDeployment: true
    enabledForDiskEncryption: true
    tenantId: productionTenantId
    accessPolicies: [
      {
        tenantId: productionTenantId
        objectId: productionPrincipalId
        permissions: {
          secrets: [
            'get'
            'list'
          ]
        }
      }
      {
        tenantId: stagingTenantId
        objectId: stagingPrincipalId
        permissions: {
          secrets: [
            'get'
            'list'
          ]
        }
      }
    ]
    sku: {
      name: 'standard'
      family: 'A'
    }
  }  
}

resource cosmosPrimaryConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/cosmosPrimaryConnectionString'
  properties: {
    value: cosmosPrimaryConnectionString
  }
  dependsOn:[
    keyVault
  ]
}

resource cosmosSecondaryConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/cosmosSecondaryConnectionString'
  properties: {
    value: cosmosSecondaryConnectionString
  }
  dependsOn:[
    keyVault
  ]
}

resource serviceStorageConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/dbConnectionString'
  properties: {
    value: serviceStorageConnectionString
  }
  dependsOn:[
    keyVault
  ]
}

resource appStorageConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/appStorageConnectionString'
  properties: {
    value: appStorageConnectionString
  }
  dependsOn:[
    keyVault
  ]
}

output appStorageConnectionStringUri string = appStorageConnectionStringSecret.properties.secretUri
output serviceStorageConnectionStringUri string = serviceStorageConnectionStringSecret.properties.secretUri
output cosmosPrimaryConnectionStringUri string = cosmosPrimaryConnectionStringSecret.properties.secretUri
output cosmosSecondaryConnectionStringUri string = cosmosSecondaryConnectionStringSecret.properties.secretUri

and here's the main.bicep

/// cosmos db account, database and container module
module cosmosAccountMod '../cosmosAccount.bicep' = {
  name: 'cosmosAccount-${environmentName}-${buildNumber}'
  params: {
    cosmosAccountName: cosmosAccountName
    cosmosDefaultConsistencyPolicy: cosmosDefaultConsistencyPolicy
    cosmosPrimaryRegion: cosmosPrimaryRegion
    cosmosSecondaryRegion: cosmosSecondaryRegion
    location: location
  }
}

module cosmosDatabaseMod '../cosmosDbContainer.bicep' = {
  name: 'cosmosDBContainer-${environmentName}-${buildNumber}'
  params: {
    cosmosAccountName: cosmosAccountMod.outputs.cosmosAccountResourceName
    cosmosContainerName: cosmosContainerName
    cosmosDatabaseName: cosmosDatabaseName
    cosmosThroughput: cosmosThroughput
  }
  dependsOn: [
    cosmosAccountMod
  ]
}

// storage account module - storage for the tenants application 
module appStorageAccountMod '../storageAccount.bicep' = {
  name: 'appStorageAcctName-${environmentName}-${buildNumber}'
  params: {
    storageAcctName: appStorageAcctName
    storageSkuName: appStorageAcctSku
    location: location
  }
}

// app insights module
module appInsightsMod '../appInsights.bicep' = {
  name: 'appInsightsName-${environmentName}-${buildNumber}'
  params: {
    name: appInsightsName
    resourceGroupLocation: location
  }
}

// app service plan module
module appServicePlanMod '../appServicePlan.bicep' = {
  name: 'appServicePlan-${environmentName}-${buildNumber}'
  params: {
    appSvcPlanSku: appSvcPlanSku
    appSvcPlanTier: appSvcPlanTier
    appSvcPlanName: appSvcPlanName
    appPlanLocation: location
  }
}

// function app module
module functionAppMod '../functionApp.bicep' = {
  name: 'functionApp-${environmentName}-${buildNumber}'
  params: {
    appSvcPlanName: appSvcPlanName
    functionAppName: functionAppName
    location: location
  }
  dependsOn: [
    appStorageAccountMod
    appServicePlanMod
    cosmosAccountMod
  ]
}

// service storage account module - storage for the function app 
module serviceStorageAccountMod '../storageAccount.bicep' = {
  name: 'serviceStorageAcctName-${environmentName}-${buildNumber}'
  params: {
    storageAcctName: serviceStorageAcctName
    storageSkuName: serviceStorageAcctSku
    location: location
  }
}

// key vault module
module keyVaultMod '../keyVault.bicep' = {
  name: 'keyVaultName-${environmentName}-${buildNumber}'
  params: {
    keyVaultName: keyVaultName
    location: location
    cosmosPrimaryConnectionString: listConnectionStrings(resourceId('Microsoft.DocumentDB/databaseAccounts', cosmosAccountName), '2020-04-01').connectionStrings[0].connectionString
    cosmosSecondaryConnectionString: listConnectionStrings(resourceId('Microsoft.DocumentDB/databaseAccounts', cosmosAccountName), '2020-04-01').connectionStrings[1].connectionString
    productionPrincipalId: functionAppMod.outputs.productionPrincipalId
    productionTenantId: functionAppMod.outputs.productionTenantId
    stagingPrincipalId: functionAppMod.outputs.stagingPrincipalId
    stagingTenantId: functionAppMod.outputs.stagingTenantId
    serviceStorageConnectionString: serviceStorageAccountMod.outputs.storageAccountConnectionString
    appStorageConnectionString: appStorageAccountMod.outputs.storageAccountConnectionString
  }
  dependsOn:[
    functionAppMod
    cosmosAccountMod
    cosmosDatabaseMod
  ]
}

// function app settings module
module functionAppSettingMod '../functionAppSettings.bicep' = {
  name: 'functionAppSettings-${environmentName}-${buildNumber}'
  params: {
    appInsightsKey: appInsightsMod.outputs.appInsightsKey
    cosmosConnectionStringUri: keyVaultMod.outputs.cosmosPrimaryConnectionStringUri
    appStorageConnectionStringUri: keyVaultMod.outputs.appStorageConnectionStringUri
    serviceStorageConnectionStringUri: keyVaultMod.outputs.serviceStorageConnectionStringUri
    functionAppName: functionAppMod.outputs.prodSlotFunctionAppName
    functionAppStagingName: functionAppMod.outputs.stagingSlotFunctionAppName
  }
  dependsOn:[
    functionAppMod
    appInsightsMod
    cosmosAccountMod
    keyVaultMod
  ]
}

wsucoug69 commented 3 years ago

Also, to clarify: previously I was using the output in cosmosAccount.bicep, but I changed to the query approach to try and get away from the error. Thanks for the tip on raising the support ticket.

wsucoug69 commented 3 years ago

For run-time deployment errors you should raise a support ticket as they are best equipped to diagnose specific errors with an activity id.

However, I am happy to look at an existing Bicep file to see if there are any issues.

I do have a sample on how to output the endpoint and key from a Cosmos account and input into appSettings for an App Service here if that helps.

https://github.com/Azure/azure-quickstart-templates/blob/master/quickstarts/microsoft.documentdb/cosmosdb-webapp/main.bicep

@alex-frankel Can you take a look at that? It seems the dependsOn is being fulfilled with the ack of the started and/or accepted responses rather than succeeded

wsucoug69 commented 3 years ago

My scenario includes creating a Cosmos Account, this typically takes a few minutes and sometimes up to 10 minutes.

@markjbrown - do you mind taking a look at this one? I'd expect the Cosmos Account not to report complete until it is fully provisioned. @zapadoody -- do you happen to have the code sample of the repro and a correlation ID when the error occurred?

@alex-frankel any thoughts on the bicep here? Also I have opened a support case for this if you need that ref # let me know and I can send direct.

markjbrown commented 3 years ago

The problem is this listConnectionStrings function. I've never seen it before. I tried testing in an ARM template and it doesn't work (not sure why the template didn't fail validation).

If you want to output the endpoint and keys, use the syntax below. To turn them into a connection string, just concat them together with "AccountEndpoint=" and ";AccountKey=".

"[reference(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName'))).documentEndpoint]"
"[listKeys(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName')), '2021-04-15').primaryMasterKey]"

wsucoug69 commented 2 years ago

The problem is this listConnectionStrings function. I've never seen it before. I tried testing in an ARM template and it doesn't work (not sure why the template didn't fail validation).

If you want to output the endpoint and keys, use the syntax below. To turn them into a connection string, just concat them together with "AccountEndpoint=" and ";AccountKey=".

"[reference(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName'))).documentEndpoint]"
"[listKeys(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName')), '2021-04-15').primaryMasterKey]"

@markjbrown apologies thank you for the assistance!!!
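
For anyone landing here with the same setup, the endpoint-and-key suggestion above can be sketched in Bicep roughly as follows. This is an illustration, not the fix that was actually applied: it assumes the file is deployed as a module after both the Cosmos account and the Key Vault exist, and it reuses the API versions and secret name from the files posted earlier.

```bicep
// Sketch only: builds the connection string from the account endpoint and
// primary key instead of calling listConnectionStrings.
param cosmosAccountName string
param keyVaultName string

// Both resources are assumed to be deployed before this file runs.
resource cosmosAccount 'Microsoft.DocumentDB/databaseAccounts@2021-06-15' existing = {
  name: toLower(cosmosAccountName)
}

resource keyVault 'Microsoft.KeyVault/vaults@2019-09-01' existing = {
  name: keyVaultName
}

resource cosmosPrimaryConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  parent: keyVault
  name: 'cosmosPrimaryConnectionString'
  properties: {
    // "AccountEndpoint=" + endpoint + ";AccountKey=" + key, as described above.
    value: 'AccountEndpoint=${cosmosAccount.properties.documentEndpoint};AccountKey=${cosmosAccount.listKeys().primaryMasterKey}'
  }
}

output cosmosPrimaryConnectionStringUri string = cosmosPrimaryConnectionStringSecret.properties.secretUri
```

Keeping the key lookup next to the secret means the key never has to travel through a module parameter or output; only the secret URI leaves the module.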

brwilkinson commented 2 years ago

@zapadoody did this resolve your issue now?

erwinkramer commented 2 years ago

I think the most obvious reason why we need this is when you assign a role to an identity with: Microsoft.Authorization/roleAssignments and then do something with the role and identity in the same template, like with Microsoft.Resources/deploymentScripts for instance, or using something from a keyvault which it just got permissions from. This is not really nice to work with right now as it's almost guaranteed to fail at the first deployment, when the permissions are not set yet.

azMantas commented 2 years ago

I think the most obvious reason why we need this is when you assign a role to an identity with: Microsoft.Authorization/roleAssignments and then do something with the role and identity in the same template, like with Microsoft.Resources/deploymentScripts for instance, or using something from a keyvault which it just got permissions from. This is not really nice to work with right now as it's almost guaranteed to fail at the first deployment, when the permissions are not set yet.

at the role assignment template, try to set principalType to ServicePrincipal. It works like a charm in my environment.

erwinkramer commented 2 years ago

I think the most obvious reason why we need this is when you assign a role to an identity with: Microsoft.Authorization/roleAssignments and then do something with the role and identity in the same template, like with Microsoft.Resources/deploymentScripts for instance, or using something from a keyvault which it just got permissions from. This is not really nice to work with right now as it's almost guaranteed to fail at the first deployment, when the permissions are not set yet.

at the role assignment template, try to set principalType to ServicePrincipal. It works like a charm in my environment.

Does that guarantee anything? Setting roles, even manually, does not guarantee instant assignment of a role; this is what Microsoft itself documents, see https://docs.microsoft.com/en-us/azure/role-based-access-control/troubleshooting#role-assignment-changes-are-not-being-detected. In worst cases it takes 30 minutes, and I've seen it take over 5 minutes myself. I'm not saying that you're wrong in your scenario, just saying that not all scenarios will be instant with RBAC assignments.

bmoore-msft commented 2 years ago

@erwinkramer is correct, there are 2 problems with replication in this RBAC scenario: 1) the MSI replicating through AAD/Azure so that a role can be assigned, and 2) the roleAssignment replicating through Azure so it takes effect.

The principalType property solves the first but not the second.
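
For reference, a role assignment with principalType set might look like the sketch below. The parameter names are hypothetical and the resource group scope is an assumption; as noted above, this only helps with the first replication problem and does not make the assignment take effect any sooner.

```bicep
// Sketch only: role assignment at resource group scope with principalType set.
param principalId string // object ID of the managed identity (hypothetical parameter)
param roleDefinitionId string // GUID of the role definition (hypothetical parameter)

resource roleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  // Deterministic name so that redeploying the template is idempotent.
  name: guid(resourceGroup().id, principalId, roleDefinitionId)
  properties: {
    principalId: principalId
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', roleDefinitionId)
    // Declares the principal as a service principal, which covers managed identities.
    principalType: 'ServicePrincipal'
  }
}
```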

In worst cases it takes 30 minutes, and I've seen it take over 5 minutes myself. I'm not saying that you're wrong in your scenario, just saying that not all scenarios will be instant with RBAC assignments.

This is the challenge with wait/retry in general... When do you know that you should, and how long do you wait for? We've talked about something like "wait until I can GET this resource" but that still has replication and fanout issues...

We understand the pain, and there are some workarounds (e.g. serial deployment of resources) - the current guidance from leadership is to solve the root cause.

RK6183 commented 2 years ago

For policy as well ... When you create an initiative definition then an initiative assignment > Error > Wait a bit between both > success

azMantas commented 2 years ago

For policy as well ... When you create an initiative definition then an initiative assignment > Error > Wait a bit between both > success

The Azure CLI 'wait' command may be used to wait until a resource is provisioned with the 'Succeeded' state:

az deployment mg create --name deploymentName
az deployment mg wait --name deploymentName --created --management-group-id mgmtName

milope commented 2 years ago

To add a comment here, I'm not sure why are we trying to find workarounds for a situation the resource provider should address. If the resource provider doesn't support concurrent operations, then serializing should be fine. However, if there's a situation like, resource A returns the operation as complete, but it's still doing something (e.g.: replication) then why is the Resource Provider signaling ARM that the operation is completed and ready for any other operation?

erwinkramer commented 2 years ago

However, if there's a situation like, resource A returns the operation as complete, but it's still doing something (e.g.: replication) then why is the Resource Provider signaling ARM that the operation is completed and ready for any other operation?

I'd rather be happy with seeing a completed status when it isn't yet fully replicated, than seeing my deployment being stalled when there is an outage in the datacenter which it is trying to replicate to. If they can just be sure to check resources in the same datacenter that it was deploying to, inside the same template, we wouldn't have this problem.

NickSpag commented 2 years ago

@eja-git this isn't a "wait" scenario, it's a bug in the deployment engine job scheduling... the listKeys job is scheduled too early... so that's the fix for your particular scenario.

is there any place where this bug is tracked? no matter what we do we can't seem to consistently get the host key of a function after creating the resource. there's a litany of intermittent failures.

milope commented 2 years ago

However, if there's a situation like, resource A returns the operation as complete, but it's still doing something (e.g.: replication) then why is the Resource Provider signaling ARM that the operation is completed and ready for any other operation?

I'd rather be happy with seeing a completed status when it isn't yet fully replicated, than seeing my deployment being stalled when there is an outage in the datacenter which it is trying to replicate to. If they can just be sure to check resources in the same datacenter that it was deploying to, inside the same template, we wouldn't have this problem.

I don't know how I feel about creating an entire implementation based off the use-case of an outage. Another good example is Azure Firewall policies. If it's really not ready for a concurrent operation, why return the operation as completed until it's truly ready for another operation?

bmoore-msft commented 2 years ago

situation like, resource A returns the operation as complete, but it's still doing something (e.g.: replication) then why is the Resource Provider signaling ARM that the operation is completed and ready for any other operation?

The resources themselves don't replicate (nor do they know they are being replicated) - this is in a "lower layer" in ARM. Under all of this we're talking about a physics problem... it's not like an outage problem, which ARM can deal with to a large extent. We have a globally distributed system in which every physical location needs to have instantaneous knowledge of every other location - since that will likely never be possible we're looking for a different approach.

@NickSpag - the listkeys issue isn't related to replication, the fix is in the scheduling, the workaround is higher in the thread.

milope commented 2 years ago

I mean, I agree replication is a bad example, but there are certainly resources (like SQL, Azure Firewall) where sending a PUT request immediately after the last one was finished will result in a conflict because a previous operation is still running. Adding a special resource with a deployment script type to delay execution and have subsequent resources dependsOn the deploymentScript sounds like a hacky way to approach this situation that can go wrong in many ways compared to just telling the resource provider "If you're not ready for another PUT request, you should stall the current one until you are". I thought that was the whole purpose behind the dependsOn situation.

alex-frankel commented 2 years ago

I thought that was the whole purpose behind the dependsOn situation.

DependsOn respects the status of what the resource provider tells us. If the resource tells us it's done deploying, then we move on. If it is not actually done when they say it is done, that is something that the resource provider should be fixing ASAP. They are violating the ARM resource provider contract and a support case should be able to get it resolved. I recognize that doesn't always happen, but if we supply a workaround, then the root cause is unlikely to get resolved.

bmoore-msft commented 2 years ago

@milope - I couldn't agree more. For that situation, since it's specific to each resourceType, would be good to call those out - i.e. which resources are behaving badly. We don't have a really good way to force the "fix" of each case across the hundreds of resourceTypes but it's becoming more visible so we'd like to start making traction on it.

Ppkd2021 commented 2 years ago

Facing a similar challenge: a resource deployment failed because an internal server error occurred. There should be a retry mechanism; otherwise there is no way to make the script robust, since we can get such errors from Azure frequently. When we deployed the same template manually, it worked.

NickSpag commented 2 years ago

will second @Ppkd2021 's use case. it's not uncommon to see intermittent failures on certain resources that clear with a redeploy. last week cognitive language and search services were trading off with a similar InternalServerError or TerminalState error, which were disrupting our dynamic environments built from PRs. i could see how from a correctness standpoint the feature could encourage poor design but it would be a real quality of life benefit from a practical perspective. appreciate the consideration!

mennolaan commented 2 years ago

I want to chime in as well. I'm creating a private endpoint and, after that, a privateDnsZones resource. Even with dependsOn, it still says it cannot find the private endpoint. My workaround is to deploy bogus resources that take roughly enough time for the private endpoint to finish, or to use YAML afterwards. I don't like creating extra resources as a wait mechanism and deleting them afterwards, and having to use YAML as a means of deploying resources outside of Bicep shows that Bicep has huge flaws that keep it from being workable. If a fresh deployment fails because dependsOn isn't enough, Bicep is not fully workable imho. In a year's time, when a colleague uses that pipeline for something else, they will waste time debugging a bug from Bicep.

What's the purpose of using Bicep, or any other language for that matter, if it's not capable of doing a fresh deployment of resources?

I hope this issue will be put on the todo list as it makes using bicep not really a satisfying solution.

bmoore-msft commented 2 years ago

@mennolaan - do you happen to have a template we could use to repro?

jmighion commented 2 years ago

I sometimes see errors like:

{
    "status": "Failed",
    "error": {
        "code": "RetryableError",
        "message": "A retryable error occurred.",

Which would be a sensible thing to say if we were provided with a programmatic way to retry. Is there anything we can provide to help get this old RFE prioritized?

ilhaan commented 2 years ago

I see the same retryable error occurred constantly for various resources as well.

tw3lveparsecs commented 2 years ago

i get the same error "BMSUserErrorObjectLocked","message":"Another operation is in progress on the selected item." when re-running the same deployment to add a VM to a backup policy.

here is my code for reference.


@description('Optional. Add existing Azure virtual machine(s) to backup policy.')
@metadata({
  resourceId: 'Azure virtual machine resource id.'
  backupPolicyName: 'Backup policy name.'
})
param addVmToBackupPolicy array = []

var vmBackupConfig = [for vm in addVmToBackupPolicy: {
  backupPolicyName: vm.backupPolicyName
  resourceId: vm.resourceId
  backupFabric: 'Azure'
  protectionContainer: 'iaasvmcontainer;iaasvmcontainerv2;${split(vm.resourceId, '/')[4]};${last(split(vm.resourceId, '/'))}'
  protectedItem: 'vm;iaasvmcontainerv2;${split(vm.resourceId, '/')[4]};${last(split(vm.resourceId, '/'))}'
}]

resource vaultProtectedItem 'Microsoft.RecoveryServices/vaults/backupFabrics/protectionContainers/protectedItems@2022-03-01' = [for vm in vmBackupConfig: {
  name: '${vault.name}/${vm.backupFabric}/${vm.protectionContainer}/${vm.protectedItem}'
  properties: {
    protectedItemType: 'Microsoft.Compute/virtualMachines'
    policyId: '${vault.id}/backupPolicies/${vm.backupPolicyName}'
    sourceResourceId: vm.resourceId
  }
}]

pavelrozenberg commented 2 years ago

Maybe some delay deployments in the template will work for you. Try following this article: ARM Templates - Resource Deployment Delay

tw3lveparsecs commented 2 years ago

@pavelrozenberg thanks for the suggestion but i think something else is at play here as it works the first time, it's just subsequent deployments/additions of VMs that result in the issue.

Ellestad1995 commented 2 years ago

I'm using an Azure Policy to register DNS records from Azure PaaS services when they are created with a private endpoint. That process takes around 8 minutes to complete.

So when a deployment creates a private endpoint together with a service that is dependent on the existence of the record in a private DNS zone, the deployment fails because it finishes more quickly than the policy takes to register the DNS record.

An approach would be to have some kind of function like "waitForExistenceOfResource" inside of a dependsOn. That way we could delay deployment of a resource until another entity has finished deploying.

An approach like this would need to time out after a certain time period and not check for the existence of the resource too often.

Example:

Deploying:

Automation Account with Private Endpoint and Hybrid Worker
VM with HybridWorkerExtension

When the extension is added to the VM, it starts to install and tries to resolve the AutomationAccountURL. But the policy that is responsible for adding the DNS records for the PE of the Automation Account hasn't completed, and therefore the extension fails.

northynorth commented 2 years ago

My use case is for SQL server deployment with auditing to Azure Monitor. To complete the configuration I need to set diagnostic settings on the master database, but this does not exist until some time after the SQL server resource deployment completes.

So even though the master database diagnostics have dependencies on the SQL server deployment, the template deployment would fail on first run with an error that the master database resource did not exist.

I have a workaround in place: an inline deployment script that just sets a Start-Sleep delay and is dependent on the SQL server, with the master database diagnostics then dependent on the deployment script. The script doesn't really need to do anything, to be honest; the time taken to run the script deployment is long enough for the master database to be created.
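
A workaround of this kind might look roughly like the sketch below. It is not the poster's actual script: the file name wait.bicep, the parameter names, the default delay and the PowerShell version are all assumptions made for illustration.

```bicep
// wait.bicep (hypothetical file name)
// Sketch only: a deployment script whose sole purpose is to take time.
param location string = resourceGroup().location
param waitSeconds int = 120 // assumed delay; tune it to the resource being waited on
param utcValue string = utcNow() // changes on every deployment so the script reruns

resource wait 'Microsoft.Resources/deploymentScripts@2020-10-01' = {
  name: 'wait-${waitSeconds}s'
  location: location
  kind: 'AzurePowerShell'
  properties: {
    azPowerShellVersion: '8.3' // assumed version; any supported version should do
    scriptContent: 'Start-Sleep -Seconds ${waitSeconds}'
    retentionInterval: 'PT1H'
    cleanupPreference: 'OnSuccess'
    forceUpdateTag: utcValue
  }
}
```

To wire it in, the module that uses this file would list the SQL server in its dependsOn, and the master database diagnostic settings would in turn list the module. The delay stays a guess, which is exactly the non-determinism raised earlier in the thread.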

A waitForExistenceOfResource function would be ideal in this case, although just being able to set a delay on the deployment of the resource after the initial dependency is met would likely work just as well.

Gordonby commented 2 years ago

Wait is available in the bicep registry as a workaround.

tw3lveparsecs commented 2 years ago

Wait is available in the bicep registry as a workaround.

@Gordonby that's good to know. Is there any way this can be used inside a loop? See my previous post for an example.

Azure / bicep

Add "wait" and "retry" deployment options #1013