rshariy opened this issue 3 years ago
Understood. This is something we have been considering, but haven't scheduled the work yet. If you (or others) have other examples that you have run into, it would be great to capture those here.
I know RBAC replication (and replication delays in general) are another place where something like this would be helpful.
> I know RBAC replication (and replication delays in general) are another place where something like this would be helpful.
@alex-frankel I'm assuming this is something we're planning on also addressing in the underlying platform? This feels like a leaky abstraction, not something that the end-user should have to deal with by adding delays.
> This feels like a leaky abstraction, not something that the end-user should have to deal with by adding delays.
Agreed. @bmoore-msft and I were also discussing this yesterday. Ideally, ARM will co-locate all the calls end-to-end so a user never has to think about this. Not sure if/when that will be possible, and this may be a necessary evil in the meantime.
The OP doesn't sound like replication (it feels like concurrency), though I could see that you could potentially address both with something like retry. The problem in this case (in either case, really) is indefinite postponement. This feels like a problem with the RP - common operations returning frequent 400s instead of, say, 429.
The challenge with this workaround is that not only does the user have to fail and then implement a non-deterministic workaround (which is expensive on the service), it will also mask problems across ARM, RPs, and user code.
@rshariy - have you raised this issue with the RSV team? It doesn't appear to be an uncommon problem and seems like it should be addressed by the RSV... either it shouldn't happen or we're not helping customers figure out how to effectively use RSV.
@bmoore-msft I raised a similar issue with the Azure Firewall product team about a year ago - the only solution we found was to use a PowerShell function to check the Azure FW status (make sure it is not "updating") before kicking off a new ARM deployment to the FW.
Just logged ticket 120120226003381 about the RSV issue - let's see what MS support comes up with.
> it will also mask problems across ARM, RPs, and user code.
This point is what gives us caution about implementing something like this. We have some potential solutions for dealing with the replication delay in particular that we will explore before introducing a wait.
@rshariy - please let us know the resolution of the case.
I have a main template that looks like this:
```bicep
module kv 'keyvault.bicep' = {
  name: 'kvSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    enableSoftDelete: false
  }
}

module kvaccpol 'keyvaultaccesspolicy.bicep' = {
  name: 'kvAccPolSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    action: 'add'
    objectId: objectId
    access: keyVaultAccessPolicyAccess
  }
}
```
When that runs, the deployment breaks with:
```json
{
  "error": {
    "code": "ParentResourceNotFound",
    "message": "Can not perform requested operation on nested resource. Parent resource 'kv-kvaccpoltest' not found."
  }
}
```

(Code: NotFound)
Running the deployment again deploys the policy.
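For what it's worth, one thing to check: as written, the two modules declare no dependency on each other, so ARM may start the access-policy module before the vault exists. A minimal sketch (reusing the module names above) that forces the ordering:

```bicep
// Sketch: make the access-policy module wait for the vault module to finish.
// An explicit dependsOn (or referencing an output of the kv module) makes ARM
// schedule the two deployments sequentially instead of in parallel.
module kvaccpol 'keyvaultaccesspolicy.bicep' = {
  name: 'kvAccPolSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    action: 'add'
    objectId: objectId
    access: keyVaultAccessPolicyAccess
  }
  dependsOn: [
    kv // wait for the vault deployment to report complete
  ]
}
```

If the error persists even with the dependency in place, it may be the replication issue discussed elsewhere in this thread.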
I ran into a scenario where I'd like a wait. Not much code to show: basically, I deploy a FunctionApp, then want to output the default key for use in API Management. The problem is the function app takes some time to spin up before the app keys are present...
```bicep
resource functionApp 'Microsoft.Web/sites@2020-06-01' = {
  name: functionAppName
  location: location
  kind: 'functionapp'
  ...

output functionappdefaultkey string = listKeys('${functionApp.id}/host/default', functionApp.apiVersion).functionKeys.default
```
Workaround is to run the initial deployment of the function app twice.
@eja-git this isn't a "wait" scenario, it's a bug in the deployment engine job scheduling... the listKeys job is scheduled too early... so that's the fix for your particular scenario.
Hi,
I've logged the following issue https://github.com/projectkudu/kudu/issues/3312#issuecomment-870741730 that could also benefit from the wait option during a deployment.
Best regards, Pieter
I am trying to simplify firewall rule collection deployment by using loadTextContent and then looping over an array of variables. Each workload-x.json contains all the properties for a rule collection.
```bicep
var workloads = [
  json(loadTextContent('./workload-1.json'))
  json(loadTextContent('./workload-2.json'))
  json(loadTextContent('./workload-3.json'))
]

resource afwPolicy 'Microsoft.Network/firewallPolicies@2021-02-01' existing = {
  name: 'bicepRules'
}

resource collectionGroups 'Microsoft.Network/firewallPolicies/ruleCollectionGroups@2021-02-01' = [for workload in workloads: {
  name: workload.name
  parent: afwPolicy
  properties: workload.properties
}]
```
Here is the error I get:

> Rule Collection Group workload-2 can not be updated because Parent Firewall Policy bicepRules is in Updating state from previous operation
I am sure that a short delay between deployments would help us loop through the whole array.
Only one Rule Collection Group can be updated at a time with Azure Firewall Policy. Since the update refreshes all of the connected Azure Firewall instances, the amount of time it takes to update is non-deterministic. Therefore you will need to serialize the deployment using the `@batchSize` decorator.
Can you try:
```bicep
@batchSize(1)
resource collectionGroups 'Microsoft.Network/firewallPolicies/ruleCollectionGroups@2021-02-01' = [for workload in workloads: {
  name: workload.name
  parent: afwPolicy
  properties: workload.properties
}]
```
I have two scenarios that come to mind from recent experience.
An overarching enterprise management-level policy is applied to a resource that has just been created, which I then reference in the next resource/module, causing the "Another operation is in progress" error. A retry would be useful here, as I have no control or influence over the policies.
I have also faced situations where a newly created resource is not available when referenced immediately afterwards, which I assume is a replication/caching issue, as the next run works flawlessly.
My scenario involves creating a Cosmos Account; this typically takes a few minutes and sometimes up to 10 minutes. In this case I am unable to use the resource output to set the connection string for use in subsequent modules, e.g. passing it into keyVault and functionAppSettings.
> My scenario involves creating a Cosmos Account; this typically takes a few minutes and sometimes up to 10 minutes.
@markjbrown - do you mind taking a look at this one? I'd expect the Cosmos Account not to report complete until it is fully provisioned. @zapadoody - do you happen to have a code sample of the repro and a correlation ID from when the error occurred?
For run-time deployment errors you should raise a support ticket as they are best equipped to diagnose specific errors with an activity id.
However, I am happy to look at an existing bicep file to see if there are any issues.
I do have a sample on how to output the endpoint and key from a Cosmos account and input into appSettings for an App Service here if that helps.
here's my cosmosAccount.bicep
```bicep
param location string
param cosmosAccountName string
param cosmosDefaultConsistencyPolicy string
param cosmosPrimaryRegion string
param cosmosSecondaryRegion string

var lowerCosmosAcctName = toLower(cosmosAccountName)
var locations = [
  {
    locationName: cosmosPrimaryRegion
    failoverPriority: 0
    isZoneRedundant: false
  }
  {
    locationName: cosmosSecondaryRegion
    failoverPriority: 1
    isZoneRedundant: false
  }
]

resource cosmosAccountResource 'Microsoft.DocumentDB/databaseAccounts@2021-06-15' = {
  name: lowerCosmosAcctName
  kind: 'GlobalDocumentDB'
  location: location
  properties: {
    locations: locations
    databaseAccountOfferType: 'Standard'
    enableAutomaticFailover: true
    consistencyPolicy: {
      defaultConsistencyLevel: cosmosDefaultConsistencyPolicy
    }
  }
}

output cosmosAccountResourceName string = cosmosAccountResource.name
```
here's the KeyVault.bicep
```bicep
param location string
param keyVaultName string
param productionPrincipalId string
param productionTenantId string
param stagingPrincipalId string
param stagingTenantId string

@secure()
param cosmosPrimaryConnectionString string

@secure()
param cosmosSecondaryConnectionString string

@secure()
param serviceStorageConnectionString string

@secure()
param appStorageConnectionString string

resource keyVault 'Microsoft.KeyVault/vaults@2019-09-01' = {
  name: keyVaultName
  location: location
  properties: {
    enabledForDeployment: true
    enabledForTemplateDeployment: true
    enabledForDiskEncryption: true
    tenantId: productionTenantId
    accessPolicies: [
      {
        tenantId: productionTenantId
        objectId: productionPrincipalId
        permissions: {
          secrets: [
            'get'
            'list'
          ]
        }
      }
      {
        tenantId: stagingTenantId
        objectId: stagingPrincipalId
        permissions: {
          secrets: [
            'get'
            'list'
          ]
        }
      }
    ]
    sku: {
      name: 'standard'
      family: 'A'
    }
  }
}

resource cosmosPrimaryConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/cosmosPrimaryConnectionString'
  properties: {
    value: cosmosPrimaryConnectionString
  }
  dependsOn: [
    keyVault
  ]
}

resource cosmosSecondaryConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/cosmosSecondaryConnectionString'
  properties: {
    value: cosmosSecondaryConnectionString
  }
  dependsOn: [
    keyVault
  ]
}

resource serviceStorageConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/dbConnectionString'
  properties: {
    value: serviceStorageConnectionString
  }
  dependsOn: [
    keyVault
  ]
}

resource appStorageConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/appStorageConnectionString'
  properties: {
    value: appStorageConnectionString
  }
  dependsOn: [
    keyVault
  ]
}

output appStorageConnectionStringUri string = appStorageConnectionStringSecret.properties.secretUri
output serviceStorageConnectionStringUri string = serviceStorageConnectionStringSecret.properties.secretUri
output cosmosPrimaryConnectionStringUri string = cosmosPrimaryConnectionStringSecret.properties.secretUri
output cosmosSecondaryConnectionStringUri string = cosmosSecondaryConnectionStringSecret.properties.secretUri
```
and here's the main.bicep
```bicep
/// cosmos db account, database and container module
module cosmosAccountMod '../cosmosAccount.bicep' = {
  name: 'cosmosAccount-${environmentName}-${buildNumber}'
  params: {
    cosmosAccountName: cosmosAccountName
    cosmosDefaultConsistencyPolicy: cosmosDefaultConsistencyPolicy
    cosmosPrimaryRegion: cosmosPrimaryRegion
    cosmosSecondaryRegion: cosmosSecondaryRegion
    location: location
  }
}

module cosmosDatabaseMod '../cosmosDbContainer.bicep' = {
  name: 'cosmosDBContainer-${environmentName}-${buildNumber}'
  params: {
    cosmosAccountName: cosmosAccountMod.outputs.cosmosAccountResourceName
    cosmosContainerName: cosmosContainerName
    cosmosDatabaseName: cosmosDatabaseName
    cosmosThroughput: cosmosThroughput
  }
  dependsOn: [
    cosmosAccountMod
  ]
}

// storage account module - storage for the tenants application
module appStorageAccountMod '../storageAccount.bicep' = {
  name: 'appStorageAcctName-${environmentName}-${buildNumber}'
  params: {
    storageAcctName: appStorageAcctName
    storageSkuName: appStorageAcctSku
    location: location
  }
}

// app insights module
module appInsightsMod '../appInsights.bicep' = {
  name: 'appInsightsName-${environmentName}-${buildNumber}'
  params: {
    name: appInsightsName
    resourceGroupLocation: location
  }
}

// app service plan module
module appServicePlanMod '../appServicePlan.bicep' = {
  name: 'appServicePlan-${environmentName}-${buildNumber}'
  params: {
    appSvcPlanSku: appSvcPlanSku
    appSvcPlanTier: appSvcPlanTier
    appSvcPlanName: appSvcPlanName
    appPlanLocation: location
  }
}

// function app module
module functionAppMod '../functionApp.bicep' = {
  name: 'functionApp-${environmentName}-${buildNumber}'
  params: {
    appSvcPlanName: appSvcPlanName
    functionAppName: functionAppName
    location: location
  }
  dependsOn: [
    appStorageAccountMod
    appServicePlanMod
    cosmosAccountMod
  ]
}

// service storage account module - storage for the function app
module serviceStorageAccountMod '../storageAccount.bicep' = {
  name: 'serviceStorageAcctName-${environmentName}-${buildNumber}'
  params: {
    storageAcctName: serviceStorageAcctName
    storageSkuName: serviceStorageAcctSku
    location: location
  }
}

// key vault module
module keyVaultMod '../keyVault.bicep' = {
  name: 'keyVaultName-${environmentName}-${buildNumber}'
  params: {
    keyVaultName: keyVaultName
    location: location
    cosmosPrimaryConnectionString: listConnectionStrings(resourceId('Microsoft.DocumentDB/databaseAccounts', cosmosAccountName), '2020-04-01').connectionStrings[0].connectionString
    cosmosSecondaryConnectionString: listConnectionStrings(resourceId('Microsoft.DocumentDB/databaseAccounts', cosmosAccountName), '2020-04-01').connectionStrings[1].connectionString
    productionPrincipalId: functionAppMod.outputs.productionPrincipalId
    productionTenantId: functionAppMod.outputs.productionTenantId
    stagingPrincipalId: functionAppMod.outputs.stagingPrincipalId
    stagingTenantId: functionAppMod.outputs.stagingTenantId
    serviceStorageConnectionString: serviceStorageAccountMod.outputs.storageAccountConnectionString
    appStorageConnectionString: appStorageAccountMod.outputs.storageAccountConnectionString
  }
  dependsOn: [
    functionAppMod
    cosmosAccountMod
    cosmosDatabaseMod
  ]
}

// function app settings module
module functionAppSettingMod '../functionAppSettings.bicep' = {
  name: 'functionAppSettings-${environmentName}-${buildNumber}'
  params: {
    appInsightsKey: appInsightsMod.outputs.appInsightsKey
    cosmosConnectionStringUri: keyVaultMod.outputs.cosmosPrimaryConnectionStringUri
    appStorageConnectionStringUri: keyVaultMod.outputs.appStorageConnectionStringUri
    serviceStorageConnectionStringUri: keyVaultMod.outputs.serviceStorageConnectionStringUri
    functionAppName: functionAppMod.outputs.prodSlotFunctionAppName
    functionAppStagingName: functionAppMod.outputs.stagingSlotFunctionAppName
  }
  dependsOn: [
    functionAppMod
    appInsightsMod
    cosmosAccountMod
    keyVaultMod
  ]
}
```
Also, to clarify: previously I was using the output from cosmosAccount.bicep, but I changed to the query approach to try and get away from the error. Thanks for the tip on raising the support ticket.
> For run-time deployment errors you should raise a support ticket as they are best equipped to diagnose specific errors with an activity id.
> However, I am happy to look at an existing bicep file to see if there are any issues.
> I do have a sample on how to output the endpoint and key from a Cosmos account and input into appSettings for an App Service here if that helps.
@alex-frankel Can you take a look at that? It seems the dependsOn is being fulfilled by the ack of the started and/or accepted responses rather than by succeeded.
> My scenario involves creating a Cosmos Account; this typically takes a few minutes and sometimes up to 10 minutes.
>
> @markjbrown - do you mind taking a look at this one? I'd expect the Cosmos Account not to report complete until it is fully provisioned. @zapadoody - do you happen to have a code sample of the repro and a correlation ID from when the error occurred?
@alex-frankel any thoughts on the bicep here? Also, I have opened a support case for this; if you need the ref #, let me know and I can send it directly.
The problem is this listConnectionStrings function. I've never seen it before. I tried testing in an ARM template and it doesn't work (not sure why the template didn't fail validation).
If you want to output the endpoint and keys, use the syntax below. To make a connection string, just concat them together with "AccountEndpoint=" and ";AccountKey=".
```json
"[reference(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName'))).documentEndpoint]"
"[listKeys(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName')), '2021-04-15').primaryMasterKey]"
```
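For anyone doing this in Bicep rather than ARM JSON, a rough equivalent sketch (assuming `cosmosAccountName` is in scope; same endpoint/key properties as the expressions above):

```bicep
// Sketch: build the connection string from the endpoint and primary key
// instead of calling listConnectionStrings.
resource cosmosAccount 'Microsoft.DocumentDB/databaseAccounts@2021-04-15' existing = {
  name: cosmosAccountName
}

var cosmosConnectionString = 'AccountEndpoint=${cosmosAccount.properties.documentEndpoint};AccountKey=${cosmosAccount.listKeys().primaryMasterKey}'
```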
> The problem is this listConnectionStrings function. I've never seen it before. I tried testing in an ARM template and it doesn't work (not sure why the template didn't fail validation).
> If you want to output the endpoint and keys, use the syntax below. To make a connection string, just concat them together with "AccountEndpoint=" and ";AccountKey=".
> `"[reference(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName'))).documentEndpoint]"` `"[listKeys(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName')), '2021-04-15').primaryMasterKey]"`
@markjbrown apologies thank you for the assistance!!!
@zapadoody did this resolve your issue now?
I think the most obvious reason why we need this is when you assign a role to an identity with Microsoft.Authorization/roleAssignments and then do something with the role and identity in the same template - with Microsoft.Resources/deploymentScripts, for instance, or using something from a key vault it just got permissions to. This is not really nice to work with right now, as it's almost guaranteed to fail on the first deployment, when the permissions are not set yet.
> I think the most obvious reason why we need this is when you assign a role to an identity with Microsoft.Authorization/roleAssignments and then do something with the role and identity in the same template - with Microsoft.Resources/deploymentScripts, for instance, or using something from a key vault it just got permissions to. This is not really nice to work with right now, as it's almost guaranteed to fail on the first deployment, when the permissions are not set yet.
In the role assignment template, try setting `principalType` to `ServicePrincipal`. It works like a charm in my environment.
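If it helps, a minimal sketch of that suggestion (`managedIdentity` and `roleDefinitionId` are placeholder symbols, not from this thread):

```bicep
// Sketch: role assignment with principalType set, which avoids the AAD
// replication race when assigning a role to a freshly created identity.
resource roleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(resourceGroup().id, managedIdentity.properties.principalId, roleDefinitionId)
  properties: {
    roleDefinitionId: roleDefinitionId
    principalId: managedIdentity.properties.principalId
    principalType: 'ServicePrincipal' // tells ARM not to fail if the new principal hasn't replicated yet
  }
}
```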
> > I think the most obvious reason why we need this is when you assign a role to an identity with Microsoft.Authorization/roleAssignments and then do something with the role and identity in the same template - with Microsoft.Resources/deploymentScripts, for instance, or using something from a key vault it just got permissions to. This is not really nice to work with right now, as it's almost guaranteed to fail on the first deployment, when the permissions are not set yet.
>
> In the role assignment template, try setting `principalType` to `ServicePrincipal`. It works like a charm in my environment.
Does that guarantee anything? Setting roles, even manually, does not guarantee instant assignment of a role; this is what Microsoft itself has documented, see https://docs.microsoft.com/en-us/azure/role-based-access-control/troubleshooting#role-assignment-changes-are-not-being-detected. In the worst cases it takes 30 minutes, and I've seen it take over 5 minutes myself. I'm not saying that you're wrong in your scenario, just saying that not all scenarios will be instant with RBAC assignments.
@erwinkramer is correct; there are 2 problems with replication in this RBAC scenario: 1) the MSI replicating through AAD/Azure so that a role can be assigned, and 2) the roleAssignment replicating through Azure so it takes effect. The `principalType` property solves the first but not the second.
> In the worst cases it takes 30 minutes, and I've seen it take over 5 minutes myself. I'm not saying that you're wrong in your scenario, just saying that not all scenarios will be instant with RBAC assignments.

This is the challenge with wait/retry in general... when do you know that you should, and how long do you wait for? We've talked about something like "wait until I can GET this resource", but that still has replication and fanout issues...
We understand the pain, and there are some workarounds (e.g. serial deployment of resources) - the current guidance from leadership is to solve the root cause.
For policy as well ... When you create an initiative definition then an initiative assignment > Error > Wait a bit between both > Success
> For policy as well ... When you create an initiative definition then an initiative assignment > Error > Wait a bit between both > Success
The Azure CLI `wait` command may be used to wait until a resource is provisioned with the 'Succeeded' state:
```shell
az deployment mg create --name deploymentName
az deployment mg wait --name deploymentName --created --management-group-id mgmtName
```
To add a comment here: I'm not sure why we are trying to find workarounds for a situation the resource provider should address. If the resource provider doesn't support concurrent operations, then serializing should be fine. However, if there's a situation where resource A returns the operation as complete but is still doing something (e.g. replication), then why is the resource provider signaling ARM that the operation is completed and it is ready for another operation?
> However, if there's a situation where resource A returns the operation as complete but is still doing something (e.g. replication), then why is the resource provider signaling ARM that the operation is completed and it is ready for another operation?
I'd rather see a completed status when it isn't yet fully replicated than see my deployment stalled because there is an outage in the datacenter it is trying to replicate to. If they could just make sure to check resources in the same datacenter it was deploying to, inside the same template, we wouldn't have this problem.
> @eja-git this isn't a "wait" scenario, it's a bug in the deployment engine job scheduling... the listKeys job is scheduled too early... so that's the fix for your particular scenario.
Is there any place where this bug is tracked? No matter what we do, we can't seem to consistently get the host key of a function after creating the resource. There's a litany of intermittent failures.
> However, if there's a situation where resource A returns the operation as complete but is still doing something (e.g. replication), then why is the resource provider signaling ARM that the operation is completed and it is ready for another operation?
>
> I'd rather see a completed status when it isn't yet fully replicated than see my deployment stalled because there is an outage in the datacenter it is trying to replicate to.
I don't know how I feel about designing an entire implementation around the use case of an outage. Another good example is Azure Firewall policies: if it's really not ready for a concurrent operation, why return the operation as completed before it's truly ready for another operation?
> ...a situation where resource A returns the operation as complete but is still doing something (e.g. replication), then why is the resource provider signaling ARM that the operation is completed and it is ready for another operation?
The resources themselves don't replicate (nor do they know they are being replicated) - this is in a "lower layer" in ARM. Under all of this we're talking about a physics problem... it's not like an outage problem, which ARM can deal with to a large extent. We have a globally distributed system in which every physical location needs to have instantaneous knowledge of every other location - since that will likely never be possible we're looking for a different approach.
@NickSpag - the listKeys issue isn't related to replication; the fix is in the scheduling, and the workaround is higher in the thread.
I mean, I agree replication is a bad example, but there are certainly resources (like SQL, Azure Firewall) where sending a PUT request immediately after the last one finished will result in a conflict because a previous operation is still running. Adding a special deployment-script resource to delay execution, and having subsequent resources dependsOn the deploymentScript, sounds like a hacky way to approach a situation that can go wrong in many ways, compared to just telling the resource provider "if you're not ready for another PUT request, you should stall the current one until you are". I thought that was the whole purpose behind the dependsOn mechanism.
> I thought that was the whole purpose behind the dependsOn mechanism.
dependsOn respects the status that the resource provider reports. If the resource tells us it's done deploying, then we move on. If it is not actually done when it says it is, that is something the resource provider should be fixing ASAP. They are violating the ARM resource provider contract, and a support case should be able to get it resolved. I recognize that doesn't always happen, but if we supply a workaround, the root cause is unlikely to get resolved.
@milope - I couldn't agree more. For that situation, since it's specific to each resourceType, it would be good to call those out - i.e. which resources are behaving badly. We don't have a really good way to force the "fix" for each case across hundreds of resourceTypes, but it's becoming more visible, so we'd like to start making traction on it.
Facing a similar challenge: a resource deployment failed with an internal server error. There should be a retry mechanism; otherwise there is no way to make the script robust, as we can get such errors from Azure frequently. The same template worked when we deployed it manually.
I will second @Ppkd2021's use case. It's not uncommon to see intermittent failures on certain resources that clear with a redeploy. Last week, cognitive language and search services were trading off with similar InternalServerError or TerminalState errors, which were disrupting our dynamic environments built from PRs. I could see how, from a correctness standpoint, the feature could encourage poor design, but it would be a real quality-of-life benefit from a practical perspective. Appreciate the consideration!
I want to chime in as well. I'm creating a private endpoint and after that a privateDnsZones resource. Even with dependsOn, it still says it cannot find the private endpoint. My workaround is deploying bogus resources that take roughly enough time for the private endpoint to finish, or using yaml afterwards. I don't like creating extra resources as a wait mechanism and deleting them afterwards, and having to use yaml to deploy resources outside of bicep afterwards shows that bicep has big gaps. If a fresh deployment fails because dependsOn isn't enough, bicep is not fully workable imho. In a year's time, when a colleague uses that pipeline for something else, they will waste time debugging what is effectively a bicep bug.

What's the purpose of using bicep, or any other language for that matter, if it's not capable of doing a fresh deployment of resources?

I hope this issue will be put on the todo list, as it currently makes using bicep not a really satisfying solution.
@mennolaan - do you happen to have a template we could use to repro?
I sometimes see errors like:
```json
{
  "status": "Failed",
  "error": {
    "code": "RetryableError",
    "message": "A retryable error occurred.",
```
Which would be a sensible thing to say if we were provided with a programmatic way to retry. Is there anything we can provide to help get this old RFE prioritized?
I see the same "retryable error occurred" constantly for various resources as well.
I get the same error "BMSUserErrorObjectLocked", "message": "Another operation is in progress on the selected item." when re-running the same deployment to add a VM to a backup policy. Here is my code for reference:
```bicep
@description('Optional. Add existing Azure virtual machine(s) to backup policy.')
@metadata({
  resourceId: 'Azure virtual machine resource id.'
  backupPolicyName: 'Backup policy name.'
})
param addVmToBackupPolicy array = []

var vmBackupConfig = [for vm in addVmToBackupPolicy: {
  backupPolicyName: vm.backupPolicyName
  resourceId: vm.resourceId
  backupFabric: 'Azure'
  protectionContainer: 'iaasvmcontainer;iaasvmcontainerv2;${split(vm.resourceId, '/')[4]};${last(split(vm.resourceId, '/'))}'
  protectedItem: 'vm;iaasvmcontainerv2;${split(vm.resourceId, '/')[4]};${last(split(vm.resourceId, '/'))}'
}]

resource vaultProtectedItem 'Microsoft.RecoveryServices/vaults/backupFabrics/protectionContainers/protectedItems@2022-03-01' = [for vm in vmBackupConfig: {
  name: '${vault.name}/${vm.backupFabric}/${vm.protectionContainer}/${vm.protectedItem}'
  properties: {
    protectedItemType: 'Microsoft.Compute/virtualMachines'
    policyId: '${vault.id}/backupPolicies/${vm.backupPolicyName}'
    sourceResourceId: vm.resourceId
  }
}]
```
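Since the vault appears to accept only one protection operation at a time, serializing the loop with the `@batchSize` decorator (the same approach suggested earlier in this thread for firewall policy rule collection groups) might help. A sketch over the loop above:

```bicep
// Sketch: deploy the protected items one at a time so the vault never sees
// two concurrent operations; the loop body is unchanged from the snippet above.
@batchSize(1)
resource vaultProtectedItem 'Microsoft.RecoveryServices/vaults/backupFabrics/protectionContainers/protectedItems@2022-03-01' = [for vm in vmBackupConfig: {
  name: '${vault.name}/${vm.backupFabric}/${vm.protectionContainer}/${vm.protectedItem}'
  properties: {
    protectedItemType: 'Microsoft.Compute/virtualMachines'
    policyId: '${vault.id}/backupPolicies/${vm.backupPolicyName}'
    sourceResourceId: vm.resourceId
  }
}]
```

This only serializes items within a single deployment, so it may not help if a separate deployment is adding VMs to the vault at the same time.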
Maybe adding some deployment delays in the template will work for you. Try following this article: ARM Templates - Resource Deployment Delay
@pavelrozenberg thanks for the suggestion, but I think something else is at play here, as it works the first time; it's just subsequent deployments/additions of VMs that result in the issue.
I'm using an Azure Policy to register DNS records for Azure PaaS services when they are created with a private endpoint. That process takes around 8 minutes to complete. So when a deployment creates a private endpoint together with a service that is dependent on the existence of the record in a private DNS zone, the deployment fails because the deployment is quicker than the policy takes to register the DNS record.
An approach would be to have some kind of waitForExistenceOfResource function inside a dependsOn. That way we could delay deployment of a resource until another entity has finished deploying. An approach like this would need to time out after a certain period and not check for the resource's existence too often.
Example: when deploying, the extension is added to the VM, starts to install, and tries to resolve the AutomationAccountURL. But the policy that is responsible for adding the DNS records for the PE of the Automation Account hasn't completed, and therefore the extension fails.
My use case is SQL server deployment with auditing to Azure Monitor. To complete the configuration, I need to set diagnostic settings on the master database, but this does not exist until some time after the SQL server resource deployment completes. So even though the master database diagnostics depend on the SQL server deployment, the template deployment would fail on the first run with an error that the master database resource did not exist.

I have a workaround in place using an inline deployment script that just runs a Start-Sleep delay; it depends on the SQL server, and the master database diagnostics depend on the deployment script. The script doesn't actually need to do anything, to be honest; the time taken to run the script deployment is long enough for the master database to be created.

A waitForExistenceOfResource function would be ideal in this case, although just being able to set a delay on the deployment of a resource after its initial dependency is met would likely work just as well.
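For reference, the sleep-script workaround described above can be sketched roughly like this (the delay value, PowerShell version, and symbol names are assumptions to illustrate the shape, not a recommendation):

```bicep
// Sketch: a deployment script that only sleeps, used as a dependency fence
// between the SQL server and the master database diagnostic settings.
resource waitScript 'Microsoft.Resources/deploymentScripts@2020-10-01' = {
  name: 'wait-for-master-db'
  location: location
  kind: 'AzurePowerShell'
  properties: {
    azPowerShellVersion: '8.3'
    scriptContent: 'Start-Sleep -Seconds 120' // assumed delay; tune for your environment
    retentionInterval: 'PT1H'
  }
  dependsOn: [
    sqlServer // placeholder symbol for the SQL server resource
  ]
}
```

The diagnostics resource then declares a dependsOn on `waitScript` instead of on the SQL server directly.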
Wait is available in the bicep registry as a workaround.
> Wait is available in the bicep registry as a workaround.
@Gordonby that's good to know. Is there any way this can be used inside a loop? See my previous post for an example.
ARM template deployment often fails with errors like:

> "Another operation is in progress on the selected item. If there is an in-progress operation, please retry after it has finished."
> "BMSUserErrorObjectLocked", "message": "Another operation is in progress on the selected item."

Just to clarify - this is not a dependency issue. An ARM deployment may fail if, for example, you try to add a VM to an RSV while another VM is being added at the same time: for a few seconds the RSV will not accept new clients, and as a result your deployment will fail.

Would like to have an option to pause a deployment and/or retry it - maybe introduce "wait" and "retry" deployment conditions, i.e: