Azure / deployment-stacks

Contains Deployment Stacks CLI scripts and releases
MIT License

Investigate retry loop on locked resource #90

Closed alex-frankel closed 1 year ago

alex-frankel commented 1 year ago

A lock on a nested/extension resource (something in Log Analytics), but not on its parent, caused a retry loop during deletion.

cc @slavizh / @snarkywolverine

slavizh commented 1 year ago

The delete behavior is the same even when the lock is directly on the resource that will be deleted: the deployment stack deployment times out and fails after 2 hours, I think.

snarkywolverine commented 1 year ago

I have been attempting to repro this using the steps in #81 - specifically item 3.

How are you securing the LogA workspace so it isn't deleted? I applied a read-only lock (Microsoft.Authorization/locks) on the workspace. I received the same error that you did, but the entire process (deployment + cleanup) took ~10 minutes rather than multiple hours.

slavizh commented 1 year ago

@snarkywolverine I had a delete lock. Not sure if it matters, but if you get the same results with a delete lock, I will try to reproduce.

snarkywolverine commented 1 year ago

I changed to a delete lock and the process still took <10 minutes. Please let me know your timing and correlation ID so I can investigate further.

slavizh commented 1 year ago

@snarkywolverine I also cannot reproduce it in the normal way. Previously I reproduced it through a bug that accidentally adds a referenced resource to the stack's managed resources. When that resource was deleted at a later point and had a lock, the deployment took around 2 hours. If that bug is fixed, we will never get into that situation, and in the normal scenario it takes less than 10 minutes to fail. So I think we are good, as the long deployment seems to be specific to that bug case.

Maybe what can be improved is the error explaining why the resource cannot be deleted.

First we get:

New-AzSubscriptionDeploymentStack: 11:26:39 - The deployment 'lz-analysis-services-monitoring' failed with error(s). Showing 3 out of 3 error(s).
Error: Code=DeploymentStackUpdateFailed; Message=One or more stages of the deploymentStack failed. Correlation id: 'c579637b-7531-407f-b532-01285a73038d'

Error: Code=DeploymentStackDeleteResourcesFailed; Message=One or more resources could not be deleted.

Error: Code=DeploymentStackDeleteResourcesFailed; Message=An unknown error occurred while trying to delete resources. These resources are still present in the stack but can be deleted manually.

The way I see it, there is just one error: a resource cannot be deleted.

Further below in the deployment output we also get:

FailedResources             : {
                                id: /subscriptions/<sub id>/resourceGroups/test-analysisserv-rg/providers/Microsoft.Insights/scheduledQueryRules/f9125df9-eb5d-4967-8517-52c68f6f9dd2
                                error: Resource could not be deleted. Resource is still present in stack.
                              }
Error                       : One or more stages of the deploymentStack failed. Correlation id: 'c579637b-7531-407f-b532-01285a73038d' (Code: DeploymentStackUpdateFailed)
                               - One or more resources could not be deleted. (Code: DeploymentStackDeleteResourcesFailed)
                                 - An unknown error occurred while trying to delete resources. These resources are still present in the stack but can be deleted manually. (Code:DeploymentStackDeleteResourcesFailed)

I think the PowerShell error and the deployment output error should be very similar or the same; otherwise you start to wonder which one is the actual issue. Second, there is only one error, not 3. Third, you should be able to see why the resource fails deletion, which is because it has a lock. For example, the activity log surfaces such information, so you are probably not getting the real reason.
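
To illustrate the "one error, not 3" point, here is a minimal sketch in Python. It assumes the error is a nested tree of code/message/details entries, as the indentation in the deployment output suggests. `StackError`, `flatten`, and `root_causes` are hypothetical names made up for this example; they are not part of Az PowerShell or the Deployment Stacks API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StackError:
    """Hypothetical model of one entry in the nested error chain."""
    code: str
    message: str
    details: list[StackError] = field(default_factory=list)


def flatten(error: StackError):
    """Yield the error and every nested detail error, outermost first."""
    yield error
    for child in error.details:
        yield from flatten(child)


def root_causes(error: StackError) -> list[StackError]:
    """Return only the leaf errors, i.e. entries with no nested details."""
    return [e for e in flatten(error) if not e.details]


# Error chain transcribed from the deployment output quoted above.
chain = StackError(
    "DeploymentStackUpdateFailed",
    "One or more stages of the deploymentStack failed. "
    "Correlation id: 'c579637b-7531-407f-b532-01285a73038d'",
    [
        StackError(
            "DeploymentStackDeleteResourcesFailed",
            "One or more resources could not be deleted.",
            [
                StackError(
                    "DeploymentStackDeleteResourcesFailed",
                    "An unknown error occurred while trying to delete "
                    "resources. These resources are still present in the "
                    "stack but can be deleted manually.",
                )
            ],
        )
    ],
)

# Counting every level gives 3 entries, but only one of them is a leaf.
for err in root_causes(chain):
    print(f"{err.code}: {err.message}")
```

Under these assumptions, the three reported errors collapse to a single leaf entry. That leaf still says "unknown error", so the lock as the actual reason would have to come from somewhere else, such as the activity log.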

snarkywolverine commented 1 year ago

Can you try to repro using the referenced resource bug from #81 and see if it still takes 2 hours? If so, I'd like the correlation ID from that update run -- if the delete process is what's timing out after 2 hours, it shouldn't matter how the resource was added to the stack (though I understand that #81 makes this easier to reproduce...).

slavizh commented 1 year ago

Unfortunately I cannot reproduce bug #81 anymore, so maybe it was fixed?

snarkywolverine commented 1 year ago

As I recall, your environment for #81 was in West Europe? It probably will repro in either of the UK regions, or possibly North Europe, if you want to try there.

slavizh commented 1 year ago

I was able to reproduce #81 in UK West, but I was not able to reproduce the long delay/timeout when deleting a locked resource. Either there has been some other change, or there is some step that I missed when reproducing it before, and now I do not know which one it is. Apologies for taking your time on this.

snarkywolverine commented 1 year ago

Thanks for confirming -- I'll go ahead and close this.