Azure / deployment-stacks

Contains Deployment Stacks CLI scripts and releases
MIT License

Deployment Stack - Resource deletion order #173

Open AlexanderSehr opened 4 months ago

AlexanderSehr commented 4 months ago

Bicep version
Bicep CLI version 0.28.1 (ba1e9f8c1e)

Describe the bug
Running a command like Remove-AzSubscriptionDeploymentStack -Name 'imageStack' -DeleteAll -Verbose -Force for a Deployment Stack of the same name does not appear to consider certain constraints regarding the order of resource deletion.

In the given case I'm deploying (amongst other things) a User-Assigned Identity plus an Image Template to which that identity is attached. When removing these resources again, it is vital that the User-Assigned Identity is removed after the Image Template, as the Image Template otherwise ends up in a non-removable state. This looks like this: (see attached screenshot)

In the terminal you get a stuck command (which does not respond to cancellation keys like Ctrl + C), for example:

> Remove-AzSubscriptionDeploymentStack -Name 'imageStack' -DeleteAll -Verbose -Force
VERBOSE: Performing the operation "Deleting Deployment Stack ..." on target "imageStack".

This is just one example of many and the reason why we implemented a very specific removal order & logic in the AVM CI.
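For illustration, the kind of dependency-aware deletion order described here can be sketched as follows. This is a hypothetical Python sketch, not the actual AVM CI code; the resource names, the dependency map, and the function are all illustrative:

```python
# Hypothetical sketch: order resources for deletion so that every
# dependent resource is deleted BEFORE the resources it depends on
# (e.g. the Image Template before the User-Assigned Identity it uses).

def ordered_for_deletion(resources, depends_on):
    """Return resources in an order where each resource appears
    before anything it depends on (reverse topological order)."""
    ordered = []
    visited = set()

    def visit(res):
        if res in visited:
            return
        visited.add(res)
        # Anything that depends on `res` must be deleted first.
        for other in resources:
            if res in depends_on.get(other, ()):
                visit(other)
        ordered.append(res)

    for res in resources:
        visit(res)
    return ordered

# Illustrative inventory from the issue: the Image Template uses the MSI.
resources = ['imageTemplate', 'userAssignedIdentity', 'storageAccount']
depends_on = {'imageTemplate': {'userAssignedIdentity'}}

print(ordered_for_deletion(resources, depends_on))
# → ['imageTemplate', 'userAssignedIdentity', 'storageAccount']
```

With such an ordering, the Image Template is always removed before the identity it is attached to, which is exactly the constraint the stack's bulk delete currently violates.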

Naturally, I do not know how Deployment Stacks handle removals in the background, but I suspect it just runs a removal on the Resource Group or similar. Regardless, the logic may need to be updated to be more granular (and ideally also handle purge protection, etc.). Happy to talk through some of the things we identified in CARML & AVM over the years and the logic we implemented for the same.

Until these issues are addressed, we won't be able to replace our current removal logic with Deployment Stacks.

To Reproduce
Steps to reproduce the behavior:

It should be enough to deploy the MSI + Image Template (alongside any required resources) - OR - if you want to test with the full example, deploy this template with updated parameters for unique resources like the ACR & Storage Account, using the command

> New-AzSubscriptionDeploymentStack -Name 'imageStack' -Description 'This is a first test deployment using Deployment Stacks' -Location 'WestEurope' -TemplateFile '(...)\constructs\azureImageBuilder\deploymentFiles\sbx.image.bicep' -Verbose -TemplateParameterObject @{ deploymentsToPerform = 'All' } -DenySettingsMode 'None'

followed by the eventual removal command.

Additional context
PS: Fundamentally a very cool feature though (:

snarkywolverine commented 4 months ago

@AlexanderSehr Do you have a Correlation ID of the DELETE request?

If I understand the scenario correctly, the DELETE of the image template fails because the managed identity is already gone - is that correct? If so, does that make logical sense to you as a customer? I'm asking because I'm not very familiar with image templates, but I would have expected the image template deletion to succeed regardless of MSI state.

AlexanderSehr commented 4 months ago

Hey @snarkywolverine, I hope you're well. You can find the deployment of the original stack via this correlation ID: 83067d53-1a92-4d07-ad22-13f6b4ce4490. Note that it shows an unrelated error at a late stage of the deployment, which is due to a networking issue and not linked to the issue I describe above. Instead, the issue came up after the deployment, when running the Deployment Stack removal.

It may appear I did not do a great job describing the order of events in my original post, leading to some confusion. Sorry for that; let me try again. For context: the Image Template resource takes a user-assigned identity and effectively uses it to create some Azure resources to build an image, and that's it. It's the only requirement it really has. Now, as you correctly stated, its removal failed/got stuck because the Deployment Stack destroy operation first deleted the MSI and then tried to delete the Image Template. If that happens, the following things are true:

Does that make sense to me as a customer? No, certainly not 😄. It's just the way the resource is implemented, and sure, it would be nice if the corresponding PG 'fixed' that issue. However, it's just one of many, many resources that face similar issues (e.g., diagnosticSettings, container groups, role assignments, etc.) in that the order of removal, and the fact that they're explicitly removed, matters. It's the sole reason we've put so much work into our own CI in the BRM repository and declared an explicit order for resources that we know must be removed in a set order.

alex-frankel commented 4 months ago

@snarkywolverine do we know if the RG delete algorithm accounts for some of these behaviors and/or does Bulk Delete allow for any sequencing?

slavizh commented 4 months ago

The behavior as it was previously explained to me is that the order does not matter: the deployment stack retries the deletion operation multiple times until it succeeds or a certain timeout is reached. So if you have dependent resources, their dependencies (if present in the stack) will eventually get deleted. The above error sounds like the RP implemented some custom logic that requires resources to be deleted in a specific order before they can be deleted at all. This is not something RPs should implement.

AlexanderSehr commented 4 months ago

> The behavior as it was previously explained to me is that the order does not matter: the deployment stack retries the deletion operation multiple times until it succeeds or a certain timeout is reached. So if you have dependent resources, their dependencies (if present in the stack) will eventually get deleted. The above error sounds like the RP implemented some custom logic that requires resources to be deleted in a specific order before they can be deleted at all. This is not something RPs should implement.

100%

It's the same reason why we built in a retry that moves resources back in the removal list if they still depend on something else (e.g., a VNET cannot be deleted before a VM that's deployed into it, so the VNET removal may fail the first time around but succeed the second). Turns out quite a few RPs may have had different plans (maybe for good reason, who knows).
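The "retry and move back in the list" approach can be sketched roughly like this. Again, this is a hypothetical Python sketch under the assumptions above; the resource names, the try_delete helper, and the round limit are all illustrative, not the actual CI implementation:

```python
# Hypothetical sketch: attempt to delete every resource; if a delete
# fails because something still depends on it, requeue it and try
# again on the next pass, up to a maximum number of rounds.
from collections import deque

def delete_all(resources, try_delete, max_rounds=5):
    queue = deque(resources)
    for _ in range(max_rounds):
        if not queue:
            return True
        failed = []
        while queue:
            res = queue.popleft()
            if not try_delete(res):
                failed.append(res)  # still blocked, retry next round
        queue.extend(failed)
    return not queue  # False if something never became deletable

# Illustrative: the VNET only deletes once the VM in it is gone.
alive = {'vm', 'vnet'}
def try_delete(res):
    if res == 'vnet' and 'vm' in alive:
        return False  # first attempt fails while the VM still exists
    alive.discard(res)
    return True

print(delete_all(['vnet', 'vm'], try_delete))  # → True
```

Here the VNET deletion fails on the first pass, the VM is deleted, and the VNET then succeeds on the second pass, matching the behavior described above.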