Azure / deployment-stacks

Contains Deployment Stacks CLI scripts and releases
MIT License

Deployment stack deployment attempts to delete referenced resources #147

Closed hallgeir-osterbo-visma closed 4 months ago

hallgeir-osterbo-visma commented 5 months ago

Describe the bug

We have two sets of Bicep templates, referred to below as "shared" and "app", that we deploy using deployment stacks.

The "app" template references resources that are managed by the "shared" template using the existing keyword.

When we deployed the latest version of the "app" template today for one of the environments, the stack tried to delete multiple resources that are NOT managed by the "app" template:

There are more referenced resources that the stack did NOT attempt to delete, including app service plans and other storage accounts. When I look at the list of managed resources, I can see that the resources it attempted to delete are NOT present. However, other referenced resources are there.

THANKFULLY the deletion failed because the "app" template has child resources for these resources. Otherwise we would have a major cleanup on our hands.

We have not made any recent changes to the referenced resources either.

To Reproduce

I do not have any clear reproduction steps. This just started happening for us today.

Expected behavior

The deployment stack should NEVER delete referenced resources, and especially not seemingly at random as I'm seeing here.

To me, this seems like a really, really bad bug.

Screenshots

Company policy is to not reveal internal resource names etc. so these have been redacted. If Microsoft needs these, please let me know how I can share it securely.

[image: screenshot with resource names redacted]

Repro Environment

Host OS:
Powershell Version:

Server Debugging Information

Correlation ID: 8239eed2-717e-4801-b8d7-daf5c7272aea
Tenant ID: c166b9c4-5053-4eec-9665-aba0782d0804
Timestamp of issue (please include time zone): Jan 24, 2024, 7:59 AM UTC+1
Data Center (eg, West Central US, West Europe): West Europe, North Europe

hallgeir-osterbo-visma commented 5 months ago

It happened again on another deployment for another environment, using the same template as we used in the previous deploy. Correlation ID: 81c32826-11ba-427a-92a6-b26e0d2f92ea It attempted to delete different resources this time. Deletion thankfully failed again due to child resources being present in the current deployment stack.

For us, it looks like deployment stacks are completely broken atm.

snarkywolverine commented 5 months ago

@hallgeir-osterbo-visma Are you using symbolic names for codegen, as mentioned here? https://github.com/Azure/deployment-stacks/issues/132#issuecomment-1803317989

The symptoms you describe sound similar to that issue, and the fix has not yet been deployed in North Europe -- which appears to have impacted both correlation IDs you specified.

hallgeir-osterbo-visma commented 5 months ago

Thanks for the reply! Unless symbolicNameCodegen is enabled by default, we do not use it. It is not specified in our bicepConfig.json.

There are not only North Europe resources here, though:

I notice @alex-frankel mentioning (quote - with added emphasis):

"existing resources are not included in the list of managed resources, so we shouldn't be attempting to delete either acr_a or acr_b. Have you experienced a case where we tried to delete an existing resource? If so, that would be a bug we need to get resolved."

The part I emphasized here - does this mean there has been a recent change, so that resources declared with existing are no longer added as managed resources, but were earlier? Could this transition cause issues like the ones described?

I also see @peter-de-wit mentioning:

I also ran into this. I use deployment stacks as a parent (root) deployment with additional child deployment stack features. When referring to resources within the parent stack and trying to delete the child stack, it tries to delete resources from the root stack. Luckily, due to the 'cannot delete' functionality within the root stack, this deletion is not completed. But this does bring up the problem that deleting the child stack is only possible by setting the deletion mode to 'detach'.

A more detailed scenario:

main stack: deploys an automation account.

child stack: deploys variables within the automation account.

Deleting the child stack (with deleteresources) results in an attempt to delete the automation account also.

This looks eerily similar to what I'm experiencing (although I do not attempt to delete the child stack, I merely deploy it).

alex-frankel commented 5 months ago

There was a bug where existing resources were being added to the list of managed resources by mistake. That's what the fix that is rolling out now is resolving.

The location of your deployment is determined by the location of your resource group (in an rg-scoped deployment), not the location of deployed resources.

Even if you have not explicitly opted in to symbolicNameCodegen, we still switch to it depending on which bicep features you are using. One example is userDefinedTypes, but there may be other features that implicitly opt you in. To check this, you can look at the generated ARM template. If you see the property languageVersion, then you are using the symbolic name codegen.
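The check described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the Bicep file has already been compiled to ARM JSON (for example with `bicep build`) and that the property appears at the top level of the template. The function name and the two trimmed sample templates are hypothetical and not taken from this thread.

```python
import json


def uses_symbolic_name_codegen(arm_template_json: str) -> bool:
    """Return True if the compiled ARM template has a top-level
    languageVersion property (assumed here to be the marker of
    symbolic name codegen, as described in the comment above)."""
    template = json.loads(arm_template_json)
    return isinstance(template, dict) and "languageVersion" in template


# Hypothetical, heavily trimmed templates for illustration only.
symbolic = '{"languageVersion": "2.0", "resources": {}}'
classic = '{"contentVersion": "1.0.0.0", "resources": []}'

print(uses_symbolic_name_codegen(symbolic))
print(uses_symbolic_name_codegen(classic))
```

In practice the JSON string would be read from the compiled template file rather than written inline.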

I'm pretty confident that once the fix is fully rolled out, your issue will be resolved, which hopefully should be very soon. Apologies for the inconvenience this is causing and thank you for bearing with us.

If you'd like, feel free to share the complete bicep file that is triggering the issue and we can try to confirm these details for you.

hallgeir-osterbo-visma commented 5 months ago

We do indeed use userDefinedTypes, so I guess this does apply. Thanks for a thorough reply on this! Regarding the resource group location: of the initial set of resources that the stack attempted to delete, some are in a resource group in Norway East, and some (the service bus and storage account) are in a resource group in West Europe. That West Europe resource group contains resources from both West and North Europe, though.

Do you know at this point which regions have received the fix (if any)?

azcloudfarmer commented 5 months ago

Hi @hallgeir-osterbo-visma - the fix has been deployed to all regions with the exception of "North Europe" and "South Central US".

snarkywolverine commented 5 months ago

@hallgeir-osterbo-visma We're expecting the remaining regions to be updated and fixed within the next ~24 hours.

hallgeir-osterbo-visma commented 5 months ago

But then it's a bit odd that this also happens for us in regions that are NOT North Europe, if it is indeed the same issue. Or is it the case that if we deployed to e.g. westeurope BEFORE the fix was rolled out, the resource was added to the stack at that point, and now that the fix is rolled out, the stack tries to delete the resource because it is no longer part of the stack?

snarkywolverine commented 5 months ago

@hallgeir-osterbo-visma You are 100% correct -- the resource list was created at deployment time, while the issue was still occurring. When you run the stack now, with the delete flag, it will attempt to delete resources that are no longer part of the stack. The recommended workaround is to re-run the stack state with 'detach' mode to ensure no resources are deleted; subsequent iterations should now work as expected.
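The sequence described above can be modeled as a simple set difference. The sketch below is a conceptual illustration only, not the service's actual implementation. The resource names, function names, and the "delete" and "detach" mode strings are hypothetical stand-ins for the stack's behavior when resources leave the managed list.

```python
def resources_to_unmanage(previous_managed: set, current_managed: set) -> set:
    """Resources recorded by the previous run that the current
    template no longer manages."""
    return previous_managed - current_managed


def apply_unmanage(unmanaged: set, mode: str) -> dict:
    """Model what happens to resources that dropped out of the stack."""
    if mode == "delete":
        return {"deleted": sorted(unmanaged), "detached": []}
    if mode == "detach":
        return {"deleted": [], "detached": sorted(unmanaged)}
    raise ValueError(f"unknown mode: {mode}")


# Before the fix: 'existing' references were recorded as managed by mistake.
before_fix = {"app/site", "shared/storage", "shared/servicebus"}
# After the fix: only resources the template actually deploys are managed.
after_fix = {"app/site"}

dropped = resources_to_unmanage(before_fix, after_fix)
print(apply_unmanage(dropped, "delete"))  # shared resources would be deleted
print(apply_unmanage(dropped, "detach"))  # shared resources are only released
```

Running once in detach mode brings the recorded list in line with the corrected one, so a later run has nothing left to drop.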

The fix has now been released to all regions.

snarkywolverine commented 5 months ago

I went ahead and created #149 - and pinned it - to help raise awareness and clarify the scenario. Let us know if you have any further questions on this.

hallgeir-osterbo-visma commented 5 months ago

@snarkywolverine That's great to hear! Thanks a lot for a detailed update.

We're currently running with detach. We'll check back and make sure no unexpected resources are detached before switching back to delete mode.

azcloudfarmer commented 4 months ago

@hallgeir-osterbo-visma great to hear everything is working on your end! Closing the issue for now. Please let us know if anything changes.