Azure / deployment-stacks

Contains Deployment Stacks CLI scripts and releases

Resource deletion fails with error "Resource could not be deleted. Resource is still present in stack" but nothing else #158

Closed hallgeir-osterbo-visma closed 6 months ago

hallgeir-osterbo-visma commented 7 months ago

Describe the bug

We have a rather large Bicep template that we deploy through a deployment stack. Some resources (app services and multiple dependent services) are created conditionally, based on the value of a variable. We set that variable to false and expected the resources in question to be deleted, as they have been in other environments where we did exactly the same. Instead, deletion failed for every resource that should have been deleted; none appeared to succeed. ALL the failures gave the following error:

{
  "code": "DeletionFailed",
  "message": "Resource could not be deleted. Resource is still present in stack."
}

The following error is also reported, for the whole deployment stack:

{
  "code": "DeploymentStackDeleteResourcesFailed",
  "message": "One or more resources could not be deleted. Correlation id: '7779ff0e-35cb-4f30-9b04-b4e58a5b1ee7'.",
  "details": [
    {
      "code": "DeploymentStackDeleteResourcesFailed",
      "message": "An unknown error occurred while trying to delete resources. These resources are still present in the stack but can be deleted manually."
    }
  ]
}

One thing I noticed: when I checked the deployment stack's "Managed resources" while the deletion was in progress, the resources in question were NOT there, as I would expect, since they were not deployed by the template. However, after the deletion failed, those resources were back in "Managed resources" and also listed under "Failed deletions". Probably because they were not actually deleted, so they should be in managed resources? I assume so, but I'm mentioning it in case it's relevant.

It would be great to know what more is hiding behind the correlation ID.
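For anyone collecting these details before filing a report, here is a minimal Python sketch that walks the stack-level error shown above and pulls out the correlation ID. The payload is copied from this report; the helper names (`flatten_errors`, `find_correlation_id`) and the regular expression are illustrative assumptions, not part of any Azure tooling.

```python
import json
import re

# Stack-level error payload copied from the report above.
ERROR_JSON = """
{
  "code": "DeploymentStackDeleteResourcesFailed",
  "message": "One or more resources could not be deleted. Correlation id: '7779ff0e-35cb-4f30-9b04-b4e58a5b1ee7'.",
  "details": [
    {
      "code": "DeploymentStackDeleteResourcesFailed",
      "message": "An unknown error occurred while trying to delete resources. These resources are still present in the stack but can be deleted manually."
    }
  ]
}
"""

# Assumption: the ID appears in a message as "Correlation id: '<guid>'".
CORRELATION_RE = re.compile(r"Correlation id: '([0-9a-fA-F-]{36})'")


def flatten_errors(error):
    """Yield (code, message) for an error and every nested entry under 'details'."""
    yield error.get("code"), error.get("message")
    for detail in error.get("details", []):
        yield from flatten_errors(detail)


def find_correlation_id(error):
    """Return the first correlation ID found in any message, or None."""
    for _code, message in flatten_errors(error):
        match = CORRELATION_RE.search(message or "")
        if match:
            return match.group(1)
    return None


error = json.loads(ERROR_JSON)
for code, message in flatten_errors(error):
    print(f"{code}: {message}")
print("Correlation ID:", find_correlation_id(error))
```

Note that the nested detail carries no more information than the top-level message, which is exactly the problem described in this report.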

To Reproduce I don't have any clear reproduction steps because I don't know why this suddenly started happening. All we did was:

  1. Deploy with the variable that controls existence of those resources as true (i.e. all resources are deployed).
  2. Set the variable to false to disable deployment of those resources -- which should then be deleted.

Expected behavior I'd expect the resources that are no longer created because of the condition to be deleted. If there is a "valid" reason for the deletion to fail, it should show a proper error message stating WHY it failed.

Repro Environment
Host OS: Windows Server 2019
PowerShell Version:

Server Debugging Information
Correlation ID: 7779ff0e-35cb-4f30-9b04-b4e58a5b1ee7
Tenant ID: c166b9c4-5053-4eec-9665-aba0782d0804
Timestamp of issue (please include time zone): Mar 8, 2024, 1:14 PM UTC+1
Data Center (eg, West Central US, West Europe): West Europe

Additional context

I cannot share the template on a public forum. However, if there's a secure way I can share it, IF it is relevant, let me know.

hallgeir-osterbo-visma commented 6 months ago

An additional comment: If I delete one of the resources manually, and then re-run the deployment with the deployment stack, then that resource still shows up in the list of failed deletions.

snarkywolverine commented 6 months ago

Hi @hallgeir-osterbo-visma!

It looks like three resources failed in the correlation ID you provided - KeyVault secrets and a Microsoft.Web/site.

The KeyVault secrets are a known issue - see #142 - and we are still working on improvements there. While the stack still won't be able to delete KV secrets, the stack should not list them as a managed resource if they're already deleted.

The other issue - Microsoft.Web/site - looks like a different issue. I would expect that would be resolved as of earlier today (see issue #159), and it's also something we are working to prevent in the future.

Let me know if you have any further questions; if not, I'd like to close this and track the solutions as part of the other bugs. Let me know if you have any concerns with that plan.

hallgeir-osterbo-visma commented 6 months ago

Ok, thanks for looking into the correlation ID! I will try removing those secrets first.

But it's a bit odd - I usually get proper errors for WHAT went wrong - that was not the case this time. One thing is the actual removal of the resources failing, another thing is having proper error messages so that I can actually figure out WHY it fails and potentially how to fix it.

Any chance of keeping this open to look into why I don't get any error messages other than "Resource could not be deleted"? I also wonder why so many related resources failed to delete. You mentioned you only see three resources that failed to be deleted; I see 38, including private DNS entries, public DNS zone entries, Front Door origin groups and more.
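To make a count mismatch like this easier to compare, here is a rough Python sketch that tallies failed deletions by resource type. It assumes the stack's JSON exposes a `failedResources` list whose entries carry an ARM resource `id` and an `error` object; that shape, and every resource ID in the sample, are assumptions for illustration and not data from this stack.

```python
from collections import Counter


def resource_type(resource_id):
    """Derive 'Namespace/type[/subtype]' from an ARM resource ID."""
    # Use the last '/providers/' segment so nested providers resolve to their own type.
    _head, sep, tail = resource_id.rpartition("/providers/")
    if not sep:
        return "(no provider segment)"
    parts = tail.strip("/").split("/")
    namespace, rest = parts[0], parts[1:]
    # After the namespace, segments alternate: type, name, subtype, name, ...
    return "/".join([namespace] + rest[0::2])


# Hypothetical sample; real data would come from the stack's own output.
RG = "/subscriptions/0000/resourceGroups/rg-demo/providers"
ERR = {
    "code": "DeletionFailed",
    "message": "Resource could not be deleted. Resource is still present in stack.",
}
stack = {
    "failedResources": [
        {"id": f"{RG}/Microsoft.KeyVault/vaults/kv-demo/secrets/secret-a", "error": ERR},
        {"id": f"{RG}/Microsoft.KeyVault/vaults/kv-demo/secrets/secret-b", "error": ERR},
        {"id": f"{RG}/Microsoft.Web/sites/app-demo", "error": ERR},
        {"id": f"{RG}/Microsoft.Network/privateDnsZones/demo.internal/A/app-demo", "error": ERR},
    ]
}

tally = Counter(resource_type(item["id"]) for item in stack["failedResources"])
for rtype, count in tally.most_common():
    print(f"{count:3d}  {rtype}")
```

Grouping by type makes it quicker to see whether dependent resources (DNS records, origin groups) are failing alongside the ones already identified.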

hallgeir-osterbo-visma commented 6 months ago

I have now removed all the secrets and app services in the stack, and this time it behaved a bit differently:

On the positive side, the error messages for those two secrets are now showing up.

Correlation ID: 6af9d192-2a22-4717-aba0-f755ba64637f

I will retry the deployment yet again and see if it has noticed the keys are gone by then.

snarkywolverine commented 6 months ago

Sorry, I recognize I wasn't clear earlier, when I said:

The KeyVault secrets are a known issue - see https://github.com/Azure/deployment-stacks/issues/142 - and we are still working on improvements there. While the stack still won't be able to delete KV secrets, the stack should not list them as a managed resource if they're already deleted.

That's the future/planned behavior, rather than the current behavior. Right now, the only way to remove KV secrets is to re-run the stack with the detach flag (assuming the template hasn't changed, and only KV secret resources are failing).
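As a rough illustration of that workaround, the Python sketch below assembles an Azure CLI invocation that re-runs a resource-group-scoped stack with unmanaged resources detached rather than deleted. It assumes a CLI version that accepts `--action-on-unmanage detachAll`; tooling from the time of this thread may have spelled the option differently. The stack name, resource group, template path and the `none` deny-settings mode are placeholders, and the command is only printed, not executed.

```python
import shlex


def build_detach_command(stack_name, resource_group, template_file):
    """Build the argument list for re-running a stack in detach mode."""
    return [
        "az", "stack", "group", "create",
        "--name", stack_name,
        "--resource-group", resource_group,
        "--template-file", template_file,
        # Detach resources that are no longer in the template instead of deleting them.
        "--action-on-unmanage", "detachAll",
        # Placeholder: keep whatever deny-settings mode the stack already uses.
        "--deny-settings-mode", "none",
    ]


# Hypothetical names, for illustration only.
command = build_detach_command("my-stack", "my-rg", "main.bicep")
print(shlex.join(command))
```

To actually run it, the list could be passed to `subprocess.run(command, check=True)`. Detaching leaves the resources in Azure, so anything that should really be gone still has to be removed by hand afterwards.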

hallgeir-osterbo-visma commented 6 months ago

Right! Thanks for the clarification on that.

I'm still curious why the error messages did not reflect what was actually failing, though. It seems to be better now: yesterday I re-ran one of the failed deployments without deleting the secrets first, and it deleted everything but the secrets. So that's good. Was something fixed on the backend in the meantime?

dantedallag commented 6 months ago

@hallgeir-osterbo-visma The issue with the error not surfacing is a bug that is being tracked and we are currently looking into it. It may have been something transient that we are still looking to confirm. If you were able to proceed without hitting this error, I think we can close this issue. If you face it in the future, feel free to open another issue.

hallgeir-osterbo-visma commented 6 months ago

@dantedallag If it's already being tracked then I'm happy and this can be closed. 👍 Thank you all!