Azure / deployment-stacks

Contains Deployment Stacks CLI scripts and releases
MIT License

Resources reported as failed to delete (DeploymentStackDeleteResourcesFailed), however resources are deleted but remaining in stack, and no more error details apart from correlation ID. #124

Closed hallgeir-osterbo-visma closed 6 months ago

hallgeir-osterbo-visma commented 10 months ago

Describe the bug

We're testing out deployment stacks for our infrastructure. Last night we removed a bunch of resources from our template, and in our CD pipeline we run:

az stack group create --resource-group ourresourcegroup --name app-dev --template-file .\main.bicep --parameters whatever.json --yes --deny-settings-mode None --delete-resources

We then get an error (the timestamp is in UTC+2):

August 23rd 2023 05:25:06 ERROR: (DeploymentStackDeleteResourcesFailed) One or more resources could not be deleted. Correlation id: '983f5032-4277-4ef8-8f1e-524b7672c768'. 
Code: DeploymentStackDeleteResourcesFailed 
Message: One or more resources could not be deleted. Correlation id: '983f5032-4277-4ef8-8f1e-524b7672c768'. 
Exception Details:  (DeploymentStackDeleteResourcesFailed) An unknown error occurred while trying to delete resources. These resources are still present in the stack but can be deleted manually. 
    Code: DeploymentStackDeleteResourcesFailed 
    Message: An unknown error occurred while trying to delete resources. These resources are still present in the stack but can be deleted manually. 

I then go and check the resources in Azure. The resources are correctly deleted; only the ones that remain in our template are still there, as we would expect. UPDATE: MOST resources are deleted, but some remain, including role assignments and private DNS zone records (e.g. CNAME records). I have not been able to reproduce these failed deletions in a smaller environment. All resources, however, are still part of the deployment stack. I repeated the command today, and it failed again with the same error.

In the list of "Failed" resources I see this error for each resource: {"code":"DeletionFailed","message":"Resource could not be deleted. Resource is still present in stack."}

To Reproduce

I'm not sure what triggers this behavior; it happened seemingly at random for us. I have seen it once before, when I was testing with a smaller template. What I did (roughly):

  1. Create a template with an app service plan and a couple of app services.
  2. Deploy it with az stack group create.
  3. Comment out one of the app services and deploy again.
  4. Remove the comments and deploy again.
  5. Try deleting the stack with --delete-resources.

I was just fiddling around with resource deletion, really. Suddenly it failed with the same error as above. I can't guarantee that this exact sequence reproduces it; I was just testing a bunch of creations and deletions.

Template that I was able to use to reproduce the issue (not the same as the one in the error report above!):

param webAppName string = uniqueString(resourceGroup().id) // Generate unique String for web app name
param sku string = 'P1v2' // The SKU of App Service Plan
param location string = resourceGroup().location // Location for all resources
var appServicePlanName = toLower('AppServicePlan-${webAppName}')
var webSiteName = toLower('wapp-${webAppName}')

resource appServicePlan 'Microsoft.Web/serverfarms@2020-06-01' = {
  name: appServicePlanName
  location: location
  properties: {
    reserved: true
  }
  sku: {
    name: sku
  }
}

resource appService 'Microsoft.Web/sites@2020-06-01' = {
  name: webSiteName
  location: location
  properties: {
    serverFarmId: appServicePlan.id
    siteConfig: {
      netFrameworkVersion: 'v6.0'
    }
  }
}

resource anotherAppService 'Microsoft.Web/sites@2020-06-01' = {
  name: '${webSiteName}-another'
  location: location
  properties: {
    serverFarmId: appServicePlan.id
    siteConfig: {
      netFrameworkVersion: 'v6.0'
    }
  }
}

Expected behavior

I would expect the resources to be removed from the deployment stack and the operation not to fail. The resources are, after all, correctly removed.

Repro Environment

> az version
{
  "azure-cli": "2.51.0",
  "azure-cli-core": "2.51.0",    
  "azure-cli-telemetry": "1.1.0",
  "extensions": {
    "bastion": "0.2.5"
  }
}

Running on Windows 10.

Server Debugging Information

Correlation ID: 983f5032-4277-4ef8-8f1e-524b7672c768
Tenant ID: c166b9c4-5053-4eec-9665-aba0782d0804
Timestamp of issue (please include time zone): August 23rd 2023 05:25:06, GMT+2 / UTC+2
Data Center (eg, West Central US, West Europe): Deployment stack location: Norway East. Resources that were deleted: West Europe.

harshpatel17 commented 10 months ago

Hello, thanks for describing the issue. I'll try to reproduce the issue on my end and get back to you with my findings.

hallgeir-osterbo-visma commented 10 months ago

Is there some way to get some more info based on the correlation ID? I tried finding a way to open a support ticket for this, but couldn't find a fitting category (I guess because it's still in preview).

UPDATE (updated main post as well): It turns out that not all resources were deleted; some remain, including role assignments and some private DNS records. However, ALL resources that were supposed to be deleted, both those that actually were deleted and those that were not, are listed under the "Failed" section under Resources in the deployment stack.

I've also not been able to repro this issue on a smaller template after I made this issue (but I was able to with the attached template earlier -- I have no idea what triggers it however).
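To make the mismatch described above easier to see, the "Failed" list can be split into resources that really are left over and resources that are already gone. This is only a minimal sketch: `split_failed_resources` is a hypothetical helper, and it assumes you have already collected two lists of resource IDs by whatever means you prefer (the IDs reported as failed by the stack, and the IDs that still exist in Azure). IDs are compared case-insensitively because Azure resource IDs are not case-sensitive.

```python
from typing import Iterable, List, Tuple


def split_failed_resources(
    failed_ids: Iterable[str], existing_ids: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split IDs reported as failed into (still_present, already_gone).

    Hypothetical helper: both inputs are plain lists of resource ID strings
    that the caller has gathered beforehand.
    """
    # Normalize case once so the membership checks are case-insensitive.
    existing = {rid.lower() for rid in existing_ids}

    still_present = []
    already_gone = []
    for rid in failed_ids:
        if rid.lower() in existing:
            still_present.append(rid)
        else:
            already_gone.append(rid)
    return still_present, already_gone
```

Anything that ends up in `already_gone` was deleted even though the stack still lists it as failed, which is the situation described in this issue.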

dantedallag commented 10 months ago

@hallgeir-osterbo-visma I looked into the correlation id and it appears that the root of the issue is a subnet failing to delete because of child resources still existing in the subnet. Are there resources outside of the stack belonging to a subnet that the stack is trying to delete?

The fact that this error is not being properly communicated to you is definitely a bug we will file a fix for. If you would like to discuss more privately about the particular failures for this deployment, we can set up communication with you via email.

hallgeir-osterbo-visma commented 10 months ago

@dantedallag That's interesting indeed. But we don't create or update a subnet or VNET as part of this deployment stack. However, we do reference a VNET (using resource whatever "..." existing = {). The VNET is created in a different deployment stack.

Could it be it's attempting to delete referenced resources as well? After all it does list the VNET in the deployment stack. If that is the case, that's a bit scary, as I'd expect only resources that are created by that particular deployment stack would be deleted by destroying the stack.

Is this so? Is it a bug? If it's intended, is there any way to avoid attempting to delete resources that are only referenced with existing and not created by the stack?

dantedallag commented 10 months ago

@hallgeir-osterbo-visma: For the original failure, was the stack that is being updated created recently? I ask because we did have a known bug with deletion of referenced resources, which could be showing up if the original stack is old enough. Otherwise, this may be something different that we would like to investigate further. The stack should not be attempting to delete referenced resources.

As for your reproduction, would it be possible to get a correlation id for what you think may be a similar failure? I would like to confirm that the underlying failures of these two scenarios are linked.

hallgeir-osterbo-visma commented 10 months ago

@dantedallag The stack was created around the 18th of August (may be one or two days earlier though). Do you think this bug could be what's affecting it? If so, what's the best course of action? Delete the stack without deleting resources?

For the reproduction: I don't have a correlation ID unfortunately -- I've attempted to dig it up but haven't been able to.

dantedallag commented 10 months ago

@hallgeir-osterbo-visma That is recent enough that it shouldn't be an issue. Let us take a closer look at it. I will attempt to do a repro with a similar template to the one you provided.

hallgeir-osterbo-visma commented 10 months ago

@dantedallag That's great! Thank you!

Would it be OK for us to delete the deployment stack and recreate it so that we can resume using stacks in our deployment pipeline? Or do you need us to keep the deployment stack for further troubleshooting?

dantedallag commented 10 months ago

@hallgeir-osterbo-visma Yeah, you can delete it and continue. If you run into any issue that appears to be related, please let us know in this thread!

hallgeir-osterbo-visma commented 9 months ago

It has happened again, this time on a completely new stack. The deployment stack was created yesterday (though resources were already there), and today when we deployed the same bicep template, it failed with the same error.

{
  "code": "DeploymentStackDeleteResourcesFailed",
  "message": "One or more resources could not be deleted. Correlation id: '0d828045-14c3-4756-a2dd-9da16504403a'.",
  "details": [
    {
      "code": "DeploymentStackDeleteResourcesFailed",
      "message": "An unknown error occurred while trying to delete resources. These resources are still present in the stack but can be deleted manually."
    }
  ]
}

One thing we did, which may or may not have triggered this, was to remove some role assignments beforehand that are managed by the deployment stack. I had to do this because we were renaming the role assignments in the bicep code, and attempting to deploy caused conflicts because the same role assignments existed with a different name. I hoped that the deployment stack would remove the role assignments first since we removed them from the bicep, but that didn't seem to be the case.

Interestingly we have another environment where we did the exact same thing, and here everything is OK -- the deployment stack deployment is as green as it can be.

Is there any info in the correlation ID here?
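For anyone collecting these errors from pipeline logs, the correlation ID can be pulled out of a payload like the one above with a few lines of Python. This is a rough sketch rather than anything official: the function names are hypothetical, and the regular expression only assumes the "Correlation id: '...'" wording seen in the messages in this thread.

```python
import json
import re
from typing import Optional

# Matches the "Correlation id: '<guid>'" fragment as it appears in this thread.
CORRELATION_RE = re.compile(r"Correlation id: '([0-9a-fA-F-]{36})'")


def extract_correlation_id(text: str) -> Optional[str]:
    """Return the first correlation ID found in an error message, or None."""
    match = CORRELATION_RE.search(text)
    return match.group(1) if match else None


def summarize_error(payload: str) -> dict:
    """Collect the top-level code, the correlation ID, and nested detail codes.

    Assumes the payload is a JSON object with "code", "message", and an
    optional "details" list, like the errors pasted in this issue.
    """
    error = json.loads(payload)
    return {
        "code": error.get("code"),
        "correlation_id": extract_correlation_id(error.get("message", "")),
        "detail_codes": [d.get("code") for d in error.get("details", [])],
    }
```

`extract_correlation_id` also works on the plain-text form of the error that the CLI prints, since it only looks for the quoted ID.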

azcloudfarmer commented 9 months ago

@hallgeir-osterbo-visma can you please share the templates used for repro?

hallgeir-osterbo-visma commented 9 months ago

Unfortunately I can't because of company policy... I will see if I can make a more minimal repro based on the steps that we did here.

A bit more info: when I inspect the deployment stack, the "Failed" section in the resource list contains only role assignments. So I suspect the bug now is related to the role assignment cleanup we just did.

It would be great to know what's hidden behind the correlation ID.

hallgeir-osterbo-visma commented 9 months ago

Absolutely no luck with reproducing this with a smaller template unfortunately... Perhaps there's some race condition or something that causes this? The template we're deploying as part of our application deployment is quite large, with a lot of nested modules.

hallgeir-osterbo-visma commented 9 months ago

Another error:

{
  "code": "DeploymentStackDeleteResourcesFailed",
  "message": "One or more resources could not be deleted. Correlation id: '9bf823a2-b66c-416d-af94-d291957fa958'.",
  "details": [
    {
      "code": "DeploymentStackDeleteResourcesFailed",
      "message": "An unknown error occurred while trying to delete resources. These resources are still present in the stack but can be deleted manually."
    }
  ]
}

What we did before this failed was conditionally remove some private endpoints from the infrastructure. Could someone with access have a look at what's going on behind the correlation ID?

azcloudfarmer commented 9 months ago

Hello @hallgeir-osterbo-visma we are working on improving the error messaging here. For tracking, this is our internal work item tracking this: https://msazure.visualstudio.com/One/_workitems/edit/25009347

hallgeir-osterbo-visma commented 9 months ago

That's great to hear!

In the meantime, could you (or someone) help me get the actual errors from the correlation IDs I posted? These are the correlation IDs:

azcloudfarmer commented 9 months ago

Hi @hallgeir-osterbo-visma - thanks for the IDs and detail. We will look into these and get back to you within the next 72 hours.

alex-frankel commented 9 months ago

I also got hit by this one, so just adding another correlation ID to the case:

{
  "code": "DeploymentStackDeleteResourcesFailed",
  "message": "One or more resources could not be deleted. Correlation id: '5d290e25-1a57-4b72-8542-a43c84aaf8e9'.",
  "details": [
    {
      "code": "DeploymentStackDeleteResourcesFailed",
      "message": "An unknown error occurred while trying to delete resources. These resources are still present in the stack but can be deleted manually."
    }
  ]
}

PedramRjoo commented 9 months ago

Same here:

{
  "code": "DeploymentStackDeleteResourcesFailed",
  "message": "One or more resources could not be deleted. Correlation id: 'e6e902df-1393-417f-b378-c46246192109'.",
  "details": [
    {
      "code": "DeploymentStackDeleteResourcesFailed",
      "message": "An unknown error occurred while trying to delete resources. These resources are still present in the stack but can be deleted manually."
    }
  ]
}

Edit: Once I re-deployed it, I was able to delete the stack again.

ouldsid commented 8 months ago

I'm trying to repro this issue, but so far I've had no luck.

azcloudfarmer commented 8 months ago

Hi @hallgeir-osterbo-visma we are still working on this issue. @ouldsid and I plan for an update here on this thread by end of week.

RaduG commented 8 months ago

Hi, we have this problem with the same template applied across multiple branches. Clean-up suddenly stopped working and all we see is

An unknown error occurred while trying to delete resources. These resources are still present in the stack but can be deleted manually. (Code: DeploymentStackDeleteResourcesFailed)

~At least some of the~ All resources have actually been deleted but they still show up in the deployment stack.

Correlation ids:

Thanks!

RaduG commented 8 months ago

@azcloudfarmer @ouldsid hey, do you have any updates on this issue?

azcloudfarmer commented 8 months ago

Hi @RaduG, @hallgeir-osterbo-visma - quick update, we are targeting December for the rollout of this fix.

hallgeir-osterbo-visma commented 8 months ago

@azcloudfarmer That's great news!

Out of curiosity, what exactly is being fixed? I don't think I ever saw a response for the correlation IDs. Is it the error messages that will be improved in December?

snarkywolverine commented 8 months ago

@hallgeir-osterbo-visma That is correct - an issue was found that prevented the deployment stack from propagating a more specific error to the user (and on a per-resource basis). With this change, resource-specific errors will be found in the FailedResources property.

hallgeir-osterbo-visma commented 8 months ago

@snarkywolverine That will be very helpful. Looking forward to the fix!

snarkywolverine commented 6 months ago

[Image: screenshot of the improved error output, showing the FailedResources property]

The improved error messaging is now live in all public Azure regions. I've omitted resource-specific information above, but you can see that FailedResources now identifies the resource, and the reason for the failure (in this case, a resource lock).

We'll work to tweak the error text and make it clear that more information is available in FailedResources (since right now it implies the error is unknown).
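Once the per-resource errors are available, a short script can group the failed resources by error code so that the distinct causes stand out. The sketch below is not based on a documented schema: it assumes each FailedResources entry looks like {"id": ..., "error": {"code": ..., "message": ...}}, which is a guess modeled on the per-resource error quoted earlier in this thread, and `group_failures` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List


def group_failures(failed_resources: List[dict]) -> Dict[str, List[str]]:
    """Group failed resource IDs by their error code.

    Assumed entry shape (not confirmed by this thread):
        {"id": "<resource id>", "error": {"code": "...", "message": "..."}}
    Entries without an error code are grouped under "Unknown".
    """
    grouped = defaultdict(list)
    for item in failed_resources or []:
        error = item.get("error") or {}
        grouped[error.get("code", "Unknown")].append(item.get("id"))
    return dict(grouped)
```

Passing in the list taken from the stack's FailedResources property gives one bucket per error code, so a lock-related failure and a generic deletion failure would show up as separate groups.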

hallgeir-osterbo-visma commented 6 months ago

I noticed the improved error messages here the other day, and that has been super helpful already. Thanks a lot for this improvement!