Azure / deployment-stacks

Contains Deployment Stacks CLI scripts and releases
MIT License

Removed managed cluster agent pool was deleted in Azure, but deployment stack reported an error #51

Closed: damienpontifex closed this issue 2 years ago

damienpontifex commented 2 years ago

Describe the bug
I removed a managed cluster (AKS) agent pool from my template and tested to ensure it was cleaned up in Azure. The pool was reported as being in a "Deleting" state in the AKS node pool blade and was then deleted from there. The stack, however, reported "Error: Resource could not be deleted. It is now detached.", with the failed resource being the agent pool that was removed.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy a stack with an AKS cluster having two node pools, one system and one user - the user node pool defined in the ARM template as a Microsoft.ContainerService/managedClusters/agentPools resource, and the deployment stack created with UpdateBehavior = purgeResources
  2. Remove the Microsoft.ContainerService/managedClusters/agentPools resource from the ARM template and update the stack with Set-AzSubscriptionDeploymentStack (see the command sketch after the output below)
  3. Observe in the Azure portal, within the AKS cluster resource ➝ Node Pools, that the user node pool is in a deleting state while the deployment stack operation in pwsh is still running
  4. Notice that the node pool is deleted from the AKS node pools
  5. Observe deployment stack output like:
DetachedResources : /subscriptions/<subscription-id>/resourceGroups/aks-cluster/providers/Microsoft.ContainerService/managedClusters/mycluster/agentPools/linux2
FailedResources   : Id:   /subscriptions/<subscription-id>/resourceGroups/aks-cluster/providers/Microsoft.ContainerService/managedClusters/mycluster/agentPools/linux2
                    Error:  Resource could not be deleted. It is now detached.
DeploymentId      : /subscriptions/<subscription-id>/providers/Microsoft.Resources/deployments/aks-2022-01-07-05-04-03-00851
Error             : One or more stages of deploymentStack update failed.
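
For reference, here is a minimal sketch of the update and inspection steps above. The stack name and template path are placeholders, and the cmdlet and parameter names follow the preview deployment stacks module discussed in this thread, so they may differ in later releases.

# Sketch only: 'aks-stack' and './aks.bicep' are placeholder names.
# Update the stack after removing the agentPools resource from the template;
# with UpdateBehavior purgeResources the now-unmanaged pool should be deleted.
$stack = Set-AzSubscriptionDeploymentStack `
  -Name 'aks-stack' `
  -TemplateFile './aks.bicep' `
  -UpdateBehavior 'purgeResources'

# The failure above surfaces on the returned stack object: the deleted agent
# pool appears in both DetachedResources and FailedResources.
$stack | Format-List DetachedResources, FailedResources, DeploymentId, Error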

Expected behavior
The deployment stack should report success (with no errors), as the resource was successfully deleted from Azure.

Screenshots
N/A

Repro Environment
Host OS: macOS 12.1
PowerShell Version: 7.2.1

Server Debugging Information
Correlation ID: 89b8c220-07d5-4132-810f-1ba73e4fcb8c
Tenant ID: ff2b9041-8733-4fbd-a4e6-23f30567c4a4
Timestamp of issue (stack snapshot creation time): 7/1/2022 5:04:03 AM UTC
Data Center: Australia East

Additional context
N/A

TSunny007 commented 2 years ago

Thanks for filing this! It probably has to do with the underlying delete endpoint we rely on for resource deletion, so the owning team will appreciate this find. I will investigate further and keep you updated.

miqm commented 2 years ago

👍🏻 Confirmed - I stumbled onto this one as well.

snarkywolverine commented 2 years ago

@miqm Do you have a correlation ID (or subscription, stack, and deployment date) we can use to verify the issue is the same in both cases?

miqm commented 2 years ago

Unfortunately not anymore. But I could try to reproduce it.

snarkywolverine commented 2 years ago

@damienpontifex @miqm We managed to track down an issue with the way delete requests were being handled. Can you see if you still encounter the error for this scenario now?

azcloudfarmer commented 2 years ago

@damienpontifex @miqm if you don't have time to validate/verify, can you share a template for us to repro?

miqm commented 2 years ago

@apclouds Sorry, I don't have time to verify it right now :( But here is my sample Bicep code for AKS. On the second deploy, remove one of the userpool resources.

param aksName string

var defaultPool = {
  name: 'system'
  properties: {
    count: 1
    osType: 'Linux'
    mode: 'System'
    vmSize: 'Standard_B2ms'
  }
}
resource aksCluster 'Microsoft.ContainerService/managedClusters@2021-10-01' = {
  name: aksName
  location: resourceGroup().location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: '1.22.4'
    dnsPrefix: 'dnsprefix'
    enableRBAC: true
    nodeResourceGroup: '${aksName}-cluster-rg'
    agentPoolProfiles:[
      union({
        name: defaultPool.name
      }, defaultPool.properties)
    ]
  }
  resource system 'agentPools' = {
    name: defaultPool.name
    properties: defaultPool.properties
  }

  resource userpool1 'agentPools' = {
    name: 'userpool1'
    properties: {
      count: 1
      osType: 'Linux'
      mode: 'User'
      vmSize: 'Standard_B2s'
      availabilityZones: [
        '1'
      ]
    }
  }

  resource userpool2 'agentPools' = {
    name: 'userpool2'
    properties: {
      count: 1
      osType: 'Linux'
      mode: 'User'
      vmSize: 'Standard_B2s'
      availabilityZones: [
        '2'
      ]
    }
  }
}
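
As a rough sketch of how the template above could be driven through the preview cmdlets (the stack name, location, and parameter file are assumptions, and exact parameter names may differ across module versions):

# Assumed names: 'aks-repro' stack, 'australiaeast' location, and a parameter
# file supplying aksName; adjust to your environment.
New-AzSubscriptionDeploymentStack `
  -Name 'aks-repro' `
  -Location 'australiaeast' `
  -TemplateFile './aks.bicep' `
  -TemplateParameterFile './aks.parameters.json' `
  -UpdateBehavior 'purgeResources'

# Second deploy: remove userpool1 or userpool2 from aks.bicep, rerun with the
# same UpdateBehavior, and check whether the removed pool is purged cleanly or
# reported as a failed delete.
Set-AzSubscriptionDeploymentStack `
  -Name 'aks-repro' `
  -TemplateFile './aks.bicep' `
  -TemplateParameterFile './aks.parameters.json' `
  -UpdateBehavior 'purgeResources'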

azcloudfarmer commented 2 years ago

@damienpontifex Apologies for the delay. We were able to deploy the template and follow the steps to remove the node pools through the updateBehavior: purgeResources capability for Deployment Stacks, and the stack ended in a succeeded state. It looks like the issue you ran into has been fixed. Can you verify on your end?

snarkywolverine commented 2 years ago

Based on @apclouds' validation, I'm going to resolve this issue. @damienpontifex or @miqm - if you run into this again, please reactivate it and we'll be happy to take a look.