Azure / deployment-stacks

Contains Deployment Stacks CLI scripts and releases
MIT License

Deployment stack stuck in "Deploying" state #148

Closed pjelar closed 4 months ago

pjelar commented 5 months ago

Describe the bug

I'm rolling out changes to our subscription using deployment stacks. After my initial deployment of a stack, I decided to make some changes to the network layer. The next stack deployment failed, I think because some of the infrastructure couldn't be updated in place and needed to be deleted before being rolled out again. The deployment failed, but the deployment stack stayed in the 'Deploying' state and I can't make progress.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy a virtual network
  2. Make changes to the bicep file with your virtual network and deploy again
  3. The deployment stack will go into a neverending state of Deploying
  4. The deployment will return as failed

Expected behavior

I expect both the deployment stack and the deployment to fail.

Screenshots

[Screenshot 2024-01-29 at 11 34 51]
[Screenshot 2024-01-29 at 11 35 07]

Repro Environment

Host OS: Ubuntu 20
Azure CLI Version: 2.230.0

Server Debugging Information

Correlation ID: aa2d5130-cb23-40a4-b82d-959e82418eff
Subscription ID: a4dd8d99-3cfd-46b0-b0e6-1955e327b228
Timestamp of issue (please include time zone): 28/01/2024, 21:02:25
Data Center (eg, West Central US, West Europe): Norway East

Additional context

ERROR: (DeploymentStackInNonTerminalState) The deployment stack resource '/subscriptions/a4dd8d99-3cfd-46b0-b0e6-1955e327b228/resourceGroups/cryptocust-RG/providers/Microsoft.Resources/deploymentStacks/layer-0' could not be updated as it is currently in a non-terminal state 'Deploying'. Code: DeploymentStackInNonTerminalState

To be clear, I tried to clear the original deployment by wiping the resource group I ran it in, then deployed again, but ended up with the same problem.

The old correlation id is: 7bf2bef5-3377-44f9-8f4d-e8cefc381258

kalbert312 commented 5 months ago

Hi @pjelar,

I believe the issue that results in the stack getting stuck in 'Deploying' is caused by an error during deployment (marked as "Conflict" in the provided screenshot):

{
  "error": {
    "code": "DeploymentActive",
    "message": "Unable to edit or replace deployment 'aks-udr-norwayeast': previous deployment from '1/28/2024 12:23:15 AM' is still active (expiration time is '2/4/2024 12:23:14 AM'). Please see https://aka.ms/arm-deploy-resources for usage details."
  }
}

The above is from the old correlation id and the same error type is present in the other correlation id.

The stack gets stuck in 'Deploying' because of a bug on our side when retrieving error details for the deployment. The service attempts to collect the related errors of resources in the deployment and in nested deployments. Typically, the resource id attached to an error is not the deployment itself, but for this "DeploymentActive" error code it is, and that causes a hang.

It will require a patch on our side to handle this situation when a deployment with the same id is already running.

Until a fix is deployed, are you able to try deploying the stack again after verifying that none of the deployments within the stack are in a running state (e.g. 'Deploying')?

pjelar commented 5 months ago

Hi Kyle,

I did try re-running the deployments in various ways by removing the resource group and starting again.

How can I check if a deployment is still running? I'm in a bit of a chicken and egg that I don't have access to check anything from the cli and the azure portal isn't showing me any deployments after I wiped the resource group. The ones I did see were stuck for over 24hrs so neverending.

I've tried renaming some resources and kicked off another deployment. It now also appears to be stuck in a 'Deploying' state. I will keep you posted!

correlation id: ad309720-d6b7-47e7-9734-db5b7b90adb7
"id": "/subscriptions/a4dd8d99-3cfd-46b0-b0e6-1955e327b228/resourceGroups/cryptocust-RG",

kalbert312 commented 5 months ago

> I'm in a bit of a chicken and egg that I don't have access to check anything from the cli and the azure portal isn't showing me any deployments after I wiped the resource group. The ones I did see were stuck for over 24hrs so neverending.

If there are stuck deployments or stacks, try cancelling first where applicable.

If no other deployments share a name with the deployments within the stack, then deployments in the stack template itself may be clashing with each other by name.

Try checking all Bicep module names and Bicep resources that are deployments in the template for possible overlaps.

For example, one way that this could happen is:

main.bicep

module foo 'foo.bicep' = [for i in range(0, 10): {
  name: 'uniqueName${i}'
}]

foo.bicep

module inner 'naming.bicep' = {
  name: 'isThisUnique' // <----- because of the loop in the parent template, this is a problem
}

One way to solve the above example is to pass in a parameter to the "foo" module that is the i index and include that in the name.
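
As a minimal sketch of that fix, the loop index can be passed down and appended to the inner module name. The parameter name `index` is an assumption made for illustration (it is not in the original template), and `naming.bicep` is assumed to need no parameters, as in the example above.

main.bicep

```bicep
module foo 'foo.bicep' = [for i in range(0, 10): {
  name: 'uniqueName${i}'
  params: {
    // Hypothetical parameter: forwards the loop index to the child module.
    index: i
  }
}]
```

foo.bicep

```bicep
// Hypothetical parameter receiving the loop index from main.bicep.
param index int

module inner 'naming.bicep' = {
  // Each iteration of the parent loop now produces a distinct deployment name,
  // e.g. 'isThisUnique0' through 'isThisUnique9'.
  name: 'isThisUnique${index}'
}
```

With this change, the ten instances of "foo" no longer try to create ten nested deployments that share a single name.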

If there are several layers of modules, cross check all of them and make sure they have unique names.

Another approach could be to use uniqueString in the names. It is important that the seed passed into the function is the same across deployments so module names/resource names stay the same across future stack deployments.
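
A sketch of the uniqueString approach, assuming the module receives a loop index from its parent as a parameter (named `index` here purely for illustration) and is deployed at resource group scope. The `inner-` prefix is also illustrative.

foo.bicep

```bicep
// Hypothetical parameter supplied by the looping parent module.
param index int

// uniqueString is deterministic: the same seed values always yield the same
// 13-character hash, so the name stays stable across stack deployments.
// The seed here is the resource group id plus the loop index.
var innerName = 'inner-${uniqueString(resourceGroup().id, string(index))}'

module inner 'naming.bicep' = {
  name: innerName
}
```

The seed should only contain values that stay constant between runs. A seed derived from something that changes on every deployment, such as a parameter defaulting to `utcNow()`, would produce new names each time.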

snarkywolverine commented 5 months ago

@pjelar We have made a change to handle the bug that Kyle mentioned; that should roll out over the next week or so.

snarkywolverine commented 4 months ago

@pjelar The change has been rolled out to all regions, if you want to try again. Let us know if you have any questions.