Azure / Enterprise-Scale

The Azure Landing Zones (Enterprise-Scale) architecture provides prescriptive guidance coupled with Azure best practices, and it follows design principles across the critical design areas for organizations to define their Azure architecture
https://aka.ms/alz
MIT License

Bug Report: PowerShell deployment runs forever or fails #1793

Open jdrepo opened 1 month ago

jdrepo commented 1 month ago

Describe the bug

Deploying the ALZ reference implementation with PowerShell doesn't work.

Steps to reproduce

  1. New-AzTenantDeployment -Name "ALZ-Deployment-$(Get-Date -Format 'yyyyMMddTHHMM')" -Location "westeurope" -TemplateUri "https://raw.githubusercontent.com/Azure/Enterprise-Scale/2024-09-03/eslzArm/eslzArm.json" -TemplateParameterFile ".\ALZ-Portal-parametersFile.json" -WhatIf
  2. The command prints "Getting the latest status of all resources..." and never returns.

It seems to me that the same error/behaviour occurs in the "Test Portal experience" workflow in this repo?

Screenshots

Image

Is this a general problem with PowerShell deployment at the moment?

Springstone commented 1 month ago

@jdrepo I'm busy testing the release of Policy Refresh and have had many issues today. I don't think it's a PowerShell issue; it looks like the ARM engine is throttling or having issues. Can I ask you to try again after a bit?

There is also another issue currently impacting testing that we're investigating; it's authentication related.

jdrepo commented 1 month ago

@Springstone Thanks for your reply. Yes, I can test it again later. I also had the problem yesterday, and I also think it has nothing to do with PowerShell, as the problem also occurs with Azure CLI deployments. So it is probably a fundamental problem with the ARM backend.

Springstone commented 1 month ago

@jdrepo are you still experiencing issues or can we close the issue?

jdrepo commented 1 month ago

Hi @Springstone, I haven't tried it again yet. I'm currently out of office; I will try it tomorrow and then give you feedback. Please keep the issue open. Thanks

jdrepo commented 1 month ago

Hi @Springstone, I tried again today and the issue still occurs. It seems to me nothing has changed: the deployment runs forever when deploying with PowerShell.

Springstone commented 1 month ago

@jdrepo could you try running the same deployment without -WhatIf (we suspect recent changes in pre-flight processes are causing a problem), or try running it again with the latest release of ALZ?

New-AzTenantDeployment -Name "ALZ-Deployment-$(Get-Date -Format 'yyyyMMddTHHMM')" -Location "swedencentral" -TemplateUri "https://raw.githubusercontent.com/Azure/Enterprise-Scale/2024-10-09/eslzArm/eslzArm.json" -TemplateParameterFile ".\parameters.json" -WhatIf -Debug

If I run the above, it succeeds and takes about 60 seconds.

jdrepo commented 1 month ago

@Springstone Will give it another try. It seems to me that it depends on the template parameter file: I just tried it with a "small" parameter file containing only the enterpriseScaleCompanyPrefix parameter, and the -WhatIf run succeeded. But with a full-blown parameter file (all parameters set, more than 700 lines) the deployment times out again. What did you set in your parameters.json in your last test?

jdrepo commented 4 weeks ago

@Springstone Did some more tests and I think I found the cause: if I change the deployment location to a region other than westeurope, the what-if deployment runs without any problems.

I did a test with swedencentral, eastus and northeurope, like you did, and had no problems (sometimes it took a few minutes, sometimes it ran in under 60 seconds), so it seems to me that there are indeed some issues with the ARM backend in my preferred region, westeurope.

Can you try to run your last deployment against the westeurope region?

Springstone commented 4 weeks ago

@jdrepo yes, looks like it is westeurope that is having issues with whatif. It actually looks like PowerShell hangs, as I can't terminate/break the deployment either, and running with -debug indicates it hangs pretty quickly. Might be related to restrictions in that region? You may want to open a support ticket for this, and we'll see on our end if we can get someone from engineering to investigate.

jdrepo commented 4 weeks ago

@Springstone yes, I can observe the same behaviour. PowerShell hangs and can't be terminated. I left the task running and after approx. 1 hour it came back with a lot of error messages:

Image

I have no idea why the germanywestcentral region is involved when I deploy the template against the westeurope region. Maybe the ARM request is rerouted to this region because I'm located in Germany.

Earlier this morning I tried it against the germanywestcentral region, and that ran without any problem. Now I tried it again and it hangs again, very strange.

I can try to open a support ticket, but I don't know if this environment is covered by a support plan...

`DEBUG: ============================ HTTP RESPONSE ============================

Status Code: OK

Headers:
Cache-Control : no-cache
Pragma : no-cache
x-ms-ratelimit-remaining-tenant-reads: 249
x-ms-request-id : 502dc99b-81a6-4865-89d9-5519e99ef8c1
x-ms-correlation-request-id : 502dc99b-81a6-4865-89d9-5519e99ef8c1
x-ms-routing-request-id : GERMANYWESTCENTRAL:20241015T093338Z:502dc99b-81a6-4865-89d9-5519e99ef8c1
Strict-Transport-Security : max-age=31536000; includeSubDomains
X-Content-Type-Options : nosniff
X-Cache : CONFIG_NOCACHE
X-MSEdge-Ref : Ref A: 7F43E0C7BFB649B7A1D874C76F40E3E0 Ref B: FRA231050415021 Ref C: 2024-10-15T09:33:38Z
Date : Tue, 15 Oct 2024 09:33:38 GMT

Body:
{ "status": "Failed", "error": { "code": "DeploymentWhatIfTimeout", "message": "The request to predict template deployment changes to scope '/' has timed out. Diagnostic information: timestamp '20241015T093337Z', tracking id '57bceef4-2bc1-4523-9dd3-350f66db0e2b', request correlation id 'd69a9d3c-6b34-4fdc-8c59-7f02c3f48e86', location 'germanywestcentral'." } }

DEBUG: 11:33:42 - [ResourceManagerCmdletBase.ExecuteCmdlet] Caught unhandled exception:
Microsoft.Rest.Azure.CloudException: DeploymentWhatIfTimeout - Long running operation failed with status 'Failed'. Additional Info:'The request to predict template deployment changes to scope '/' has timed out. Diagnostic information: timestamp '20241015T093337Z', tracking id '57bceef4-2bc1-4523-9dd3-350f66db0e2b', request correlation id 'd69a9d3c-6b34-4fdc-8c59-7f02c3f48e86', location 'germanywestcentral'.'
   at Microsoft.Azure.Commands.ResourceManager.Cmdlets.SdkClient.NewResourceManagerSdkClient.ExecuteDeploymentWhatIf(PSDeploymentWhatIfCmdletParameters parameters)
   at Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.CmdletBase.DeploymentWhatIfCmdlet.ExecuteWhatIf()
   at Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.CmdletBase.DeploymentCreateCmdlet.OnProcessRecord()
   at Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.ResourceManagerCmdletBase.ExecuteCmdlet()
DEBUG: 11:33:42 - [ConfigManager] Got nothing from [EnableErrorRecordsPersistence], Module = [], Cmdlet = []. Returning default value [False].
New-AzTenantDeployment: DeploymentWhatIfTimeout - Long running operation failed with status 'Failed'. Additional Info:'The request to predict template deployment changes to scope '/' has timed out. Diagnostic information: timestamp '20241015T093337Z', tracking id '57bceef4-2bc1-4523-9dd3-350f66db0e2b', request correlation id 'd69a9d3c-6b34-4fdc-8c59-7f02c3f48e86', location 'germanywestcentral'.'
DEBUG: 11:33:42 - [ConfigManager] Got nothing from [DisplayBreakingChangeWarning], Module = [], Cmdlet = []. Returning default value [True].
DEBUG: 11:33:42 - [ConfigManager] Got nothing from [DisplayRegionIdentified], Module = [], Cmdlet = []. Returning default value [True].
DEBUG: 11:33:42 - [ConfigManager] Got nothing from [CheckForUpgrade], Module = [], Cmdlet = []. Returning default value [True].
DEBUG: AzureQoSEvent: Module: Az.Resources:7.5.0; CommandName: New-AzTenantDeployment; PSVersion: 7.4.5; IsSuccess: False; Duration: 01:02:29.6872120; SanitizeDuration: 00:00:00; Exception: DeploymentWhatIfTimeout - Long running operation failed with status 'Failed'. Additional Info:'The request to predict template deployment changes to scope '/' has timed out. Diagnostic information: timestamp '20241015T093337Z', tracking id '57bceef4-2bc1-4523-9dd3-350f66db0e2b', request correlation id 'd69a9d3c-6b34-4fdc-8c59-7f02c3f48e86', location 'germanywestcentral'.';
DEBUG: 11:33:42 - [ConfigManager] Got [True] from [EnableDataCollection], Module = [], Cmdlet = [].
DEBUG: 11:33:42 - NewAzureTenantDeploymentCmdlet end processing.`
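
For anyone comparing several of these debug logs, the region that actually handled the request can be read from the `x-ms-routing-request-id` header rather than from the `-Location` argument. Below is a minimal Python sketch of that. It assumes the header keeps the `REGION:timestamp:guid` shape seen in this one log, which is an observation from this thread and not a documented contract. The function names `parse_routing_request_id` and `what_if_timed_out` are made up for illustration.

```python
import json
import re
from datetime import datetime, timezone

# Shape observed in the log above (an assumption, not a documented format):
#   REGION:yyyyMMddTHHmmssZ:request-guid
ROUTING_ID = re.compile(
    r"^(?P<region>[A-Z0-9]+):(?P<stamp>\d{8}T\d{6}Z):(?P<request_id>[0-9a-f-]{36})$"
)


def parse_routing_request_id(value):
    """Split an x-ms-routing-request-id value into region, UTC time and request id."""
    match = ROUTING_ID.match(value.strip())
    if match is None:
        raise ValueError(f"unexpected routing request id: {value!r}")
    stamp = datetime.strptime(match["stamp"], "%Y%m%dT%H%M%SZ")
    return {
        "region": match["region"].lower(),
        "timestamp": stamp.replace(tzinfo=timezone.utc),
        "request_id": match["request_id"],
    }


def what_if_timed_out(body_text):
    """Return True if a response body reports the DeploymentWhatIfTimeout error."""
    body = json.loads(body_text)
    error = body.get("error") or {}
    return body.get("status") == "Failed" and error.get("code") == "DeploymentWhatIfTimeout"


if __name__ == "__main__":
    header = "GERMANYWESTCENTRAL:20241015T093338Z:502dc99b-81a6-4865-89d9-5519e99ef8c1"
    print(parse_routing_request_id(header)["region"])
```

Collecting the parsed region per run makes it easier to see whether the timeouts follow the deployment location or the region that served the request.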

Springstone commented 4 weeks ago

@jdrepo not to worry, I've opened an internal engineering ticket to investigate the WHATIF issue. However, I do believe that if you remove the WHATIF flag the deployment will proceed/succeed. If you find otherwise, please do let me know.

jdrepo commented 4 weeks ago

@Springstone yes, as assumed the "real" deployment does start, so it's indeed only an issue with -WhatIf deployments. But during the deployment I now encountered some deployment errors. It seems to me that there is a problem with the management group hierarchy synchronization during the deployment flow. Is that a known issue?

Image

{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","details":[{"code":"InvalidCreatePolicyAssignmentRequest","message":"The policy definition specified in policy assignment 'Deny-MgmtPorts-Internet' is out of scope. Policy definitions should be specified only at or above the policy assignment scope. If the management groups hierarchy changed recently or if assigning a management group policy to new subscription, please allow up to 30 minutes for the hierarchy changes to apply and try again."}]}

Springstone commented 3 weeks ago

@jdrepo any errors that include the text "please allow up to 30 minutes for the hierarchy changes to apply and try again." are related to delays in the policy engine registering the policy. We had a mitigation in place to minimize this, but it seems the ARM/Policy engines are struggling to keep up, so we will be extending the deployment wait time to help minimize this issue.

TL;DR: the policy isn't available for assignment yet. If you re-run the deployment with the same parameters, it will succeed.
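
The re-run advice can be automated in a wrapper script. The following is a minimal Python sketch, not part of ALZ. It treats a failure as retryable only when the error details contain the "please allow up to 30 minutes" hint quoted above. `deploy` is a hypothetical callable that you would supply yourself, wrapping whatever command you use and returning `(ok, error_text)`. The attempt count and wait time are arbitrary placeholders.

```python
import json
import time

# Text taken from the error message quoted in this thread.
RETRY_HINT = "please allow up to 30 minutes for the hierarchy changes to apply"


def is_policy_propagation_error(error_text):
    """Return True if any detail message carries the hierarchy-propagation hint."""
    try:
        error = json.loads(error_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(error, dict):
        return False
    details = error.get("details") or []
    return any(RETRY_HINT in (detail.get("message") or "") for detail in details)


def deploy_with_retry(deploy, attempts=3, wait_seconds=300, sleep=time.sleep):
    """Call deploy() until it succeeds, retrying only propagation errors.

    deploy is a hypothetical callable returning (ok, error_text).
    Returns the 1-based attempt number that succeeded.
    """
    for attempt in range(1, attempts + 1):
        ok, error_text = deploy()
        if ok:
            return attempt
        if attempt == attempts or not is_policy_propagation_error(error_text):
            raise RuntimeError(error_text)
        sleep(wait_seconds)
```

Any other failure is raised immediately, so a genuine template or permission error is not hidden behind repeated retries.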

jdrepo commented 3 weeks ago

@Springstone can you briefly tell me whether, and how, I can extend the deployment wait time?

For the what-if issue: did you get a response from internal engineering? I've opened a support call. Would it be helpful to link these calls?

Springstone commented 3 weeks ago

@jdrepo you can't change the wait time yourself, but we've pushed through a patch to increase the wait time by an additional couple of minutes. Use the latest release (https://github.com/Azure/Enterprise-Scale/tree/2024-10-14).

I am working with engineering on the WHATIF issue but seems something has changed as it suddenly started working this morning. Could you confirm that it is working for you?

jdrepo commented 3 weeks ago

@Springstone isn't it possible to change the wait time by modifying the "delayCount" parameter in the template parameter file?

Did another WHATIF deployment against the "westeurope" region and the issue still occurs. What I can't understand: if I start the deployment against the "westeurope" region, why do I get an error message mentioning the "switzerlandnorth" region?

Image

Image

Springstone commented 1 week ago

@jdrepo yes, you can increase the delayCount. I do see it in the portal deployment parameters file; for testing I've been working with other template parameter files that don't include those parameters :)
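
As a rough illustration of changing that parameter without hand-editing a 700-line file, here is a small Python sketch. It assumes the usual ARM parameter file layout, where each entry sits under `parameters` as `{"value": ...}`. The helper name `set_parameter` is hypothetical, and this thread does not say which values `delayCount` accepts, so check the template before choosing a number.

```python
import json


def set_parameter(parameter_file_text, name, value):
    """Return the parameter file JSON with parameters[name] set to {"value": value}.

    Assumes the standard ARM parameter file layout; any existing entry for
    name is replaced, all other parameters are left untouched.
    """
    document = json.loads(parameter_file_text)
    parameters = document.setdefault("parameters", {})
    parameters[name] = {"value": value}
    return json.dumps(document, indent=2)
```

Reading the file, calling `set_parameter(text, "delayCount", new_value)` with a value of your choosing, and writing the result to a copy keeps the original parameter file intact for comparison.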

I've confirmed with PG this is a transient issue, which is why it sometimes works and sometimes doesn't (very inconsistent). I ran this on the weekend and 9/10 worked fine, one time it failed with a similar error.

I wouldn't worry about the SwitzerlandNorth, as it could be that an RP or part of ARM is running from there - the error message doesn't indicate an actual issue with deployment, just that it's a long running operation (basically it's timed out).