Open jdrepo opened 1 month ago
@jdrepo I'm busy testing the release of Policy Refresh and have had many issues today. I don't think it's a PowerShell issue, looks like the ARM engine is throttling or having issues. Can I ask you to try again after a bit?
There is another issue currently impacting testing that we're investigating; it's authentication related.
@Springstone Thanks for your reply. Yes, I can test it again later. I also had the problem yesterday, and I also think it has nothing to do with PowerShell, as the problem also occurs with Azure CLI deployments. So it is probably a fundamental problem with the ARM backend.
@jdrepo are you still experiencing issues or can we close the issue?
Hi @Springstone, I haven't tried it again yet. I'm currently out of office; I'll try it tomorrow and give feedback then. Please keep the issue open. Thanks
Hi @Springstone, I tried again today and the issues still occur. It seems to me nothing has changed: the deployment runs forever when deploying with PowerShell.
@jdrepo could you try running the same deployment without -WhatIf (we suspect recent changes in pre-flight processes are causing a problem), or try running it again with the latest release of ALZ?
New-AzTenantDeployment -Name "ALZ-Deployment-$(Get-Date -Format 'yyyyMMddTHHMM')" -Location "swedencentral" -TemplateUri "https://raw.githubusercontent.com/Azure/Enterprise-Scale/2024-10-09/eslzArm/eslzArm.json" -TemplateParameterFile ".\parameters.json" -WhatIf -Debug
If I run the above it succeeds, takes about 60 seconds.
@Springstone Will give it another try. It seems to me that it depends on the template parameter file: I just tried it with a "small" parameter file with only the `enterpriseScaleCompanyPrefix` parameter in it, and then the -WhatIf run succeeds. But with a full-blown parameter file (all parameters set, more than 700 lines) the deployment times out again.
What did you set in your `parameters.json` in your last test?
@Springstone Did some more tests and I think I found the cause: if I change the deployment location to a region other than `westeurope`, the what-if deployment runs without any problems.
I did a test with `swedencentral`, `eastus`, or `northeurope` like you did and had no problems (sometimes it took a few minutes, sometimes it ran in under 60 seconds), so it seems to me that there are indeed some issues with the ARM backend in my preferred region, `westeurope`.
Can you try to run your last deployment against the `westeurope` region?
@jdrepo yes, it looks like it is `westeurope` that is having issues with what-if. It actually looks like PowerShell hangs, as I can't terminate/break the deployment either, and running with -Debug indicates it hangs pretty quickly. Might be related to restrictions in that region?
You may want to open a support ticket for this, and we'll see on our end if we can get someone from engineering to investigate.
@Springstone yes, I can observe the same behaviour. PowerShell hangs and can't be terminated. I left the task running, and after approx. 1 hour it came back with a lot of error messages.
I have no idea why the `germanywestcentral` region is involved when I deploy the template against the `westeurope` region.
Maybe it reroutes the ARM request to this region because I'm located in Germany.
Earlier this morning I tried it against the `germanywestcentral` region, and that ran without any problem.
Now I tried it again and it hangs again, very strange.
I can try to open a support ticket, but I don't know if this environment is covered by a support plan...
`DEBUG: ============================ HTTP RESPONSE ============================
Status Code: OK
Headers: Cache-Control : no-cache Pragma : no-cache x-ms-ratelimit-remaining-tenant-reads: 249 x-ms-request-id : 502dc99b-81a6-4865-89d9-5519e99ef8c1 x-ms-correlation-request-id : 502dc99b-81a6-4865-89d9-5519e99ef8c1 x-ms-routing-request-id : GERMANYWESTCENTRAL:20241015T093338Z:502dc99b-81a6-4865-89d9-5519e99ef8c1 Strict-Transport-Security : max-age=31536000; includeSubDomains X-Content-Type-Options : nosniff X-Cache : CONFIG_NOCACHE X-MSEdge-Ref : Ref A: 7F43E0C7BFB649B7A1D874C76F40E3E0 Ref B: FRA231050415021 Ref C: 2024-10-15T09:33:38Z Date : Tue, 15 Oct 2024 09:33:38 GMT
Body: { "status": "Failed", "error": { "code": "DeploymentWhatIfTimeout", "message": "The request to predict template deployment changes to scope '/' has timed out. Diagnostic information: timestamp '20241015T093337Z', tracking id '57bceef4-2bc1-4523-9dd3-350f66db0e2b', request correlation id 'd69a9d3c-6b34-4fdc-8c59-7f02c3f48e86', location 'germanywestcentral'." } }
DEBUG: 11:33:42 - [ResourceManagerCmdletBase.ExecuteCmdlet] Caught unhandled exception: Microsoft.Rest.Azure.CloudException:
DeploymentWhatIfTimeout - Long running operation failed with status 'Failed'. Additional Info:'The request to predict template deployment changes to scope '/' has timed out. Diagnostic information: timestamp '20241015T093337Z', tracking id '57bceef4-2bc1-4523-9dd3-350f66db0e2b', request correlation id 'd69a9d3c-6b34-4fdc-8c59-7f02c3f48e86', location 'germanywestcentral'.'
at Microsoft.Azure.Commands.ResourceManager.Cmdlets.SdkClient.NewResourceManagerSdkClient.ExecuteDeploymentWhatIf(PSDeploymentWhatIfCmdletParameters parameters)
at Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.CmdletBase.DeploymentWhatIfCmdlet.ExecuteWhatIf()
at Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.CmdletBase.DeploymentCreateCmdlet.OnProcessRecord()
at Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.ResourceManagerCmdletBase.ExecuteCmdlet()
DEBUG: 11:33:42 - [ConfigManager] Got nothing from [EnableErrorRecordsPersistence], Module = [], Cmdlet = []. Returning default value [False].
New-AzTenantDeployment:
DeploymentWhatIfTimeout - Long running operation failed with status 'Failed'. Additional Info:'The request to predict template deployment changes to scope '/' has timed out. Diagnostic information: timestamp '20241015T093337Z', tracking id '57bceef4-2bc1-4523-9dd3-350f66db0e2b', request correlation id 'd69a9d3c-6b34-4fdc-8c59-7f02c3f48e86', location 'germanywestcentral'.'
DEBUG: 11:33:42 - [ConfigManager] Got nothing from [DisplayBreakingChangeWarning], Module = [], Cmdlet = []. Returning default value [True].
DEBUG: 11:33:42 - [ConfigManager] Got nothing from [DisplayRegionIdentified], Module = [], Cmdlet = []. Returning default value [True].
DEBUG: 11:33:42 - [ConfigManager] Got nothing from [CheckForUpgrade], Module = [], Cmdlet = []. Returning default value [True].
DEBUG: AzureQoSEvent: Module: Az.Resources:7.5.0; CommandName: New-AzTenantDeployment; PSVersion: 7.4.5; IsSuccess: False; Duration: 01:02:29.6872120; SanitizeDuration: 00:00:00; Exception:
DeploymentWhatIfTimeout - Long running operation failed with status 'Failed'. Additional Info:'The request to predict template deployment changes to scope '/' has timed out. Diagnostic information: timestamp '20241015T093337Z', tracking id '57bceef4-2bc1-4523-9dd3-350f66db0e2b', request correlation id 'd69a9d3c-6b34-4fdc-8c59-7f02c3f48e86', location 'germanywestcentral'.';
DEBUG: 11:33:42 - [ConfigManager] Got [True] from [EnableDataCollection], Module = [], Cmdlet = [].
DEBUG: 11:33:42 - NewAzureTenantDeploymentCmdlet end processing.`
@jdrepo not to worry, I've opened an internal engineering ticket to investigate the WHATIF issue. However, I do believe that if you remove the WHATIF flag the deployment will proceed/succeed. If you find otherwise, please do let me know.
@Springstone yes, as assumed the "real" deployment does start, so it's indeed only an issue with -WhatIf deployments. But during the deployment I now encountered some deployment errors. It seems to me that there is a problem with the management group hierarchy synchronization during the deployment flow. Is that a known issue?
{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","details":[{"code":"InvalidCreatePolicyAssignmentRequest","message":"The policy definition specified in policy assignment 'Deny-MgmtPorts-Internet' is out of scope. Policy definitions should be specified only at or above the policy assignment scope. If the management groups hierarchy changed recently or if assigning a management group policy to new subscription, please allow up to 30 minutes for the hierarchy changes to apply and try again."}]}
@jdrepo any errors that include the text `please allow up to 30 minutes for the hierarchy changes to apply and try again.` are related to delays in the policy engine registering the policy. We had a mitigation in place to minimize this, but it seems the ARM/Policy engines are struggling to keep up, so we will be extending the deployment wait time to help minimize this issue.
TL;DR: the policy isn't available for assignment yet. If you re-run the deployment with the same parameters, it will succeed.
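A minimal sketch of that re-run, for anyone who wants to script it. `Invoke-WithRetry` is a hypothetical helper, not part of ALZ or the Az module, and the 10-minute default delay is an arbitrary choice. It also assumes the nested "hierarchy changes" detail text surfaces in the exception message; if it does not, adjust the pattern or list the deployment operations instead.

```powershell
# Hypothetical helper (not part of ALZ or Az): re-run a script block when the
# error message matches a pattern, sleeping between attempts.
function Invoke-WithRetry {
    param(
        [Parameter(Mandatory)] [scriptblock] $Action,
        [string] $RetryPattern = 'hierarchy changes to apply',
        [int] $MaxAttempts = 3,
        [int] $DelaySeconds = 600   # arbitrary default, tune as needed
    )
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            # Success: hand the action's output back to the caller.
            return & $Action
        }
        catch {
            $message = $_.Exception.Message
            # Rethrow when out of attempts or when the error is unrelated.
            if ($attempt -eq $MaxAttempts -or $message -notmatch $RetryPattern) {
                throw
            }
            Write-Warning "Attempt $attempt hit a propagation delay; retrying in $DelaySeconds seconds."
            Start-Sleep -Seconds $DelaySeconds
        }
    }
}

# Usage (commented out; $templateUri is assumed to hold the eslzArm.json URL
# of the release being deployed):
# Invoke-WithRetry -Action {
#     New-AzTenantDeployment -Name "ALZ-Deployment-$(Get-Date -Format 'yyyyMMddTHHmm')" `
#         -Location 'westeurope' -TemplateUri $templateUri `
#         -TemplateParameterFile '.\parameters.json' -ErrorAction Stop
# }
```

`-ErrorAction Stop` matters here: without it a failed deployment may only write a non-terminating error, which the `catch` block would never see.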
@Springstone can you briefly tell me whether and how I can extend the deployment wait time?
For the what-if issue: did you get a response from internal engineering? I've opened a support call; would it be helpful to link these calls?
@jdrepo you can't change the wait time yourself, but we've pushed through a patch to increase the wait time by an additional couple of minutes. Please use the latest release (https://github.com/Azure/Enterprise-Scale/tree/2024-10-14).
I am working with engineering on the WHATIF issue but seems something has changed as it suddenly started working this morning. Could you confirm that it is working for you?
@Springstone isn't it possible to change the wait time if I modify the `delayCount` parameter in the template parameter file?
I did another what-if deployment against the `westeurope` region and the issue still occurs. What I can't understand: if I start the deployment against the `westeurope` region, why do I get an error message mentioning the `switzerlandnorth` region?
@jdrepo yes, you can increase the `delayCount`. I did see it in the portal deployment parameters file. I've been working with other template param files :) for testing that don't include those parameters.
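As a rough illustration of bumping `delayCount`, the sketch below edits a parameter file in place. `Set-DelayCount` is a hypothetical helper, and it assumes the file follows the usual ARM layout of `parameters.<name>.value` and that `delayCount` takes an integer; the value 30 in the usage comment is only a placeholder, not a recommended setting.

```powershell
# Hypothetical helper: set (or add) the delayCount parameter in an ARM
# parameter file. Assumes the standard { "parameters": { name: { "value": ... } } } layout.
function Set-DelayCount {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [int] $DelayCount
    )
    $params = Get-Content -Path $Path -Raw | ConvertFrom-Json

    if ($params.parameters.PSObject.Properties.Name -contains 'delayCount') {
        # Parameter already present: overwrite its value.
        $params.parameters.delayCount.value = $DelayCount
    }
    else {
        # Parameter missing (e.g. a trimmed-down test file): add it.
        $params.parameters | Add-Member -NotePropertyName 'delayCount' `
            -NotePropertyValue ([pscustomobject]@{ value = $DelayCount })
    }

    # -Depth 100 keeps nested parameter objects from being truncated.
    $params | ConvertTo-Json -Depth 100 | Set-Content -Path $Path
}

# Usage (commented out; 30 is a placeholder value):
# Set-DelayCount -Path '.\ALZ-Portal-parametersFile.json' -DelayCount 30
```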
I've confirmed with PG this is a transient issue, which is why it sometimes works and sometimes doesn't (very inconsistent). I ran this on the weekend and 9/10 worked fine, one time it failed with a similar error.
I wouldn't worry about the SwitzerlandNorth, as it could be that an RP or part of ARM is running from there - the error message doesn't indicate an actual issue with deployment, just that it's a long running operation (basically it's timed out).
Describe the bug
Deploying the ALZ reference implementation with PowerShell doesn't work
Steps to reproduce
New-AzTenantDeployment -Name "ALZ-Deployment-$(Get-Date -Format 'yyyyMMddTHHMM')" -Location "westeurope" -TemplateUri "https://raw.githubusercontent.com/Azure/Enterprise-Scale/2024-09-03/eslzArm/eslzArm.json" -TemplateParameterFile ".\ALZ-Portal-parametersFile.json" -WhatIf
Getting the latest status of all resources...
forever
It seems to me that the same error/behaviour occurs in the "Test Portal experience" workflow in this repo?
Screenshots
Is this a general problem with PowerShell deployments at the moment?