Azure / CanadaPubSecALZ

This reference implementation is based on Cloud Adoption Framework for Azure and provides an opinionated implementation that enables ITSG-33 regulatory compliance by using NIST SP 800-53 Rev. 4 and Canada Federal PBMM Regulatory Compliance Policy Sets.
MIT License

alz-05 (machinelearning-1) deployment fails #380

Closed: skeeler closed this issue 1 year ago

skeeler commented 1 year ago

Deployment for the alz-05 (machinelearning-1) subscription fails with the following error:

Deploying G:\Azure\CanadaPubSecALZ\config\subscriptions\CanadaPubSecALZ-main\DevTest\populated-f881fccb-2598-4b9c-b87c-b392f5e16f12_machinelearning_canadacentral.json to f881fccb-2598-4b9c-b87c-b392f5e16f12 in canadacentral
alz-05 (machinelearning-1) (f881fccb-25… 8b55e126-5261-488d-a427-39… alz-05 (machinelearning-1)  AzureCloud                 0d466ba2-7ea1-420f-9820-2…

New-AzDeployment: G:\Azure\CanadaPubSecALZ\scripts\deployments\Functions\Subscriptions.ps1:109
Line |
 109 |      New-AzSubscriptionDeployment `
     |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     | 2:06:45 PM - The deployment 'main-canadacentral' failed with error(s). Showing 2 out of 2 error(s). Status Message: Identity operation for
     | resource
     | '/subscriptions/f881fccb-2598-4b9c-b87c-b392f5e16f12/resourceGroups/azmlsqlauth-compute/providers/Microsoft.ContainerService/managedClusters/aks4ucg27ijruxoy' failed with error 'Failed to perform resource identity operation. Status: 'Conflict'. Response: '{"error":{"code":"Conflict","message":"Request specified that resource '/subscriptions/f881fccb-2598-4b9c-b87c-b392f5e16f12/resourcegroups/azmlsqlauth-compute/providers/Microsoft.ContainerService/managedClusters/aks4ucg27ijruxoy' is new, but resource already exists. This may be due to a pending delete operation, try again later."}}'.'. (Code:FailedIdentityOperation)  Status Message: At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details. (Code: DeploymentFailed)  - The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'. (Code: ResourceDeploymentFailure)    - At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details. (Code: DeploymentFailed)      - Identity operation for resource '/subscriptions/f881fccb-2598-4b9c-b87c-b392f5e16f12/resourceGroups/azmlsqlauth-compute/providers/Microsoft.ContainerService/managedClusters/aks4ucg27ijruxoy' failed with error 'Failed to perform resource identity operation. Status: 'Conflict'. Response: '{"error":{"code":"Conflict","message":"Request specified that resource '/subscriptions/f881fccb-2598-4b9c-b87c-b392f5e16f12/resourcegroups/azmlsqlauth-compute/providers/Microsoft.ContainerService/managedClusters/aks4ucg27ijruxoy' is new, but resource already exists. This may be due to a pending delete operation, try again later."}}'.'. (Code:FailedIdentityOperation)     CorrelationId: d97a178b-5de7-4c3b-b28e-752886e81fea

Resolution: remove the existing Azure ML resource, then re-run the deployment script, pipeline, or workflow for this subscription.
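
The error text above repeats the same codes at several nesting levels, which makes it hard to see what actually failed. A minimal Python sketch for listing the distinct codes, assuming the message has already been captured as a string. `extract_error_codes` is a hypothetical helper for illustration and is not part of the repository's scripts:

```python
import re

# Matches "(Code:Foo)" and "(Code: Foo)" as they appear in the logs, and also
# tolerates the "     | " gutter PowerShell inserts when it wraps a long line.
CODE_PATTERN = re.compile(r"\(Code:[\s|]*(\w+)\)")


def extract_error_codes(message: str) -> list[str]:
    """Return the distinct error codes in the order they first appear."""
    # dict.fromkeys keeps insertion order while dropping duplicates.
    return list(dict.fromkeys(CODE_PATTERN.findall(message)))
```

Applied to the first failure in this issue, this would reduce the output to `FailedIdentityOperation`, `DeploymentFailed` and `ResourceDeploymentFailure`.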

skeeler commented 1 year ago

New deployment failure/error:

Deploying G:\Azure\CanadaPubSecALZ\config\subscriptions\CanadaPubSecALZ-main\DevTest\populated-f881fccb-2598-4b9c-b87c-b392f5e16f12_machinelearning_canadacentral.json to f881fccb-2598-4b9c-b87c-b392f5e16f12 in canadacentral
alz-05 (machinelearning-1) (f881fccb-25… 8b55e126-5261-488d-a427-39… alz-05 (machinelearning-1)  AzureCloud                 0d466ba2-7ea1-420f-9820-2…

New-AzDeployment: G:\Azure\CanadaPubSecALZ\scripts\deployments\Functions\Subscriptions.ps1:109
Line |
 109 |      New-AzSubscriptionDeployment `
     |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     | An error occurred while sending the request.

StatusCode     : 
TargetSite     : Void HandleException(System.Runtime.ExceptionServices.ExceptionDispatchInfo)
Message        : An error occurred while sending the request.
Data           : {}
InnerException : System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host..
                  ---> System.Net.Sockets.SocketException (10054): An existing connection was forcibly closed by the remote host.
                    --- End of inner exception stack trace ---
                    at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
                    at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource<System.Int32>.GetResult(Int16 token)
                    at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](CancellationToken cancellationToken)
                    at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
                    at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](Memory`1 buffer, CancellationToken cancellationToken)
                    at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
                    at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)

Re-running to determine whether this error is transient.
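
One way to tell a transient connection reset from a persistent failure is to retry a few times with increasing delays. The sketch below is illustrative only: the repository drives deployments from PowerShell, so `retry_transient` and the callable it wraps are hypothetical, and the attempt count and delays are assumed values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_transient(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on connection errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            # ConnectionResetError (e.g. socket error 10054) is a subclass.
            if attempt == attempts:
                raise  # still failing after the last attempt: not transient
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

If the call succeeds on a later attempt, the failure was most likely transient. If the final attempt still raises, the error needs separate investigation.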

skeeler commented 1 year ago

Now getting an error because the previously deleted workspace is still retained in a soft-deleted state:

Deploying G:\Azure\CanadaPubSecALZ\config\subscriptions\CanadaPubSecALZ-main\DevTest\populated-f881fccb-2598-4b9c-b87c-b392f5e16f12_machinelearning_canadacentral.json to f881fccb-2598-4b9c-b87c-b392f5e16f12 in canadacentral
alz-05 (machinelearning-1) (f881fccb-25… 8b55e126-5261-488d-a427-39… alz-05 (machinelearning-1)  AzureCloud                 0d466ba2-7ea1-420f-9820-2…

New-AzDeployment: G:\Azure\CanadaPubSecALZ\scripts\deployments\Functions\Subscriptions.ps1:109
Line |
 109 |      New-AzSubscriptionDeployment `
     |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     | 4:45:03 PM - The deployment 'main-canadacentral' failed with error(s). Showing 3 out of 3 error(s). Status Message: Unable to edit or replace
     | deployment 'deploy-aks-kubenet': previous deployment from '7/18/2023 8:40:40 PM' is still active (expiration time is '7/25/2023 6:55:33 PM').
     | Please see https://aka.ms/arm-deploy-resources for usage details. (Code:DeploymentActive)  Status Message: Soft-deleted workspace exists.
     | Please purge or recover it. https://aka.ms/wsoftdelete (Code:BadRequest)  Status Message: At least one resource deployment operation failed.
     | Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details. (Code: DeploymentFailed)
     | - The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'. (Code:
     | ResourceDeploymentFailure)    - At least one resource deployment operation failed. Please list deployment operations for details. Please see
     | https://aka.ms/arm-deployment-operations for usage details. (Code: DeploymentFailed)      - The resource write operation failed to complete
     | successfully, because it reached terminal provisioning state 'Failed'. (Code: ResourceDeploymentFailure)        - Soft-deleted workspace
     | exists. Please purge or recover it. https://aka.ms/wsoftdelete (Code:BadRequest)      CorrelationId: 48f44041-e890-41b9-803e-469b70168458

Follow the guidance in "Manage soft deleted workspaces" to purge the workspace, then re-run.
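
The `DeploymentActive` error in the same output also reports when the blocking `deploy-aks-kubenet` deployment started and when it expires. A small Python sketch for reading those two timestamps out of the message, assuming the month/day/year 12-hour format shown in the log. The log does not state a time zone, so the values are treated as naive datetimes, and `parse_active_deployment` is a hypothetical helper:

```python
import re
from datetime import datetime

ACTIVE_PATTERN = re.compile(
    r"previous deployment from '([^']+)' is still active "
    r"\(expiration time is '([^']+)'\)"
)
# Format as printed in the log, e.g. 7/18/2023 8:40:40 PM
TIMESTAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_active_deployment(message: str):
    """Return (started, expires) datetimes, or None if the message has no match."""
    match = ACTIVE_PATTERN.search(message)
    if match is None:
        return None
    started, expires = (
        datetime.strptime(text, TIMESTAMP_FORMAT) for text in match.groups()
    )
    return started, expires
```

For the message above, the gap between the two timestamps is 6 days, 22 hours, 14 minutes and 53 seconds, so waiting for the earlier deployment to expire on its own is unlikely to be practical.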

skeeler commented 1 year ago

Now, some sort of timeout error:

Deploying G:\Azure\CanadaPubSecALZ\config\subscriptions\CanadaPubSecALZ-main\DevTest\populated-f881fccb-2598-4b9c-b87c-b392f5e16f12_machinelearning_canadacentral.json to f881fccb-2598-4b9c-b87c-b392f5e16f12 in canadacentral
alz-05 (machinelearning-1) (f881fccb-25… 8b55e126-5261-488d-a427-39… alz-05 (machinelearning-1)  AzureCloud                 0d466ba2-7ea1-420f-9820-2…

New-AzDeployment: G:\Azure\CanadaPubSecALZ\scripts\deployments\Functions\Subscriptions.ps1:109
Line |
 109 |      New-AzSubscriptionDeployment `
     |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     | The operation was canceled.

CancellationToken : System.Threading.CancellationToken
TargetSite        : Void HandleException(System.Runtime.ExceptionServices.ExceptionDispatchInfo)
Message           : The operation was canceled.
Data              : {}
InnerException    : 
HelpLink          : 
Source            : Microsoft.Azure.PowerShell.Cmdlets.ResourceManager
HResult           : -2146233029
StackTrace        :    at Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.ResourceManagerCmdletBase.HandleException(ExceptionDispatchInfo capturedException)
                       at Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.ResourceManagerCmdletBase.ExecuteCmdlet()
                       at Microsoft.WindowsAzure.Commands.Utilities.Common.AzurePSCmdlet.ProcessRecord()

Elapsed time: 01:15:09.7283616

Let's try running from the GitHub workflow next...

skeeler commented 1 year ago

Next error:

New-AzDeployment: /home/runner/work/CanadaPubSecALZ/CanadaPubSecALZ/scripts/deployments/Functions/Subscriptions.ps1:109
Line |
 109 |      New-AzSubscriptionDeployment `
     |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     | 13:26:25 - The deployment 'main-canadacentral' failed with error(s).
     | Showing 3 out of 4 error(s). Status Message: This Private Endpoint
     | /subscriptions/f881fccb-2598-4b9c-b87c-b392f5e16f12/resourceGroups/azmlsqlauth-compute/providers/Microsoft.Network/privateEndpoints/aml4ucg27ijruxoy-endpoint can not be updated since it's in disconnected state. Please delete it and create a new one. (Code: PrivateEndpointCannotBeUpdatedInDisconnectedState)   Status Message: At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details. (Code: DeploymentFailed)  - The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'. (Code: ResourceDeploymentFailure)    - At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details. (Code: DeploymentFailed)      - This Private Endpoint /subscriptions/f881fccb-2598-4b9c-b87c-b392f5e16f12/resourceGroups/azmlsqlauth-compute/providers/Microsoft.Network/privateEndpoints/aml4ucg27ijruxoy-endpoint can not be updated since it's in disconnected state. Please delete it and create a new one. (Code: PrivateEndpointCannotBeUpdatedInDisconnectedState)      Status Message: At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details. (Code: DeploymentFailed)  - Identity operation for resource '/subscriptions/f881fccb-2598-4b9c-b87c-b392f5e16f12/resourceGroups/azmlsqlauth-compute/providers/Microsoft.ContainerService/managedClusters/aks4ucg27ijruxoy' failed with error 'Failed to perform resource identity operation. Status: 'Conflict'. Response: '{"error":{"code":"Conflict","message":"Request specified that resource '/subscriptions/f881fccb-2598-4b9c-b87c-b392f5e16f12/resourcegroups/azmlsqlauth-compute/providers/Microsoft.ContainerService/managedClusters/aks4ucg27ijruxoy' is new, but resource already exists. 
This may be due to a pending delete operation, try again later."}}'.'. (Code:FailedIdentityOperation)   CorrelationId: 2437437f-484c-4182-9c68-da51cd8a42f7

Error: Process completed with exit code 1.

skeeler commented 1 year ago

Next error:

New-AzDeployment: /home/runner/work/CanadaPubSecALZ/CanadaPubSecALZ/scripts/deployments/Functions/Subscriptions.ps1:109
Line |
 109 |      New-AzSubscriptionDeployment `
     |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     | 16:31:46 - The deployment 'main-canadacentral' failed with error(s).
     | Showing 3 out of 3 error(s). Status Message: At least one resource
     | deployment operation failed. Please list deployment operations for
     | details. Please see https://aka.ms/arm-deployment-operations for usage
     | details. (Code: DeploymentFailed)  - The resource write operation
     | failed to complete successfully, because it reached terminal
     | provisioning state 'Failed'. (Code: ResourceDeploymentFailure)    -
     | Soft-deleted workspace exists. Please purge or recover it.
     | https://aka.ms/wsoftdelete (Code:BadRequest)    Status Message: At least
     | one resource deployment operation failed. Please list deployment
     | operations for details. Please see
     | https://aka.ms/arm-deployment-operations for usage details. (Code:
     | DeploymentFailed)  - The resource write operation failed to complete
     | successfully, because it reached terminal provisioning state 'Failed'.
     | (Code: ResourceDeploymentFailure)    - At least one resource deployment
     | operation failed. Please list deployment operations for details. Please
     | see https://aka.ms/arm-deployment-operations for usage details. (Code:
     | DeploymentFailed)      - The resource write operation failed to
     | complete successfully, because it reached terminal provisioning state
     | 'Failed'. (Code: ResourceDeploymentFailure)        - Soft-deleted
     | workspace exists. Please purge or recover it. https://aka.ms/wsoftdelete
     | (Code:BadRequest)      Status Message: The resource provision operation
     | did not complete within the allowed timeout period.
     | (Code:ResourceDeploymentFailure)  CorrelationId:
     | eeebac2d-958a-4f48-97bb-fddc57d3a67d

Error: Process completed with exit code 1.

skeeler commented 1 year ago

Going around in circles with this one. Closing and opening a new issue to track all four machine learning archetype deployments.
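
For reference, the distinct failures seen in this thread and the actions suggested for them (by the error messages themselves or by the comments above) can be collected in one lookup. This is a rough Python sketch, not a definitive troubleshooting guide: `suggest_actions` is a hypothetical helper, and the "wait" wording for the active-deployment case is an assumption, since the thread does not record a fix for it.

```python
import re

# (substring to look for, suggested action); wording taken from this thread
# unless marked as an assumption.
KNOWN_FAILURES = [
    ("is new, but resource already exists",
     "Remove the existing resource (or wait for the pending delete), then re-run."),
    ("soft-deleted workspace exists",
     "Purge or recover the soft-deleted workspace, then re-run."),
    ("can not be updated since it's in disconnected state",
     "Delete the private endpoint and create a new one."),
    ("is still active",
     "Wait for the previous deployment to finish or expire (assumption)."),
]


def normalize(message: str) -> str:
    """Remove PowerShell's wrap gutter ('     | ') and collapse whitespace."""
    unwrapped = re.sub(r"\n\s*\|\s?", " ", message)
    return " ".join(unwrapped.split()).lower()


def suggest_actions(message: str) -> list[str]:
    """Return the suggested action for each known failure found in message."""
    text = normalize(message)
    return [action for needle, action in KNOWN_FAILURES if needle in text]
```

Matching on message text rather than on the code is deliberate here, because a generic code such as `BadRequest` does not identify the soft-delete case on its own.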