Azure / AKS-Construction

Accelerate your onboarding to AKS with; Helper Web App, bicep templating and CI/CD samples. Flexible & secure AKS baseline implementations in a Microsoft + community maintained reference implementation.
https://azure.github.io/AKS-Construction/
MIT License
356 stars 165 forks source link

CI/CD Deployment testing failure #648

Open Gordonby opened 1 year ago

Gordonby commented 1 year ago

Describe the bug During full deployment tests in the CI/CD pipeline, we get an error because of the state of the environment we're deploying to.

To Reproduce

  1. Tag a PR with either; test-deploy-byoconfig test-deploy-privateconfig
  2. Wait for checks to run and fail.

Expected behavior Environment considerations are properly reset so deployment tests can run.

Additional context

ERROR: {"status":"Failed","error":{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. 
Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.",
"details":[{"code":"BadRequest","message":"{\r\n  \"error\": {\r\n    
\"code\": \"RoleAssignmentUpdateNotPermitted\",\r\n    
\"message\": \"Tenant ID, application ID, principal ID, and scope are not allowed to be updated.\"\r\n  }\r\n}"}]}}

The role assignment thats having the problem is the RG Reader role for AppGw;

// AGIC's identity requires "Reader" permission over Application Gateway's resource group.
var reader = subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'acdd72a7-3385-48ef-bd42-f606fba81ae7')
resource appGwAGICRGReader 'Microsoft.Authorization/roleAssignments@2022-04-01' = if (ingressApplicationGateway && deployAppGw) {
  scope: resourceGroup()
  name: guid(aks.id, 'Agic', reader)
  properties: {
    roleDefinitionId: reader
    principalType: 'ServicePrincipal'
    principalId: aks.properties.addonProfiles.ingressApplicationGateway.identity.objectId
  }
}

The reason it's having a problem is because the name isn't unique. It's using a static 'Agic' string instead of an identifier to the identity such as principalId. This is because the identity is not known before main.bicep is launched, therefore it cannot form part of the name.

I see 3 options for resolution;

  1. Refactor the AppGw out of main.bicep, which will allow the role assignment name guid to be based on the already existing AKS AppGW ObjectId.
  2. Debug why the environment cleanup is not removing this role assignment (it should). Implement fix.
  3. Speak to the AKS team about addOn support for existing managed identities
Gordonby commented 1 year ago

Raised option 3 in the AKS repo.

https://github.com/Azure/AKS/issues/3863

samaea commented 1 year ago

Thanks for reporting @Gordonby. + @khowling what is the impact on the above?

Gordonby commented 1 year ago

Thanks for reporting @Gordonby. + @khowling what is the impact on the above?

The impact is that ;

  1. 2 of the scheduled CI jobs will always fail.
  2. Full testing in PR will always fail
  3. My Pr is blocked https://github.com/Azure/AKS-Construction/pull/647

image

samaea commented 1 year ago

@mosabami to review next steps.

mosabami commented 11 months ago

@pjlewisuk will be discussing this with @Gordonby and @samaea