Azure / ALZ-Bicep

This repository contains the Azure Landing Zones (ALZ) Bicep modules that help deliver and deploy the Azure Landing Zone conceptual architecture in a modular approach. https://aka.ms/alz/docs
MIT License

🪲 Policy definitions should be specified only at or above the policy set definition's scope. #179

Closed autocloudarc closed 2 years ago

autocloudarc commented 2 years ago

Describe the bug

The error message reports a list of policy definitions as invalid.

To Reproduce

Steps to reproduce the behaviour:

  1. Run the pipeline with the following env: variable values:
    env:
    ManagementGroupPrefix: "alz07"
    TopLevelManagementGroupDisplayName: "Azure Landing Zones"
    Location: "centralus"
    LoggingSubId: "redacted"
    LoggingResourceGroupName: "alz-logging-rgp-01"
    HubNetworkSubId: "redacted"
    HubNetworkResourceGroupName: "alz-network-hub-rgp-01"
    RoleAssignmentManagementGroupId: "alz07"
    SpokeNetworkSubId: "redacted"
    SpokeNetworkResourceGroupName: "spoke-networking-rgp-01"
    runNumber: ${{ github.run_number }}

Expected behaviour

Pipeline runs without error and deploys the platform and landing zones.

Screenshots 📷

image

image

Correlation ID

A correlation ID really helps us investigate your issue further. Please provide one if possible. Details on how to find a correlation ID can be found here: Correlation ID and support

3c8058dc-ca85-4731-92f3-fbc76f96d8f3

Additional context

{
    "status": "Failed",
    "error": {
        "code": "InvalidCreatePolicySetDefinitionRequest",
        "message": "The policy set definition 'Deploy-Diagnostics-LogAnalytics' request is invalid. Policy definitions should be specified only at or above the policy set definition's scope. The following policy definitions are invalid: 'Deploy-Diagnostics-ACI,Deploy-Diagnostics-ACR,Deploy-Diagnostics-AnalysisService,Deploy-Diagnostics-ApiForFHIR,Deploy-Diagnostics-APIMgmt,Deploy-Diagnostics-ApplicationGateway,Deploy-Diagnostics-WebServerFarm,Deploy-Diagnostics-Website,Deploy-Diagnostics-AA,Deploy-Diagnostics-CDNEndpoints,Deploy-Diagnostics-CognitiveServices,Deploy-Diagnostics-CosmosDB,Deploy-Diagnostics-Databricks,Deploy-Diagnostics-DataExplorerCluster,Deploy-Diagnostics-DataFactory,Deploy-Diagnostics-DLAnalytics,Deploy-Diagnostics-EventGridSub,Deploy-Diagnostics-EventGridTopic,Deploy-Diagnostics-EventGridSystemTopic,Deploy-Diagnostics-ExpressRoute,Deploy-Diagnostics-Firewall,Deploy-Diagnostics-FrontDoor,Deploy-Diagnostics-Function,Deploy-Diagnostics-HDInsight,Deploy-Diagnostics-iotHub,Deploy-Diagnostics-LoadBalancer,Deploy-Diagnostics-LogicAppsISE,Deploy-Diagnostics-MariaDB,Deploy-Diagnostics-MediaService,Deploy-Diagnostics-MlWorkspace,Deploy-Diagnostics-MySQL,Deploy-Diagnostics-NIC,Deploy-Diagnostics-NetworkSecurityGroups,Deploy-Diagnostics-PostgreSQL,Deploy-Diagnostics-PowerBIEmbedded,Deploy-Diagnostics-RedisCache,Deploy-Diagnostics-Relay,Deploy-Diagnostics-SignalR,Deploy-Diagnostics-SQLElasticPools,Deploy-Diagnostics-SQLMI,Deploy-Diagnostics-TimeSeriesInsights,Deploy-Diagnostics-TrafficManager,Deploy-Diagnostics-VM,Deploy-Diagnostics-VirtualNetwork,Deploy-Diagnostics-VMSS,Deploy-Diagnostics-VNetGW,Deploy-Diagnostics-WVDAppGroup,Deploy-Diagnostics-WVDHostPools,Deploy-Diagnostics-WVDWorkspace'."
    }
}

jtracey93 commented 2 years ago

@autocloudarc This looks like a race condition that we sometimes see in all deployment experiences.

Did you customise the policy definitions module at all?

The fix/workaround

Give it 10 minutes and run again and it should be fine.

This happens because ARM does not replicate fast enough in the regions you are deploying to. Sometimes the ARM nodes processing the request haven't caught up and think the policies don't exist. They do exist; they just haven't fully replicated yet.

Running again fixes this, as the gap between runs allows the replication to catch up.
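A minimal sketch of the "wait and run again" workaround, for illustration only. It assumes a hypothetical `deploy` callable that runs the deployment and returns a result shaped like the JSON in the issue above (`status` plus an `error.code`); neither `deploy` nor `deploy_with_retry` is part of ALZ-Bicep. The same error code also shows up when a parameter is misconfigured, so if the error survives every retry the cause is probably configuration rather than replication.

```python
import time

# Error code taken from the failure output in this issue.
TRANSIENT_CODE = "InvalidCreatePolicySetDefinitionRequest"


def is_replication_error(result):
    """Return True if the result looks like the failure reported in this issue."""
    error = result.get("error") or {}
    return result.get("status") == "Failed" and error.get("code") == TRANSIENT_CODE


def deploy_with_retry(deploy, attempts=3, wait_seconds=600, sleep=time.sleep):
    """Call the hypothetical `deploy` up to `attempts` times.

    Waits `wait_seconds` (10 minutes by default, as suggested above) between
    attempts, and returns the last result either way so the caller can inspect it.
    """
    result = {}
    for attempt in range(1, attempts + 1):
        result = deploy()
        if not is_replication_error(result):
            return result
        if attempt < attempts:
            sleep(wait_seconds)
    return result
```

The `sleep` argument is injectable only so the wait can be skipped in a dry run.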

Linking this to a master issue we are tracking and working with engineering teams on to resolve. https://github.com/Azure/Enterprise-Scale/issues/902

Thanks and let us know if this works or doesn't. 👍

autocloudarc commented 2 years ago

Thanks @jtracey93. I waited 2 hours between runs, then ran again 2 minutes after that, and unfortunately hit the same issue both times. Here is the screenshot of the latest:

image

autocloudarc commented 2 years ago

Also, @jtracey93: no, I didn't customize any of the policy definition modules.

autocloudarc commented 2 years ago

Here is the Deploy Custom Policy Definitions step now from the deploy-alz.yml file.

      - name: Deploy Custom Policy Definitions
        id: create_policy_defs
        uses: azure/arm-deploy@v1
        with:
          scope: managementgroup
          managementGroupId: ${{ env.ManagementGroupPrefix }}
          region: ${{ env.Location }}
          template: infra-as-code/bicep/modules/policy/definitions/custom-policy-definitions.bicep
          parameters: infra-as-code/bicep/modules/policy/definitions/custom-policy-definitions.parameters.example.json
          deploymentName: create_policy_defs-${{ env.runNumber }}
          failOnStdErr: false

ejhenry commented 2 years ago

@autocloudarc Are you using the default value for parTargetManagementGroupID in the custom-policy-definitions.parameters.example.json file?

jtracey93 commented 2 years ago

@autocloudarc Are you using the default value for parTargetManagementGroupID in the custom-policy-definitions.parameters.example.json file?

I think @ejhenry may have found the issue here. Ensure the management group ID you pass in the parameters is correct, as it is used to look up the intermediate root management group for the policy sets.

So if you are not using 'alz' then you need to update the parameters.
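As a rough pre-flight check, a short script can confirm that the parameter file and the pipeline agree before deploying. The sketch below assumes the standard ARM parameter file layout (`parameters` -> name -> `value`) and uses the parameter name as quoted in this thread; the helper names are made up for illustration.

```python
import json


def target_management_group(parameters_text, name="parTargetManagementGroupID"):
    """Return the value of the named parameter, or None if it is not set."""
    doc = json.loads(parameters_text)
    entry = doc.get("parameters", {}).get(name)
    if not isinstance(entry, dict):
        return None
    return entry.get("value")


def matches_prefix(parameters_text, prefix, name="parTargetManagementGroupID"):
    """True only if the parameter file targets the same management group as the pipeline."""
    return target_management_group(parameters_text, name) == prefix
```

With `ManagementGroupPrefix` set to "alz07" as in this issue, `matches_prefix(text, "alz07")` would be False for a parameter file that still carries the default 'alz'.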

autocloudarc commented 2 years ago

@jtracey93 and @ejhenry, thank you. Yes, updating that parameter to match my custom top-level management group prefix, alz07, did work in that the deployment progressed a bit further, but a similar error reappeared. It seems to be because there are still numerous other dependencies in various files that would need to be updated. Doing a [ctrl-shift-] shows the following for 'alz':

image

When I reverted my top-level management group ID to the default of 'alz', along with the value suggested by @ejhenry, it did get past the Deploy Custom Policy Definitions step. However, it would have been better to be able to specify a custom prefix value once and have it propagate to all other relevant values throughout the code. That part is now more of a feature request, which seems to have already been addressed in #158.

jtracey93 commented 2 years ago

Hey @autocloudarc,

All of these are parameterised, so if you update the top-level management group prefix, you need to update all the parameter files to the same input. All the modules already support using a different top-level prefix today, but you need to ensure you update the parameter files for each module to match, or tailor them to your needs.

This is by design, to keep the modules flexible and easily customizable via parameter inputs.
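One way to find the parameter files that still need updating is to scan them for the default prefix. This is only a sketch under assumptions: it treats any file matching `*.parameters*.json` under a given folder as an ARM-style parameter file, checks only top-level string values (nested objects and arrays are skipped), and the function name is hypothetical.

```python
import json
import re
from pathlib import Path


def find_default_prefix(root, default_prefix="alz"):
    """List (file, parameter, value) for string parameters still using the default prefix."""
    # Match the prefix only as a whole token, so "alz" and "alz-platform" are
    # reported but a custom prefix such as "alz07" is not.
    pattern = re.compile(
        rf"(?<![A-Za-z0-9]){re.escape(default_prefix)}(?![A-Za-z0-9])"
    )
    hits = []
    for path in sorted(Path(root).rglob("*.parameters*.json")):
        doc = json.loads(path.read_text(encoding="utf-8"))
        for name, entry in doc.get("parameters", {}).items():
            value = entry.get("value") if isinstance(entry, dict) else None
            if isinstance(value, str) and pattern.search(value):
                hits.append((str(path), name, value))
    return hits
```

Each hit is a candidate to review against the module's README.md, not necessarily an error, since some values may legitimately contain 'alz'.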

This is not related to #158 for clarity.

Please ensure you read through each of the module README.md files to ensure you set the parameters correctly. https://github.com/Azure/ALZ-Bicep/wiki/DeploymentFlow

What were the other errors you saw?

ghost commented 2 years ago

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.