Azure / terraform-azurerm-avm-ptn-alz

Terraform module to deploy Azure Landing Zones
https://registry.terraform.io/modules/Azure/avm-ptn-alz/azurerm
MIT License

[AVM Question/Feedback]: Using variables in the custom library files #98

Closed chrholt closed 3 weeks ago

chrholt commented 2 months ago

Description

With the Azure/terraform-azurerm-caf-enterprise-scale module, we define the module once in the root directory and reference that root configuration from a subdirectory per environment. This ensures that all environments share the same policies and other settings. We rely on the template_file_variables variable to pass environment-specific values.

Will a similar approach be available here, i.e. making variables available in the custom library files? If so, great! If not, should I open a feature request, or has this already been considered?

matt-FFFFFF commented 2 months ago

Hi!

You can still use a templating approach, but it needs to be external to the provider and module.

Basically, the files need to be correct at init time.

matt-FFFFFF commented 2 months ago

I had the idea of using a separate Terraform state file to generate the custom lib from templates.

./lib
  - my.alz_architecture_definition.json.tftpl
  - main.tf

You could then generate the custom lib every time you run the pipeline. Do not store the resulting architecture definition files in Git; rely on Terraform to generate them instead.

Then the steps of your pipeline would be:

  1. Pre-reqs, auth, etc
  2. Run tf init, apply in the ./lib dir to generate the artifacts
  3. Run tf init, apply in the root dir to deploy
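
The ./lib step above could be sketched roughly as follows. This is a minimal sketch, assuming the hashicorp/local provider is used to write the rendered file; the variable name root_management_group_id, and the assumption that the template contains a matching ${root_management_group_id} placeholder, are hypothetical and not taken from this thread.

```hcl
# ./lib/main.tf (sketch; the variable and placeholder names are hypothetical)
terraform {
  required_providers {
    local = {
      source = "hashicorp/local"
    }
  }
}

variable "root_management_group_id" {
  type        = string
  description = "Hypothetical per-environment value substituted into the template."
}

# Render the template into the JSON file that the alz provider later reads
# from the custom library directory.
resource "local_file" "architecture_definition" {
  filename = "${path.module}/my.alz_architecture_definition.json"
  content = templatefile(
    "${path.module}/my.alz_architecture_definition.json.tftpl",
    { root_management_group_id = var.root_management_group_id }
  )
}
```

In line with the advice above, the rendered .json file would be excluded from Git and regenerated on each pipeline run.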

Thoughts?

chrholt commented 2 months ago

So either way, we will have to generate the JSON files upfront somehow before they are passed to the provider. We will try it and see if it works for us. Thanks

JWilkinsonMB commented 2 months ago

Would using policy_default_values & policy_assignments_to_modify work for this?

I'm starting to use this module to manage 3 environments and I can customize the policy assignment parameter values for each environment using the above functionality, whilst only defining the policies once. No external templating needed.

Mainly I'm only updating the parameter values using this method, with a couple of instances of the non-compliance message for specific policy assignments. If you've got more complex templating requirements then I guess this might not work.
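
As a rough illustration of that approach, a per-environment module call might look like the sketch below. Everything specific in it is an assumption: the variable names, the default name log_analytics_workspace_id, the management group ID, the assignment name, and the parameter name are all hypothetical, and the exact value encoding expected by policy_default_values and policy_assignments_to_modify should be checked against the input documentation of the module version in use.

```hcl
# Sketch only: names and value shapes are assumptions, not confirmed by this thread.
variable "parent_resource_id" { type = string }
variable "location" { type = string }
variable "log_analytics_workspace_id" { type = string }

module "alz_architecture" {
  source             = "Azure/avm-ptn-alz/azurerm"
  version            = "0.8.1"
  architecture_name  = "example-arch" # hypothetical
  location           = var.location
  parent_resource_id = var.parent_resource_id

  # Library-wide defaults, keyed by the default name defined in the library.
  # Assumed encoding: a JSON string wrapping the value.
  policy_default_values = {
    log_analytics_workspace_id = jsonencode({ value = var.log_analytics_workspace_id })
  }

  # Per-assignment overrides: outer key is the management group ID,
  # inner key is the policy assignment name (both hypothetical here).
  policy_assignments_to_modify = {
    "example-landingzones" = {
      policy_assignments = {
        "Example-Assignment" = {
          enforcement_mode = "DoNotEnforce"
          parameters = {
            exampleParameter = jsonencode({ value = "example" })
          }
        }
      }
    }
  }
}
```

The policies themselves stay in the shared library; only the values passed to the module differ between environment folders.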

I'm also referencing 2 separate libraries for the ALZ provider, one in the parent for any common policies / archetypes, and one in the environment folder for the architecture definition and anything custom that might be needed for that environment (not had to do that yet though).

provider "alz" {
  library_references = [
    {
      path = "platform/alz",
      ref  = "2024.07.02"
    },
    {
      custom_url = "${path.cwd}/../alz_lib_common"
    },
    {
      custom_url = "${path.cwd}/alz_lib"
    }
  ]
}

chrholt commented 2 months ago

That might also work, I will definitely give that a try! Thanks for the suggestion @JWilkinsonMB :-)

chrholt commented 2 months ago

@JWilkinsonMB Do you experience any trouble with roleDefinitions in your setup? I have set up 3 environments and am testing them locally, simulating a pipeline run by triggering plan and apply simultaneously in all 3. However, I quite often hit an error on one or more of these role definitions.

Example:

╷
│ Error: Missing Resource State After Create
│ 
│   with module.alz_architecture.module.role_definitions["new3/Subscription-Owner"].azapi_resource.this,
│   on .terraform/modules/alz_architecture/modules/azapi_helper/main.tf line 1, in resource "azapi_resource" "this":
│    1: resource "azapi_resource" "this" {
│ 
│ The Terraform Provider unexpectedly returned no resource state after having no errors in the resource creation. This is always an issue
│ in the Terraform Provider and should be reported to the provider developers.
│ 
│ The resource may have been successfully created, but Terraform is not tracking it. Applying the configuration again with no other
│ action may result in duplicate resource errors. Import the resource if the resource was actually created and Terraform should be
│ tracking it.
╵
╷
│ Error: Missing Resource State After Create
│ 
│   with module.alz_architecture.module.role_definitions["new3/Application-Owners"].azapi_resource.this,
│   on .terraform/modules/alz_architecture/modules/azapi_helper/main.tf line 1, in resource "azapi_resource" "this":
│    1: resource "azapi_resource" "this" {
│ 
│ The Terraform Provider unexpectedly returned no resource state after having no errors in the resource creation. This is always an issue
│ in the Terraform Provider and should be reported to the provider developers.
│ 
│ The resource may have been successfully created, but Terraform is not tracking it. Applying the configuration again with no other
│ action may result in duplicate resource errors. Import the resource if the resource was actually created and Terraform should be
│ tracking it.
╵
╷
│ Error: Failed to create/update resource
│ 
│   with module.alz_architecture.module.role_definitions["new3/Security-Operations"].azapi_resource.this,
│   on .terraform/modules/alz_architecture/modules/azapi_helper/main.tf line 1, in resource "azapi_resource" "this":
│    1: resource "azapi_resource" "this" {
│ 
│ creating/updating Resource: (ResourceId
│ "/providers/Microsoft.Management/managementGroups/new3/providers/Microsoft.Authorization/roleDefinitions/d3584a79-4f0d-5980-aa3c-7a76ba783b76"
│ / Api Version "2022-04-01"): GET
│ https://management.azure.com/providers/Microsoft.Management/managementGroups/new3/providers/Microsoft.Authorization/roleDefinitions/d3584a79-4f0d-5980-aa3c-7a76ba783b76
│ --------------------------------------------------------------------------------
│ RESPONSE 404: 404 Not Found
│ ERROR CODE: RoleDefinitionDoesNotExist
│ --------------------------------------------------------------------------------
│ {
│   "error": {
│     "code": "RoleDefinitionDoesNotExist",
│     "message": "The specified role definition with ID 'd3584a79-4f0d-5980-aa3c-7a76ba783b76' does not exist."
│   }
│ }
│ --------------------------------------------------------------------------------

I suspect it is mainly caused by the simultaneous plan/apply operations, and possibly the "common" library, but I'm not sure.

I have also hit other errors, for management groups and missing permissions on them, but only during the first few apply runs, until all groups have been created and imported. Example:

Error: Failed to create/update resource
│ 
│   with module.alz_architecture.module.management_groups_level_1["new3-landingzones"].azapi_resource.this,
│   on .terraform/modules/alz_architecture/modules/azapi_helper/main.tf line 1, in resource "azapi_resource" "this":
│    1: resource "azapi_resource" "this" {
│ 
│ creating/updating Resource: (ResourceId "/providers/Microsoft.Management/managementGroups/new3-landingzones" / Api Version
│ "2023-04-01"): GET https://management.azure.com/providers/Microsoft.Management/managementGroups/new3-landingzones
│ --------------------------------------------------------------------------------
│ RESPONSE 403: 403 Forbidden
│ ERROR CODE: AuthorizationFailed
│ --------------------------------------------------------------------------------
│ {
│   "error": {
│     "code": "AuthorizationFailed",
│     "message": "The client 'XYZ' with object id 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee' does not have authorization to perform action 'Microsoft.Management/managementGroups/read' over scope '/providers/Microsoft.Management/managementGroups/new3-landingzones' or the scope is invalid. If access was recently granted, please refresh your credentials."
│   }
│ }
│ --------------------------------------------------------------------------------
│ 
╵

I have not been able to find out exactly why this happens, and setting the timeouts and delays variables hasn't had any effect either. Have you experienced anything like this?


Workspace structure:

alzlib/
    archetype_overrides/
        root_override.alz_archetype_override.json
    policy_assignments/
        audit_network_watcher.alz_policy_assignments.json
env/
    local/
        new/
            ***
        new2/
            ***
        new3/
            lib/
                architecture_definitions/
                    new_arch.alz_architecture_definition.json
            main.tf
            provider.tf

provider.tf:

### ALZ

provider "alz" {
  library_references = [
    {
      path = "platform/alz",
      ref  = "2024.07.02"
    },
    {
      custom_url = "../../../alz_lib" #Common library
    },
    {
      custom_url = "lib" #Environment specific library
    }
  ]
}

main.tf:

data "azapi_client_config" "current" {}

module "alz_architecture" {
  source             = "Azure/avm-ptn-alz/azurerm"
  version            = "0.8.1"
  location           = "norwayeast"
  architecture_name  = "new-arch"
  parent_resource_id = data.azapi_client_config.current.tenant_id

  delays = {
    after_management_group = {
      create  = "5m"
      destroy = "30s"
    }
  }

  timeouts = {
    management_group = {
      create = "2m"
      delete = "1m"
      update = "1m"
      read   = "30s"
    }
    role_definition = {
      create = "30s"
      delete = "30s"
      update = "30s"
      read   = "30s"
    }
  }
}

I should mention that I have hit a similar error for policy assignments too, but there it complains that the assignment is out of scope. What all of these errors have in common is that the subsequent plan/apply says the resource already exists and needs to be imported into state.
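
For a resource that was created in Azure but never recorded in state, one option is an import block (Terraform 1.5 or later) in the environment's root module. The sketch below reuses the resource address and resource ID from the Security-Operations error above; that the azapi provider version in use accepts this bare resource ID as an import ID is an assumption to verify.

```hcl
# Sketch: adopt an already-created role definition into state.
# Address and ID are copied from the error output earlier in this thread.
import {
  to = module.alz_architecture.module.role_definitions["new3/Security-Operations"].azapi_resource.this
  id = "/providers/Microsoft.Management/managementGroups/new3/providers/Microsoft.Authorization/roleDefinitions/d3584a79-4f0d-5980-aa3c-7a76ba783b76"
}
```

Once the import has been applied, the block can be removed from the configuration.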

JWilkinsonMB commented 2 months ago

@chrholt I've not hit the 'Missing Resource State After Create' error, but I've had the 403 error several times. Re-running the deployment once or twice fixed it.

I think changing the timeouts and delays should in theory fix it, but Azure says changes can take up to 10 minutes, and setting a 10-minute delay for everything would make for a very slow deployment.

Looking at https://github.com/Azure/terraform-azurerm-avm-ptn-alz/pull/103 it seems the intent is to move away from fixed delays to use a retry feature with exponential backoff that's being introduced in v2.0.0 of the azapi provider (https://github.com/Azure/terraform-provider-azapi/pull/392).

Hopefully that will fix these issues, but @matt-FFFFFF will be far better qualified to comment.

matt-FFFFFF commented 2 months ago

AzAPI v2 makes the above much easier to deal with. As you can see on the PR we are able to retry on these common errors...

Hold fire until the next release as we also deprecate the helper module.
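
To illustrate what such a retry looks like at the provider level, here is a minimal sketch of a standalone azapi_resource with a retry setting, assuming azapi v2. It is not the module's own code: the management group name and display name are hypothetical, and the error patterns are simply the two error codes seen earlier in this thread.

```hcl
# Sketch only: a standalone resource, not the ALZ module's implementation.
terraform {
  required_providers {
    azapi = {
      source  = "Azure/azapi"
      version = "~> 2.0"
    }
  }
}

resource "azapi_resource" "example_management_group" {
  type      = "Microsoft.Management/managementGroups@2023-04-01"
  name      = "example-mg" # hypothetical
  parent_id = "/"

  body = {
    properties = {
      displayName = "Example" # hypothetical
    }
  }

  # Retry when the error message matches any of these patterns, instead of
  # relying on a fixed delay after creation.
  retry = {
    error_message_regex = ["AuthorizationFailed", "RoleDefinitionDoesNotExist"]
  }
}
```

The provider documentation describes further options for tuning the backoff intervals; they are left at their defaults here.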

matt-FFFFFF commented 2 months ago

Also it's great to see folks coming up with innovative ways to use this! Awesome 🙌

matt-FFFFFF commented 1 month ago

The 'Missing Resource State After Create' errors have been fixed, and the fix will be released in azapi v2.0.

matt-FFFFFF commented 1 month ago

Adding #RR to see if there is anything further required here

JWilkinsonMB commented 1 month ago

Not the original author, but from my perspective v0.9.0-beta with v2 AzAPI has been working really well. Much faster to deploy and so far at least, without the intermittent errors described above.

matt-FFFFFF commented 1 month ago

Excellent! Glad to hear it

chrholt commented 1 month ago

When it comes to variables, I was able to achieve what we need using policy_default_values and policy_assignments_to_modify, as suggested by JWilkinsonMB :-)

Good to hear about the retry feature too, though I have not been able to test it yet.

For our use case we need to be able to assign notScopes dynamically, but I have created a separate issue for that in terraform-provider-alz, so it's not really relevant to this one.

matt-FFFFFF commented 1 month ago

We will follow up with this feature after GA!

microsoft-github-policy-service[bot] commented 3 weeks ago

[!IMPORTANT] @chrholt, this issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.

microsoft-github-policy-service[bot] commented 3 weeks ago

[!WARNING] @chrholt, this issue will now be closed, as it has been marked as requiring author feedback but has not had any activity for 7 days.