Azure / terraform-azurerm-caf-enterprise-scale

Azure landing zones Terraform module
https://aka.ms/alz/tf
MIT License

Request for feedback: future direction for this module ⏫ #630

Closed · matt-FFFFFF closed 1 year ago

matt-FFFFFF commented 1 year ago

We are evaluating the direction for v.next of this module and want your input.

We have a number of ideas under consideration; we may choose to implement some, all, or none of them. When we mention the ALZ module, we mean the core ALZ components: management groups, policies, and RBAC.

Ask

Please respond in the issue comments with feedback on the proposed changes, positive and negative. Please also use this thread to tell us about your overall experience of using the module.

Meet the ALZ Terraform product group

We are conducting research; if you have feedback on the module's future direction, we may want to speak to you. Please nominate yourself if interested.

Current architecture

The module currently uses sub-modules as data providers for the core module, which deploys the resources.

flowchart LR
    main.tf-->ALZ
    subgraph module
        direction LR
        ALZ<-->connectivity
        ALZ<-->archetypes
        ALZ<-->management
    end
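
For context, a minimal consumer of today's single module looks roughly like the snippet below (a sketch based on the module's README; version and values are illustrative):

# Minimal caller of the current module (illustrative values).
data "azurerm_client_config" "core" {}

module "enterprise_scale" {
  source  = "Azure/caf-enterprise-scale/azurerm"
  version = "~> 4.0" # pin to a tested release

  # The module expects aliased providers for connectivity and management;
  # here they all point at the same subscription for simplicity.
  providers = {
    azurerm              = azurerm
    azurerm.connectivity = azurerm
    azurerm.management   = azurerm
  }

  root_parent_id   = data.azurerm_client_config.core.tenant_id
  root_id          = "alz"
  default_location = "westeurope"
}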

Proposal - More specific modules

flowchart LR
    main.tf-->ALZ
    main.tf-->hub&spoke
    main.tf-->vwan
    main.tf-->management

We have heard that some customers do not like that all resources are deployed by the same module. We are therefore considering the benefits of hosting a greater number of more focused modules that play nicely together. The advantage for us is that we can evolve them independently, and they become easier to maintain and test. The disadvantage is that migration is complex.

Proposal - Simplify internal architecture

flowchart LR
    main.tf-->ALZ
    ALZ-->hub&spoke
    ALZ-->vwan
    ALZ-->management

Some of the internal workings of the module are complex and could be simplified. We are considering adopting a more traditional architecture: simplifying the core module, then potentially calling sub-modules from within it. The advantage for us is that we can evolve them independently, and they become easier to maintain and test. In this scenario we could support migration using the moved block.
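
For illustration, a minimal sketch of that migration aid, assuming a resource is relocated from the core module into a nested management sub-module (addresses are hypothetical):

# Hypothetical: a moved block inside the refactored core module lets existing
# state follow the resource to its new address instead of destroy/recreate.
moved {
  from = azurerm_log_analytics_workspace.management
  to   = module.management.azurerm_log_analytics_workspace.this
}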

birdnathan commented 1 year ago

Deploying everything from a single module doesn't really bother me as long as it works, but I understand the challenge and would support the 2nd option. As a partner that uses these modules across many customers and tenants, lots of breaking changes and complex migration requirements do not appeal.

mw8er commented 1 year ago

I'm in for the 2nd option as well for the following reasons:

MarcelHeek commented 1 year ago

Hi, just to put my two cents into the discussion: we only consume the policies from the current module. As long as that is still easily consumable…

tlfzhylj commented 1 year ago

The complexity of today's module is pretty big, and it is difficult to understand how things fit together if you want to contribute or troubleshoot. If it were simpler and more straightforward, I think more people would contribute.

Proposal regarding the difficult migration if going for option 1: maybe we can use the moved {} block in the root module, and you can provide a template for the moved blocks in a migration guide?

jtracey93 commented 1 year ago

> The complexity of today's module is pretty big, and it is difficult to understand how things fit together if you want to contribute or troubleshoot. If it were simpler and more straightforward, I think more people would contribute.
>
> Proposal regarding the difficult migration if going for option 1: maybe we can use the moved {} block in the root module, and you can provide a template for the moved blocks in a migration guide?

Thanks for the input @tlfzhylj,

Do you think that if we just improved the contribution docs with diagrams of how data flows etc., created some videos, and tidied up the code a little to make it simpler in places, this would address your feedback, instead of changing the modules?

jtracey93 commented 1 year ago

> Hi, just to put my two cents into the discussion: we only consume the policies from the current module. As long as that is still easily consumable…

Thanks @MarcelHeek for the input here.

Do you therefore just set deploy_management_resources & deploy_connectivity_resources to false?

What are your thoughts on how this part of the module works and how flexible it is today?

Does it bother/concern you that there is other functionality in the code base (management and connectivity stuff) that you will never use, just hanging around in the module?

Let us know

tlfzhylj commented 1 year ago

> > The complexity of today's module is pretty big, and it is difficult to understand how things fit together if you want to contribute or troubleshoot. If it were simpler and more straightforward, I think more people would contribute. Proposal regarding the difficult migration if going for option 1: maybe we can use the moved {} block in the root module, and you can provide a template for the moved blocks in a migration guide?
>
> Thanks for the input @tlfzhylj,
>
> Do you think that if we just improved the contribution docs with diagrams of how data flows etc., created some videos, and tidied up the code a little to make it simpler in places, this would address your feedback, instead of changing the modules?

No, it wasn't what I meant.

I absolutely think that the module should be re-architected, and I think both options 1 and 2 are good approaches. @matt-FFFFFF wrote under option 1 that "The disadvantage is that migration is complex". As a possible solution, one approach could be to use moved blocks in the root module (in other words, the consumer's Terraform code). If we use moved blocks in the root module, we could move resources in state from one module to another.

A little difficult to explain. English is not my native language. But I hope you understand what I mean.
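
A minimal sketch of that idea, with illustrative addresses (note that Terraform rejects moved statements that cross module package boundaries, so this works for local modules, while registry-sourced modules would need something like terraform state mv):

# In the consumer's root module: re-point state from the old monolithic
# module to a new, more specific module without destroying resources.
moved {
  from = module.enterprise_scale.azurerm_management_group.level_1["alz"]
  to   = module.alz_core.azurerm_management_group.level_1["alz"]
}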

matt-FFFFFF commented 1 year ago

> No, it wasn't what I meant.
>
> I absolutely think that the module should be re-architected, and I think both options 1 and 2 are good approaches. @matt-FFFFFF wrote under option 1 that "The disadvantage is that migration is complex". As a possible solution, one approach could be to use moved blocks in the root module (in other words, the consumer's Terraform code). If we use moved blocks in the root module, we could move resources in state from one module to another.
>
> A little difficult to explain. English is not my native language. But I hope you understand what I mean.

Hi @tlfzhylj I totally understand what you mean. Thanks for your feedback. The option of a moved block in the caller's root module is something to consider :)

davelee212 commented 1 year ago

I work at an MSP and we've taken an approach more similar to "Proposal - More specific modules". We have separate Git repos for each of our TF modules, and we tag commits with an incremented version number when they are ready for use. We then have sample configs that deploy different aspects of an ALZ: one for management groups, one for an Identity subscription, one for a Connectivity subscription, one for a migrated IaaS workload, one for a greenfield AVD workload, etc.

We clone/fork these for customers when we do new deployments for them. Our delivery guys can customise the configs as needed (for example, adding our App GW module to our sample IaaS config if need be). It doesn't matter if we subsequently make breaking changes to resource-specific modules, as the cloned TF configs are pinned to the module versions they were using when they were forked/cloned. But if we want to update a customer's configuration to use a newer version of our module for a VM (for example), we can change the reference to that module and test it specifically before using it.

I've found that to be a simpler and more flexible approach than one larger ALZ module, and it seems to be working for us so far. Some of it (especially Azure Policy) may have "borrowed heavily" from this ALZ module, though!
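
A minimal sketch of the pinning pattern described above, assuming Git-hosted modules with tag-based releases (the repo URL, tag, and input name are made up):

# Each module lives in its own repo and is tagged per release; cloned customer
# configs stay pinned to their tag until the reference is deliberately bumped.
data "azurerm_client_config" "current" {}

module "management_groups" {
  source = "git::https://example.com/msp/terraform-azurerm-management-groups.git?ref=v2.3.0"

  root_parent_id = data.azurerm_client_config.current.tenant_id
}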

hlokensgard commented 1 year ago

I find the first solution most appealing. For me, the idea of having smaller modules that are easier to maintain, develop, and test has a higher positive effect than the negative impact of the one-time migration job. In the long run it will result in higher customer value for the users of this module. However, I do understand the pain and struggle for everyone who needs to migrate and would suggest that you provide a good solution for those who need it.

My thoughts are that if you continue with this complex module, you will over time increase the technical debt and complexity to a level that is really hard to comprehend for the core team and for those who want to contribute to the module.

Comparing the different solutions: to me it feels cleaner to have specific modules that each developer can pick from, rather than a huge wrapper module. In my experience it makes managing the code and the Terraform state easier. It also makes it easier to keep each module up to date: if you always want the latest updates from the ALZ module and the management module, you can easily do that while pinning the hub&spoke module to a specific release, because network changes could have some unpleasant surprises. This is of course just one example of the flexibility that having different modules can offer; it simply gives developers more options.

I have seen several cases where developers split their code into different parts (folders and/or repositories). They basically call the module three times: once for management, once for connectivity, and once for identity. If I understood @davelee212 correctly, they use the same approach. I feel this solution is very clean and easy to maintain, and it fits the first proposal better than the second, since the parts are already split up.

MarcelHeek commented 1 year ago

> > Hi, just to put my two cents into the discussion: we only consume the policies from the current module. As long as that is still easily consumable…
>
> Thanks @MarcelHeek for the input here.
>
> Do you therefore just set deploy_management_resources & deploy_connectivity_resources to false?
>
> What are your thoughts on how this part of the module works and how flexible it is today?
>
> Does it bother/concern you that there is other functionality in the code base (management and connectivity stuff) that you will never use, just hanging around in the module?
>
> Let us know

@jtracey93 We actually did not set these variables, as they are optional and default to false. We only have deploy_core_landing_zones = true set. We deployed the management/connectivity resources using the CAF supermodule, because our design included vWAN, and at the time the vWAN feature was not ready in this module. A separate module for each component, developed separately, would make more sense; perhaps this could speed up releases for policy changes consumed from the Enterprise-Scale repository.
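
For context, that policy-only consumption pattern looks roughly like this with today's module (a sketch; flag names are from the module's documented interface, and the management and connectivity flags already default to false):

data "azurerm_client_config" "current" {}

module "enterprise_scale" {
  source  = "Azure/caf-enterprise-scale/azurerm"
  version = "~> 4.0" # illustrative

  providers = {
    azurerm              = azurerm
    azurerm.connectivity = azurerm
    azurerm.management   = azurerm
  }

  root_parent_id   = data.azurerm_client_config.current.tenant_id
  default_location = "westeurope"

  deploy_core_landing_zones     = true  # management groups, policy, RBAC
  deploy_management_resources   = false # default
  deploy_connectivity_resources = false # default
}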

adhodgson1 commented 1 year ago

When I show people the current module, their first reaction tends to be to want to move away because they don't easily understand how the pieces fit together. I have gone down the rabbit hole now and don't think I could do any better with something we put together ourselves, so I tend to use the module in my gigs. I can turn off most features of the module; for example, in the current deployment I use it just for policies by switching off the deployment of the core management groups and the management/connectivity resources, and providing my own base management groups. Being able to use standard policy definitions and switch to built-in archetypes if I want to is very powerful and is something I would miss if this module became abandoned. I wouldn't lose any sleep if the modules for the management and connectivity resources got moved; I very rarely use these features, especially on the connectivity side.

matt-FFFFFF commented 1 year ago

> When I show people the current module, their first reaction tends to be to want to move away because they don't easily understand how the pieces fit together. I have gone down the rabbit hole now and don't think I could do any better with something we put together ourselves, so I tend to use the module in my gigs. I can turn off most features of the module; for example, in the current deployment I use it just for policies by switching off the deployment of the core management groups and the management/connectivity resources, and providing my own base management groups. Being able to use standard policy definitions and switch to built-in archetypes if I want to is very powerful and is something I would miss if this module became abandoned. I wouldn't lose any sleep if the modules for the management and connectivity resources got moved; I very rarely use these features, especially on the connectivity side.

We certainly aren't abandoning it! Thanks for the feedback.

torivara commented 1 year ago

I have not used this module as much as I would have liked, but I still had some thoughts when I tested it for possible use a while back.

Simplification would generally be a great improvement; I found the module to be rather complex. Instead of using all of it, I opted to borrow the logic I wanted to use (policies) and bastardized it somewhat to make it do what I wanted. If brownfield deployment were clearer, I might have used it in its entirety. Correct me if I am wrong, as I could have missed some recent documentation updates.

Of the proposed solutions, I initially found the second proposal favorable. There are so many complex inner workings, as you put it somewhere, that I think it would be difficult to refactor them and share information between the modules. Problems with dependencies or information sharing could emerge and end up making things harder rather than simpler 🤔

After some thought, I can see the benefits of calling the different pieces of code "on their own", much like it was done with the ALZ Bicep approach. 💪 I liked the orchestrator method and enjoyed making my own there for a little while; it helped me understand the moving pieces better. For me it feels more natural to split these different parts into different states, which could be simpler with separate modules. Inter-connectivity and dependencies would be more challenging, but maybe that is outweighed by the better management 🤷‍♂️

I guess my conclusion is that both proposals would solve some pain points, and general simplification of the module itself would also help a whole lot. If it is doable, I suppose the first proposal could work well, as long as it doesn't require massive amounts of dependency mapping for anyone deploying it. The answer from @jtracey93 above is spot on: more documentation, an expanded contribution guide, and simplified code 👍

Can you shed some light on the probable roadmap for changes in this module, time-wise? Do you have any concrete idea of when this next iteration could realistically be available? I have some designs in the to-do pile that might need a re-think if this module changed drastically.

matt-FFFFFF commented 1 year ago

Thanks for all the contributions so far; we really appreciate it. To re-iterate, we see a bright future for this module and are working hard on shaping the next version. Migration is important to us, and we want to make sure there's a path to upgrade.

One item we'd like feedback on is the use of a new provider. This would make the resulting HCL much simpler, as the complex data processing would be performed in the provider, written in Go. This would make things much faster, more testable, and easier for the caller to consume.

erDiagram
  "Customer /lib dir" ||--|| alz-provider : "reads"
  "main.tf" ||--}| alz-module : "calls"
  "alz-module" ||--}| archetypes-sub-module : "calls"
  alz-module ||--|| alz-provider : "declares"

To give an example of how this might look, please see some mock code below:

terraform {
  required_providers {
    alzlib = {
      source  = "Azure/alzlib"
      version = ">= 1.0.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 3.7.0"
    }
  }
}

provider "azurerm" {
  features {}
}

# Declare provider config, pointing to the custom lib directory
provider "alzlib" {
  custom_lib_dir   = "./lib"
  # default location for, e.g., policy managed identities
  default_location = "westeurope"
}

# get the tenant id to use for the tenant root management group
data "azurerm_client_config" "current" {}

# Declare root archetype, based on built-in root definition baked into provider
data "alzlib_archetype" "root" {
  base_archetype = "root"
  name           = "alz-root"
  display_name   = "ALZ root"
  parent_id      = data.azurerm_client_config.current.tenant_id
}

# Declare landing zones archetype, based on built-in landing-zones definition baked into provider
# but adding additional policy assignments
data "alzlib_archetype" "landing_zones" {
  base_archetype = "landing_zones"
  name           = "landing-zones"
  display_name   = "landing zones"
  parent_id      = data.alzlib_archetype.root.name

  policy_assignment_additions  = ["my-assignment"]
  policy_assignment_parameters = {
    my-assignment = {
      myParameter = "this"
    }
  }
}

# Declare custom archetype, based on built-in empty definition baked into provider
# adding everything necessary
data "alzlib_archetype" "custom" {
  base_archetype = "empty"
  name           = "custom"
  display_name   = "custom"
  parent_id      = data.alzlib_archetype.landing_zones.name

  policy_assignment_additions  = ["my-assignment2", "myassignment3"]
  policy_assignment_parameters = {
    my-assignment2 = {
      myParameter = true
    }
  }
}

# create root management group and policy/roles
module "archetype_root" {
  source       = "./modules/archetype"
  archetype    = data.alzlib_archetype.root
}

# create landing-zones management group and policy/roles
module "archetype_landing_zones" {
  source       = "./modules/archetype"
  archetype    = data.alzlib_archetype.landing_zones
}

# create custom management group and policy/roles
module "archetype_custom" {
  source       = "./modules/archetype"
  archetype    = data.alzlib_archetype.custom
}

@birdnathan @mw8er @MarcelHeek @tlfzhylj @davelee212 @hlokensgard @adhodgson1 @torivara @jtracey93

LaurentLesle commented 1 year ago

I would leave the compliance management in the ALZ module to manage the management group structure, policies, initiatives, assignments, RBAC on management groups, and so on.

Networking, management, identity, and security are external dependencies that are injected as needed as input parameters into the ALZ module's policy parameters, access_control, or others. Therefore, most of the time they have to be created before you can deploy the compliance management.

If you decouple it that way, it will leave the end customer with more options on how to structure the platform services to deploy in the platform landing zones. It will also help with day-2 activities, which are most of the time managed by different teams in large enterprise customers. For instance, it would segregate identity changes from networking firewall rules or route table entries.

With that approach:

I like the idea of moving the artefact management into a provider and exposing the artefacts as data sources; it will simplify the graph a lot. You could also explore how to include (or combine; happy to help here) the CAF naming provider, to offer a single provider for naming resources: https://registry.terraform.io/providers/aztfmod/azurecaf/1.2.24-preview/docs/data-sources/azurecaf_name
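
For reference, the linked azurecaf provider exposes naming as a data source along these lines (a sketch; values are illustrative):

terraform {
  required_providers {
    azurecaf = {
      source  = "aztfmod/azurecaf"
      version = "1.2.24-preview"
    }
  }
}

# Generate a CAF-compliant name for a resource group.
data "azurecaf_name" "rg" {
  name          = "platform"
  resource_type = "azurerm_resource_group"
  prefixes      = ["alz"]
  random_length = 3
}

# The computed name is exposed via the result attribute.
output "rg_name" {
  value = data.azurecaf_name.rg.result
}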

My last point is to pay attention to the migration. I am not sure the moved block as it stands today would be enough. Would this split of modules result in multiple tfstates? Would you keep the structure of the current tfvars the same, or would this migration also require a change to the tfvars?

So I am more in favour of option 1, to support dependency injection of external resources into the ALZ module. And don't forget identity!
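
A hypothetical sketch of that dependency-injection pattern (the module path, input names, and parameter keys are all made up): platform resources are created first by whichever team owns them, and only their IDs flow into the compliance module.

# IDs of externally managed platform resources are injected into the
# compliance module as plain inputs, e.g. for policy assignment parameters.
variable "log_analytics_workspace_id" {
  type        = string
  description = "Created by the management team and injected here."
}

data "azurerm_client_config" "current" {}

module "alz_compliance" {
  source = "./modules/alz-compliance" # hypothetical module

  root_parent_id = data.azurerm_client_config.current.tenant_id

  policy_parameters = {
    logAnalyticsWorkspaceId = var.log_analytics_workspace_id
  }
}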

matt-FFFFFF commented 1 year ago

Locking issue - thank you all for your feedback! We really appreciate it. We will consider this feedback and publish our roadmap when ready.