Azure / Enterprise-Scale

The Azure Landing Zones (Enterprise-Scale) architecture provides prescriptive guidance coupled with Azure best practices, and it follows design principles across the critical design areas for organizations to define their Azure architecture
https://aka.ms/alz
MIT License
1.65k stars 934 forks

Feature Request - Cost Management, Quotas & Budgets #1105

Closed DevSecNinja closed 1 year ago

DevSecNinja commented 1 year ago

Describe the solution you'd like

One of the key pillars of our Well-Architected Framework is Cost Optimization. With our Enterprise Scale model, we have a great opportunity to support our customers in efficiently managing costs. I'm opening an issue to start a broader conversation on this topic.

Quotas & Budgets

We suggest that customers consider setting up budgets, which is a great idea.

Subscription Level budgets

I always suggest that customers set up budget alerts for at least their core landing zone subscriptions to ensure we meet the initial estimates. From a security perspective, this also gives us an extra alert in case resource misuse takes place (e.g. cryptojacking). It's also very important to keep track of sandbox subscriptions, as these are often overlooked. To align with the best practices, it would be great if we could integrate budget alerts into the subscriptions managed by the various Enterprise Scale modules. What are your thoughts on that?

In my personal environment, I'm using the Terraform Enterprise Scale module and I have a JSON file that lists my subscriptions including the budget:

{
    "platform": {
        ...
    },
    "landing_zones": {
        "corp": {
            "subscriptions": [
                {
                    "name": "lz-corp-01",
                    "id": "...",
                    "budget": 130
                }
            ]
        }
    },
    "decommissioned": {},
    "sandbox": {}
}

In my main.tf, I gather the subscription IDs:

subs_landing_zones_corp_id   = [for sub in local.subscriptions.landing_zones.corp.subscriptions : sub.id]
subs_landing_zones_corp_name = [for sub in local.subscriptions.landing_zones.corp.subscriptions : sub.name]

And on the landing zones I configure the budgets like so:

resource "azurerm_consumption_budget_subscription" "lz_budget" {
  for_each = { for subscription in local.subscriptions.landing_zones.corp.subscriptions : subscription.name => subscription }

  name            = "${local.generic.org.root_id}-budget-${each.key}-${each.value.budget}-eur"
  subscription_id = "/subscriptions/${each.value.id}"

  amount     = each.value.budget
  time_grain = "BillingMonth"

  time_period {
    start_date = "2022-06-01T00:00:00Z"
  }

  notification {
    enabled   = true
    threshold = 80.0
    operator  = "EqualTo"

    contact_emails = [
      local.generic.org.owner.email
    ]

    contact_roles = [
      "Owner"
    ]
  }

  notification {
    enabled        = true
    threshold      = 90.0
    operator       = "GreaterThan"
    threshold_type = "Forecasted"

    contact_emails = [
      local.generic.org.owner.email
    ]

    contact_roles = [
      "Owner"
    ]
  }
}
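The same pattern could be extended to every group under landing_zones rather than only corp. The following is a minimal sketch, not part of the original setup, and it rests on several assumptions: the JSON above is stored in a file called subscriptions.json (a hypothetical name), the azurerm provider is configured elsewhere, and the variables budget_contact_email and budget_start_date as well as the local budgeted_subscriptions are illustrative names.

```hcl
# Hypothetical inputs, not taken from the original configuration.
variable "budget_contact_email" {
  description = "Address that receives the budget alerts."
  type        = string
}

variable "budget_start_date" {
  description = "First day of a month in RFC 3339 format, e.g. 2022-06-01T00:00:00Z."
  type        = string
}

locals {
  # Assumption: the JSON shown above lives next to this file as subscriptions.json.
  subscriptions = jsondecode(file("${path.module}/subscriptions.json"))

  # Flatten all landing zone groups into one list of subscriptions.
  # try() covers groups without a "subscriptions" list; can() skips
  # entries that do not define a "budget".
  budgeted_subscriptions = flatten([
    for group_name, group in local.subscriptions.landing_zones : [
      for sub in try(group.subscriptions, []) : sub
      if can(sub.budget)
    ]
  ])
}

resource "azurerm_consumption_budget_subscription" "all_lz_budgets" {
  # One budget per subscription, keyed by the subscription name.
  for_each = { for sub in local.budgeted_subscriptions : sub.name => sub }

  name            = "budget-${each.key}"
  subscription_id = "/subscriptions/${each.value.id}"
  amount          = each.value.budget
  time_grain      = "BillingMonth"

  time_period {
    start_date = var.budget_start_date
  }

  # Alert when the forecasted spend exceeds 90% of the budget.
  notification {
    enabled        = true
    threshold      = 90.0
    operator       = "GreaterThan"
    threshold_type = "Forecasted"
    contact_emails = [var.budget_contact_email]
  }
}
```

Keying for_each by subscription name keeps the resource addresses stable when the order of entries in the JSON file changes, but it does require the names to be unique across all groups.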

Resource Level quotas

For resources like a Log Analytics workspace, it's critical to keep an eye on the costs and set a budget cap, especially when a single Log Analytics workspace is used by Sentinel as well. This prevents a spike in costs in case one or more applications suddenly start sending out a huge amount of error logs. What is our general best practice around this?

With the Terraform ALZ module, it's a bit difficult to configure. I was expecting something like this to work:

# Configure the management resources settings.
locals {
  configure_management_resources = {
    location = "westeurope"
    tags = {}
    settings = {
      log_analytics = {
        enabled = true
        config = {
          retention_in_days = 30
          daily_quota_gb    = 10
....

Instead, it needs to be set in an advanced block:

# Configure the management resources settings.
locals {
  configure_management_resources = {
   ....
    advanced = {
      custom_settings_by_resource_type = {
        azurerm_log_analytics_workspace = {
          management = {
            daily_quota_gb = 10
          }
        }
      }
    }
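
For comparison, this is roughly what the setting corresponds to on a standalone workspace outside the ALZ module. It is a minimal sketch with placeholder names and values: the azurerm provider's azurerm_log_analytics_workspace resource exposes daily_quota_gb directly, and leaving it unset keeps the provider default of -1 (no cap).

```hcl
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
  }
}

provider "azurerm" {
  # Recent provider versions also expect a subscription ID, either here
  # or through the ARM_SUBSCRIPTION_ID environment variable.
  features {}
}

# Placeholder resource group for the example workspace.
resource "azurerm_resource_group" "example" {
  name     = "rg-law-example"
  location = "westeurope"
}

resource "azurerm_log_analytics_workspace" "example" {
  name                = "law-example"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  sku                 = "PerGB2018"
  retention_in_days   = 30

  # Daily ingestion cap in GB; the provider default of -1 means unlimited.
  daily_quota_gb = 10
}
```

Once the cap is reached, ingestion of billable data generally stops until the daily reset, so the value is a trade-off between cost protection and log completeness.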

Any feedback is welcome - thank you!

jtracey93 commented 1 year ago

Thanks again for the feedback @DevSecNinja.

For sub level quota we provide a policy in ALZ that can be assigned here: https://www.azadvertizer.net/azpolicyadvertizer/Deploy-Budget.html

We are also tracking this on our sub vending modules as feature requests that we plan to implement:

As for resource level quotas, we don't want to set these on platform resources such as LAWs, as we don't want anything to be missed. If a critical event happens after the quota has been hit for the day, the customer is potentially in the dark with no logs 😢

For things inside a subscription, as per the subscription democratization principle of ALZ, we really want to empower the application/workload teams to take responsibility, as per our guidance, for their costs etc. So they should be setting budgets and governing/managing their subscription to support their workload, as they know it best.

Otherwise, if platform teams impose settings on them that are too specific within app/workload landing zones, you risk slowing down agility and potentially increasing the risk of "shadow IT". This relates to the whole operating model conversation, in which ALZ leans towards Enterprise Operations.

Hope this helps. Let us know your thoughts if there is anything additional we should/could do in ALZ to assist, on top of what we already have today and are planning to do as shared above.

Thanks

Jack

DevSecNinja commented 1 year ago

That looks good Jack, glad it's on the backlog for the LZ vending. I wasn't aware of that policy, and I think the same goes for most of the Terraform CAF module users, as it needs to be manually assigned. It would be great if there was a definition that uses forecasted budgets, to make sure the owner gets the alert based on the trend. And it would be nice if there was a parameter for it in the configure_management_resources & configure_connectivity_resources settings, so that we also set them on the identity/management/connectivity subscriptions. Maybe I'll dive into it and write the documentation for it in the Terraform repo.

Completely agree with your points, and that setting a random quota by default isn't the right thing to do. But it's currently kind of difficult to configure, and we don't document much about it (in terms of how to configure it and what our best practices are). What do you think of mentioning budgets & quotas more specifically in the documentation? E.g.

We could include the recommendations in the other repositories as well.

Aligning this with #1000, if we separate the operational logs (LAW) from security logs (LAW + Sentinel), the quota becomes more interesting as well. Just to give you an example: after deploying an AKS cluster, my LAW usage went up big time and caused me to run out of my Visual Studio subscription credits for that month.

Thanks!

SteveBurkettNZ commented 1 year ago

Agree that this Deploy-Budget policy isn't well publicised, didn't know it existed and had gone down the same route as DevSecNinja (though less elegantly...)

Can see the problem with setting a resource quota on the LAW(s), but it would be nice to be given a heads up (via an alert?) that it's gone way over the usual ingestion size and is going to result in a sizeable increase in billing.
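
One way such a heads-up could be wired up is a scheduled log query alert on the workspace's Usage table. This is a minimal sketch, not something ALZ ships: it assumes the azurerm provider is configured elsewhere, all variable names are hypothetical, and it uses a fixed GB threshold as a simplification rather than comparing against the usual ingestion size.

```hcl
# Hypothetical inputs; none of these names come from ALZ.
variable "log_analytics_workspace_id" {
  description = "Resource ID of the workspace to watch."
  type        = string
}

variable "resource_group_name" {
  description = "Resource group that will hold the alert rule."
  type        = string
}

variable "location" {
  description = "Azure region of the alert rule."
  type        = string
}

variable "action_group_id" {
  description = "Resource ID of an existing action group to notify."
  type        = string
}

variable "daily_ingestion_threshold_gb" {
  description = "Billable ingestion per day (GB) above which the alert fires."
  type        = number
  default     = 10
}

resource "azurerm_monitor_scheduled_query_rules_alert_v2" "law_ingestion" {
  name                = "law-daily-ingestion-high"
  resource_group_name = var.resource_group_name
  location            = var.location
  description         = "Billable Log Analytics ingestion exceeded the daily threshold."

  # Evaluate once a day over the previous day of data.
  evaluation_frequency = "P1D"
  window_duration      = "P1D"
  scopes               = [var.log_analytics_workspace_id]
  severity             = 2

  criteria {
    # The Usage table reports Quantity in MB, so divide to get GB.
    query = <<-QUERY
      Usage
      | where IsBillable == true
      | summarize IngestedGB = sum(Quantity) / 1000.0
      QUERY

    time_aggregation_method = "Total"
    metric_measure_column   = "IngestedGB"
    operator                = "GreaterThan"
    threshold               = var.daily_ingestion_threshold_gb
  }

  action {
    action_groups = [var.action_group_id]
  }
}
```

Unlike a daily quota, an alert like this does not stop ingestion, so no logs are lost while the team decides how to react.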

jtracey93 commented 1 year ago

We have also added and assigned this new policy by default in ALZ to help with finding unused resources to help with cost management - https://github.com/Azure/Enterprise-Scale/wiki/ALZ-Policies#:~:text=Deny-,Audit%2DUnusedResourcesCostOptimization,-Audit%2DUnusedResourcesCostOptimization

We are also about to release a new sandbox guidance page in CAF to help with this, including how to find other policies to use.