Azure / bicep

Bicep is a declarative language for describing and deploying Azure resources
MIT License

[AVM Module Issue]: DNS Zones - Deployment History limit #14139

Open aavdberg opened 2 months ago

aavdberg commented 2 months ago

Check for previous/existing GitHub issues

Issue Type?

I'm not sure

Module Name

avm/res/network/dns-zone

(Optional) Module Version

No response

Description

I am deploying DNS zones through the 'br/public:avm/res/network/dns-zone:0.2.4' module.

But I am running into the following problem:

image

I deploy seventy domains, each with records in it.

(Optional) Correlation Id

No response

avm-team-linter[bot] commented 2 months ago

@aavdberg, thanks for submitting this issue for the avm/res/network/dns-zone module!

[!IMPORTANT] A member of the @azure/avm-res-network-dnszone-module-owners-bicep or @azure/avm-res-network-dnszone-module-contributors-bicep team will review it soon!

ChrisSidebotham commented 2 months ago

Hi @aavdberg.

Thanks for raising this issue. I am intrigued: are you supplying domains and records totalling more than 800?

The 800 limit comes from the resource group level limits as noted here: https://learn.microsoft.com/en-us/azure/azure-resource-manager/templates/deployment-history-deletions?tabs=azure-powershell

Could it be that your deployment history holds more than 800 entries? In the CI environment we have a script that removes deployment history where we see failures, but there is a delay between entries disappearing from the visible view in the portal and the metadata being removed on the server side.

@AlexanderSehr May have some more supporting information for this

AlexanderSehr commented 2 months ago

Hey @aavdberg,

thanks for bringing this up. This is indeed a curious case since there is an auto-cleanup mechanism for deployments at this scope, i.e., if you're closing in on the 800 deployments limit, it will start removing deployments as per the FIFO principle.

That means you somehow managed to create so many deployments that Azure is not able to remove old ones in time. This could, for example, happen in the extreme case of creating 800 deployments at once, but it can also happen with fewer. You mention 'only' 70, which should only run into the aforementioned issue if 70 is higher than the threshold at which Azure starts removing deployments, which as per the link that @ChrisSidebotham provided is 700. I assume you deploy other resources with your DNS Zones that may pump the number of deployments to above 100?

@alex-frankel would you happen to have any advice / a best practice in mind how to avoid this issue when deploying templates that make heavy use of deployments?

@ChrisSidebotham, the script you're referring to matters most at the management group level, as there is no auto-cleanup at that scope. But we did in fact also use it at the subscription level in the past, when we tested our entire library multiple times a day.

teemukom commented 2 months ago

If the resource group has a CanNotDelete lock (as it does when using this module with WAF parameters), the lock prevents deletions from the deployment history.

ChrisSidebotham commented 2 months ago

@aavdberg - Do you have an update on this with regards to the above comments?

aavdberg commented 2 months ago

We have seventy domains that we deploy through an Azure DevOps pipeline. We now use @batchSize(1) to give the automatic cleanup time to do its thing, and then it works, but that makes the deployment take longer. We also tried a batch size of ten, but that fails too because the automatic cleanup is not fast enough.

The domains are deployed to a resource group, and the scope in the Bicep file is also the resource group.

Every domain has several records.

So a deployment is created for every record in the domain.

// Module Public DNS Zone
@batchSize(1)
module publicdnszone 'br/public:avm/res/network/dns-zone:0.2.4' = [
  for dnszone in varPublicDnsZones: {
    name: 'deploy-PublicDnsZone-${dnszone.name}'
    params: {
      name: dnszone.name
      location: 'global'
      lock: {
        kind: 'None'
        name: 'lock'
      }
      a: dnszone.recordSets.?a
      aaaa: dnszone.recordSets.?aaaa
      caa: dnszone.recordSets.?caa
      cname: dnszone.recordSets.?cname
      mx: dnszone.recordSets.?mx
      ptr: dnszone.recordSets.?ptr
      soa: dnszone.recordSets.?soa
      srv: dnszone.recordSets.?srv
      txt: dnszone.recordSets.?txt
      tags: {
        createdBy: 'Bicep through Azure DevOps'
        solutionName: parSolutionName
      }
    }
  }
]

Hope this gives more clarity about the problem.

AlexanderSehr commented 2 months ago

It does, thanks @aavdberg.

And it does yet again surface two challenges with Deployment objects in Azure:

  1. The cleanup is apparently too slow (which is quite surprising to me, as the threshold to start deleting is 700 and the limit is 800 deployments).
  2. Deployments take a lot longer than native resource deployments. For solutions that deploy some heavyweights that take a few minutes to deploy anyway, they make a lot of sense - but for something where you deploy e.g. 70 instances of a resource, I would personally recommend using the AVM module as a reference to implement a Bicep-native resource deployment. This would potentially create only 1 deployment instead of 70 (if implemented accordingly) and be faster as a result. I used to see similar things happening with e.g. PolicyAssignments, where someone tried to deploy around 300 policy assignments through deployments, which just took ages, while the Bicep-native implementation took only a few minutes or less.

I guess what I'm trying to say is that there is always a trade-off. The good news is that the PG is aware and is working on speeding deployments up. ETA unknown, though.
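
For illustration, a Bicep-native variant of the kind described in point 2 could look roughly like the sketch below. It has not been compiled or deployed, and it is not how the AVM module is implemented. The parameter `dnsZones`, its element shape, and the variable `aRecordPairs` are hypothetical, and only A records are covered. The idea is that resources declared directly in a template belong to that template's own deployment, instead of each module call adding a nested deployment to the history.

```bicep
// Sketch under stated assumptions, not the AVM module's implementation.
// Hypothetical input shape:
//   [{ name: 'example.com', a: [{ name: 'www', ttl: 3600, aRecords: [{ ipv4Address: '...' }] }] }]
param dnsZones array

// One zone resource per entry, declared directly instead of via a module.
resource zones 'Microsoft.Network/dnsZones@2018-05-01' = [
  for zone in dnsZones: {
    name: zone.name
    location: 'global'
  }
]

// Resource loops cannot be nested, so flatten all (zone, record) pairs first.
var aRecordPairs = flatten(map(dnsZones, zone => map(zone.?a ?? [], record => {
  zoneName: zone.name
  record: record
})))

// One A record set per flattened pair, addressed as '<zone>/<record>'.
resource aRecordSets 'Microsoft.Network/dnsZones/A@2018-05-01' = [
  for pair in aRecordPairs: {
    name: '${pair.zoneName}/${pair.record.name}'
    properties: {
      TTL: pair.record.?ttl ?? 3600 // 3600 is an assumed default
      ARecords: pair.record.aRecords
    }
    dependsOn: [
      zones // wait for all zones before creating record sets
    ]
  }
]
```

The other record types (AAAA, CNAME, MX, and so on) would need the same flatten-then-loop treatment, which is part of the trade-off: fewer deployment objects, but more template code to maintain than a single module call.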

ChrisSidebotham commented 1 month ago

@alex-frankel - Would you be able to assist in an ETA for the deployments work Alex mentioned above? Is there a better place for this issue regarding deployments than the current BRM Repository?

ChrisSidebotham commented 1 month ago

@sydkar - Any update on moving this to the correct area?

sydkar commented 1 month ago

@ChrisSidebotham No update right now, trying to get an ETA on the deployments speed up work. I'll let you know by the end of the week.

alex-frankel commented 6 days ago

Sorry about the delay on this one. Is @teemukom's comment accurate? If so, then that is likely the source of the issue, no?

@aavdberg, can you take a look at the Deployment resources in the Resource Group (there is a "Deployments" tab in the table of contents of the Resource Group blade in the portal)? Does it eventually get pruned back down to 700? If so, then that suggests @AlexanderSehr is correct that we are not able to keep up with the pace of deployments getting created.

The fact that the deployment works with @batchSize(1) also suggests it's our inability to keep up.

@aavdberg if this is still occurring, can you share a recent correlation ID and we can investigate further?

alex-frankel commented 6 days ago

I'm also curious what the var varPublicDnsZones looks like. That will help us understand how many sub-modules will be created for a given record.
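
The actual variable was not shared in this thread. Judging only from the module call above (`dnszone.name` and `dnszone.recordSets.?a` etc.), it presumably has a shape along the lines of the hypothetical example below. The zone name, record values, and the inner field names (`ttl`, `aRecords`, `txtRecords`) are assumptions for illustration and should be checked against the module's parameter documentation.

```bicep
// Hypothetical example only; not the issue author's real data.
var varPublicDnsZones = [
  {
    name: 'contoso.example'
    recordSets: {
      // Each key is optional; the module call reads them with the .? operator.
      a: [
        {
          name: 'www'
          ttl: 3600
          aRecords: [
            { ipv4Address: '203.0.113.10' }
          ]
        }
      ]
      txt: [
        {
          name: '@'
          ttl: 3600
          txtRecords: [
            { value: ['v=spf1 -all'] }
          ]
        }
      ]
    }
  }
]
```

With a shape like this, the number of entries under each record type per zone is what determines how many sub-module deployments a single zone produces.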

aavdberg commented 5 days ago

Sent you a ping in Teams, @alex-frankel.

aavdberg commented 5 days ago

When using @batchSize(1), the deployment slows down, which gives the deployment history cleanup time to remove the old deployments.