Azure / terraform-azurerm-caf-enterprise-scale

Azure landing zones Terraform module
https://aka.ms/alz/tf
MIT License
872 stars 574 forks

Plan fails when linking 8 or more vnets to the private DNS zones #750

Closed tlfzhylj closed 1 year ago

tlfzhylj commented 1 year ago

Versions

terraform: 1.5.1

azure provider: 3.62

module: 4.0.2

Description

Describe the bug

When using Virtual WAN and a virtual hub, we have to link the peered vnets to the private DNS zones. Otherwise, name resolution will not work when using private endpoints.

I'm using the virtual_network_resource_ids_to_link option in the module to link the vnets to the private DNS zones.

I have now linked 8 vnets this way, which results in a lot of resources in terraform state.

I now get this error message when running terraform plan: virtualnetworklinks.VirtualNetworkLinksClient#Get: Failure sending request: StatusCode=429 -- Original Error: context deadline exceeded

I believe this error message means that I'm making too many requests against the Azure API.

How can I solve this?

Thanks.

matt-FFFFFF commented 1 year ago

Hi @tlfzhylj

This API limit is a known issue, and unfortunately there isn't much we can do about it.

Have you tried using Private DNS Resolver to centralise this?

ghost commented 1 year ago

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 7 days. It will be closed if no further activity occurs within 7 days of this comment.

tlfzhylj commented 1 year ago

No, I haven't tried the private dns resolver. How will that work? Will I hook up the private dns resolver in one vnet, connected to the virtual hub, and have all private dns zones linked to that vnet?

... and then change the dns for the other vnets connected to the virtual hub, to the private dns resolver?

liuwuliuyun commented 1 year ago

Hi @tlfzhylj , thanks for raising this issue. I think you could just create a custom timeout to avoid this.

  1. Does the AzureRM provider handle and retry on error code 429? Yes, it does that automatically and you don't need to worry about it.

  2. Why do you see this 429 error? Because the default context timeout is set to 5 minutes in the AzureRM provider. It cannot finish running GET against over 1000 resources within that time, so the operation times out.

  3. How to solve this? I would suggest that @matt-FFFFFF add a custom timeouts block to this resource and expose the timeout value, with a default setting, to all customers. Then @tlfzhylj could just set the read timeout to "2h" and the error will not appear.

Related Documents on Timeouts: Timeouts - Configuration Language | Terraform | HashiCorp Developer
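
For illustration, a timeouts block on a single link resource might look like the sketch below. This is not the module's code: the resource label, names, zone, and VNet ID are hypothetical placeholders, and only the read = "2h" value comes from the suggestion above.

```hcl
# Minimal sketch, assuming a standalone link resource outside the module.
# All names and IDs below are hypothetical placeholders.
resource "azurerm_private_dns_zone_virtual_network_link" "example" {
  name                  = "example-link"
  resource_group_name   = "rg-dns-example"
  private_dns_zone_name = "privatelink.blob.core.windows.net"
  virtual_network_id    = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-net-example/providers/Microsoft.Network/virtualNetworks/vnet-spoke-example"

  # Raise the read timeout so that refreshing many links during
  # plan/apply is less likely to hit "context deadline exceeded".
  timeouts {
    read = "2h"
  }
}
```

A longer timeout only gives the provider's automatic retries more time to get through throttling. It does not reduce the number of API calls.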

tlfzhylj commented 1 year ago

Hi, @liuwuliuyun Thanks for taking the time to answer.

How can I create a custom timeout, when the linking happens inside the module?

It doesn't seem like the timeouts block is set on the resource inside the module, so it is not configurable: https://github.com/Azure/terraform-azurerm-caf-enterprise-scale/blob/24df4484e4840d308e07d358a6fade7ebb56a16e/resources.connectivity.tf#L492

liuwuliuyun commented 1 year ago

@tlfzhylj, if you want to solve this now without waiting for the module to be updated, there are two ways.

  1. Use Terragrunt with a custom auto-retry rule, see here [On second thought, this may not work]
  2. Fork this repo -> change the timeout in the block you mentioned in your last comment -> use your own GitHub fork as the module source, see here

Hope this helps.
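
As a rough sketch of the second option, the module block could point at the fork through a Git source. The organisation name (your-org) and the branch name (custom-timeouts) are hypothetical placeholders, and the module's other inputs are omitted, so this is a fragment rather than a complete configuration.

```hcl
# Fragment only: "your-org" and "custom-timeouts" are hypothetical.
module "enterprise_scale" {
  # Pull the module from a fork that carries the timeouts change,
  # pinned to a branch or tag via ?ref=
  source = "git::https://github.com/your-org/terraform-azurerm-caf-enterprise-scale.git?ref=custom-timeouts"

  # ... keep the same inputs you pass to the registry module today ...
}
```

After changing the source, terraform init needs to be run again so the module is downloaded from the fork.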

KeynesLee commented 1 year ago

Changing the Terraform timeout is not a solution, nor a good workaround.

In my environment there are more than 50 VNETs, and "terraform apply" takes almost 2 hours to complete, every time. The situation keeps getting worse as the number of both private DNS zones and VNETs increases. We have suffered for a long time and are starting to regret adopting CAF.

I think this could be a solution: centralize all the private DNS zones' virtual network links on 1 VNET (where a DNS forwarder VM or DNS Private Resolver is located), then configure the "DNS servers" setting on the VNETs to point to the IPs of the DNS forwarder VM or DNS Private Resolver.

But how can this be done in the CAF Terraform code?
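
For the spoke side of this idea, a minimal sketch outside the CAF module might look like the following. Every value is a hypothetical placeholder, including the IP address 10.100.0.4, which stands in for the forwarder VM or the resolver's inbound endpoint.

```hcl
# Minimal sketch, assuming the spoke VNET is managed outside the CAF module.
# Names, address space, and the DNS server IP are hypothetical.
resource "azurerm_virtual_network" "spoke_example" {
  name                = "vnet-spoke-example"
  location            = "westeurope"
  resource_group_name = "rg-spoke-example"
  address_space       = ["10.1.0.0/16"]

  # Send DNS queries to the central forwarder / resolver instead of
  # linking this VNET to every private DNS zone.
  dns_servers = ["10.100.0.4"]
}
```

With this layout only the central VNET needs links to the private DNS zones, so the number of link resources no longer grows with the number of spokes.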

matt-FFFFFF commented 1 year ago

As a workaround, we plan to expose the timeouts for the azurerm_private_dns_zone_virtual_network_link resource.

I do understand this is only a workaround; a better architecture is the one that @KeynesLee posted.

We document this architecture, which uses Private DNS Resolver, here: https://learn.microsoft.com/en-us/azure/dns/private-resolver-architecture

tlfzhylj commented 1 year ago

@KeynesLee How do you manage to have 50 VNETs?

Mine stopped working at 8 VNETs. My workaround for now has been to not enable all the private DNS zones, only the ones I actually need. I don't like it, because whenever a team starts using a new service with a private endpoint, they will run into problems until we have enabled the private DNS zone. I have really just bought myself some time with this workaround, because I will hit the limit again very soon.

It was a pain to clean up when I tried to disable the unused private DNS zones. I had to clean up both the Terraform state and the resources directly in Azure to get things working again. Luckily PowerShell exists, and I managed to loop through all the DNS zones and remove the vnet links before deleting each zone. After that I needed to clean up the tf state manually and remove the DNS zones and links I wanted to disable.

KeynesLee commented 1 year ago

@tlfzhylj

As @liuwuliuyun mentioned, we configured a custom Terraform timeout. You need to:

  1. First, reduce the number of private DNS zones or VNETs, then run terraform apply. Make sure Terraform can run smoothly.
  2. Find and modify the CAF module file ./caf-eslz/.terraform/modules/enterprise_scale/resources.connectivity.tf
  3. Add timeouts, as in the figure below, to the resource "azurerm_private_dns_zone" block. [image]
  4. Make ANY change, then run terraform apply again and make sure it completes. The timeouts settings take effect from then on.

KeynesLee commented 1 year ago

@matt-FFFFFF

We wondered whether, in settings.connectivity.tf, we could:

  1. set enable_private_dns_zone_virtual_network_link_on_spokes = false
  2. then configure virtual_network_resource_ids_to_link to point to the DNS private resolver's vnet, and configure every spoke vnet's DNS server setting to point to the DNS private resolver.

Could this work?
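
The configuration being asked about could be sketched roughly as below. It assumes the locals-based layout of settings.connectivity.tf and the settings.dns.config structure of the module's configure_connectivity_resources input. The VNet resource ID is a hypothetical placeholder, and nothing in this thread confirms that the approach resolves the throttling.

```hcl
# Sketch only: the resolver VNet ID below is a hypothetical placeholder.
locals {
  configure_connectivity_resources = {
    settings = {
      dns = {
        config = {
          # Stop linking every spoke VNet to every private DNS zone.
          enable_private_dns_zone_virtual_network_link_on_spokes = false

          # Link only the VNet that hosts the DNS private resolver.
          virtual_network_resource_ids_to_link = [
            "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-dns-example/providers/Microsoft.Network/virtualNetworks/vnet-dns-resolver-example",
          ]
        }
      }
    }
  }
}
```

The spoke VNets' DNS server setting would still need to be pointed at the resolver separately.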