gardener / gardener-extension-provider-azure

Gardener extension controller for the Azure cloud provider (https://azure.microsoft.com).
https://gardener.cloud
Apache License 2.0

[CPM] Restoration of cluster fails if its `Infrastructure` resource on the source `Seed` was annotated with `migration.azure.provider.extensions.gardener.cloud/zone` #827

Open plkokanov opened 2 months ago

plkokanov commented 2 months ago

How to categorize this issue?

/area control-plane-migration /kind bug /platform azure

What happened: During control plane migration of an HA shoot cluster (using zones z1, z2, and z3) whose infrastructure resource is annotated with migration.azure.provider.extensions.gardener.cloud/zone, restoration of the infrastructure resource fails with the following error:

* creating Subnet: (Name "<vnet-name>-nodes-z3" / Virtual Network Name "<vnet-name>" / Resource Group "<resource-group-name>"): network.SubnetsClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="NetcfgSubnetRangesOverlap" Message="Subnet '<vnet-name>-nodes-z3' is not valid because its IP address range overlaps with that of an existing subnet in virtual network '<vnet-name>'." Details=[]
  with azurerm_subnet.workers-z3,
  on main.tf line 167, in resource "azurerm_subnet" "workers-z3":
 167: resource "azurerm_subnet" "workers-z3" {
* deleting Subnet: (Name "<vnet-name>-nodes" / Virtual Network Name "<vnet-name>" / Resource Group "<resource-group-name>"): network.SubnetsClient#Delete: Failure sending request: StatusCode=400 -- Original Error: Code="InUseSubnetCannotBeDeleted" Message="Subnet<vnet-name>-nodes is in use by /subscriptions/<omitted>/resourceGroups/<resource-group-name>/providers/Microsoft.Network/networkInterfaces/<nic-id>-NIC/ipConfigurations/<nic-id>-NIC and cannot be deleted. In order to delete the subnet, delete all the resources within the subnet. See aka.ms/deletesubnet." Details=[]]

Basically, during the restore phase of control plane migration for the infrastructure resource, the provider-azure extension tried to delete the <vnet-name>-nodes subnet and create <vnet-name>-nodes-z3. This seems to have happened because the infrastructure resource in the destination seed did not have a migration.azure.provider.extensions.gardener.cloud/zone: "3" annotation.
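
To illustrate why the missing annotation produces exactly this delete/create pair, here is a minimal Go sketch. It is not the extension's code: subnetName is a hypothetical helper, and the naming rule (the zone named in the annotation keeps the old non-zonal subnet, every other zone gets a -z<N> suffix) is an assumption inferred from the error output above.

```go
package main

import "fmt"

// subnetName is a hypothetical helper, NOT the extension's actual code.
// Assumption inferred from the error output: the zone named in the
// migration annotation keeps the pre-existing "<vnet>-nodes" subnet,
// while every other zone gets a "<vnet>-nodes-z<N>" subnet.
// migratedZone is nil when the annotation is absent.
func subnetName(vnet string, zone int, migratedZone *int) string {
	if migratedZone != nil && *migratedZone == zone {
		return vnet + "-nodes"
	}
	return fmt.Sprintf("%s-nodes-z%d", vnet, zone)
}

func main() {
	annotated := 3
	for _, zone := range []int{1, 2, 3} {
		fmt.Printf("zone %d: with annotation %q, without annotation %q\n",
			zone,
			subnetName("vnet", zone, &annotated),
			subnetName("vnet", zone, nil))
	}
}
```

Under this assumption, zone 3 resolves to a different subnet name once the annotation is gone, so the existing subnet is scheduled for deletion while a new one with the same address range is created.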

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

The migration.azure.provider.extensions.gardener.cloud/zone annotation is put on the infrastructure resource via a mutating webhook here: https://github.com/gardener/gardener-extension-provider-azure/blob/b859d7beb856dcc3e461e36da8f6309ccd6115f5/pkg/webhook/infrastructure/layout.go#L132-L141

In this case, this mutating code did not get executed because of the following:

  1. As part of normal reconciliation of the infrastructure resource, its .status.providerStatus field is saved in .status.state.savedProviderStatus.
  2. During the migrate phase of CPM, gardenlet takes this .status.state.savedProviderStatus and saves it in the ShootState.
  3. During the restore phase of CPM, gardenlet creates an infrastructure resource in the destination seed, then copies the .status.state.savedProviderStatus from the ShootState into the infrastructure's .status.state.savedProviderStatus field.
  4. Afterwards, gardenlet annotates the infrastructure resource with gardener.cloud/operation: restore to trigger restoration.
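
The steps above can be sketched as a small Go model. Everything here is a simplification for illustration: the infrastructure and shootState types and the migrate and restore functions are hypothetical, not gardenlet's API. The only point is that the saved provider status travels through the ShootState while the annotations on the source resource do not.

```go
package main

import "fmt"

const zoneAnnotation = "migration.azure.provider.extensions.gardener.cloud/zone"

// infrastructure is a hypothetical, heavily simplified stand-in for the
// Infrastructure resource; only the fields relevant to this issue are kept.
type infrastructure struct {
	Annotations         map[string]string
	SavedProviderStatus string // models .status.state.savedProviderStatus
}

// shootState is a hypothetical stand-in for the data kept in the ShootState.
type shootState struct {
	SavedProviderStatus string
}

// migrate models step 2: only the saved state is persisted.
func migrate(src infrastructure) shootState {
	return shootState{SavedProviderStatus: src.SavedProviderStatus}
}

// restore models steps 3 and 4: a new resource is created from the
// ShootState and annotated with the restore operation.
func restore(s shootState) infrastructure {
	return infrastructure{
		Annotations:         map[string]string{"gardener.cloud/operation": "restore"},
		SavedProviderStatus: s.SavedProviderStatus,
	}
}

func main() {
	src := infrastructure{
		Annotations:         map[string]string{zoneAnnotation: "3"},
		SavedProviderStatus: `{"networks":{"layout":"MultipleSubnet"}}`,
	}
	dst := restore(migrate(src))
	_, hasZone := dst.Annotations[zoneAnnotation]
	fmt.Println("zone annotation on destination:", hasZone)
}
```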

During the updates to the infrastructure resource in steps 3 and 4, the mutating webhook does not make any changes because it exits early due to these checks: https://github.com/gardener/gardener-extension-provider-azure/blob/b859d7beb856dcc3e461e36da8f6309ccd6115f5/pkg/webhook/infrastructure/layout.go#L117-L130

Even if the status.providerStatus is patched with the one from the status.state.savedProviderStatus, the mutating webhook would still not perform any changes because the status.providerStatus would contain the following:

  "providerStatus": {
    "apiVersion": "azure.provider.extensions.gardener.cloud/v1alpha1",
    "availabilitySets": [],
    "kind": "InfrastructureStatus",
    "networks": {
      "layout": "MultipleSubnet",

Hence nil is returned here: https://github.com/gardener/gardener-extension-provider-azure/blob/b859d7beb856dcc3e461e36da8f6309ccd6115f5/pkg/webhook/infrastructure/layout.go#L128-L130
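
The early exit can be modelled with a short Go sketch. This is not the code in layout.go: the infra type, the mutate function and its boolean return are hypothetical, the check for a missing provider status is an assumption about what the linked lines do, and "SingleSubnet" is assumed to be the layout value of a not yet migrated cluster. Only the "MultipleSubnet" exit is taken directly from the description above.

```go
package main

import "fmt"

const zoneAnnotation = "migration.azure.provider.extensions.gardener.cloud/zone"

// infra is a hypothetical, simplified stand-in for the Infrastructure resource.
type infra struct {
	Annotations  map[string]string
	StatusLayout string // models .status.providerStatus.networks.layout; "" means no provider status
}

// mutate is a hypothetical model of the webhook's decision, not the real
// implementation. It reports whether the zone annotation was added.
func mutate(obj *infra, zone string) bool {
	// Assumed check: without a provider status there is nothing to compare.
	if obj.StatusLayout == "" {
		return false
	}
	// Described in the issue: a status that already reports the
	// MultipleSubnet layout makes the webhook return without changes.
	if obj.StatusLayout == "MultipleSubnet" {
		return false
	}
	if obj.Annotations == nil {
		obj.Annotations = map[string]string{}
	}
	obj.Annotations[zoneAnnotation] = zone
	return true
}

func main() {
	restored := &infra{StatusLayout: "MultipleSubnet"}
	fmt.Println("restored resource annotated:", mutate(restored, "3"))

	legacy := &infra{StatusLayout: "SingleSubnet"} // assumed pre-migration layout value
	fmt.Println("legacy resource annotated:", mutate(legacy, "3"))
}
```

In this model the restored resource can never receive the annotation, because the status copied from the source seed already reports the migrated layout.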

What you expected to happen: Cluster to be restored successfully.

Environment:

plkokanov commented 1 month ago

/assign