[CPM] Restoration of cluster fails if its `Infrastructure` resource on the source `Seed` was annotated with `migration.azure.provider.extensions.gardener.cloud/zone` #827
What happened:
During control plane migration of an HA shoot cluster (using zones `z1`, `z2`, and `z3`), for which the infrastructure resource is annotated with `migration.azure.provider.extensions.gardener.cloud/zone`, the infrastructure resource is not successfully restored with the following error:
* creating Subnet: (Name "<vnet-name>-nodes-z3" / Virtual Network Name "<vnet-name>" / Resource Group "<resource-group-name>"): network.SubnetsClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="NetcfgSubnetRangesOverlap" Message="Subnet '<vnet-name>-nodes-z3' is not valid because its IP address range overlaps with that of an existing subnet in virtual network '<vnet-name>'." Details=[]
with azurerm_subnet.workers-z3,
on main.tf line 167, in resource "azurerm_subnet" "workers-z3":
167: resource "azurerm_subnet" "workers-z3" {
* deleting Subnet: (Name "<vnet-name>-nodes" / Virtual Network Name "<vnet-name>" / Resource Group "<resource-group-name>"): network.SubnetsClient#Delete: Failure sending request: StatusCode=400 -- Original Error: Code="InUseSubnetCannotBeDeleted" Message="Subnet<vnet-name>-nodes is in use by /subscriptions/<omitted>/resourceGroups/<resource-group-name>/providers/Microsoft.Network/networkInterfaces/<nic-id>-NIC/ipConfigurations/<nic-id>-NIC and cannot be deleted. In order to delete the subnet, delete all the resources within the subnet. See aka.ms/deletesubnet." Details=[]]
Basically, during the `restore` phase of control plane migration for the infrastructure resource, the `provider-azure` extension tried to delete the `<vnet-name>-nodes` subnet and create `<vnet-name>-nodes-z3`. This seems to have happened because the infrastructure resource in the destination seed did not have a `migration.azure.provider.extensions.gardener.cloud/zone: "3"` annotation.
How to reproduce it (as minimally and precisely as possible):
How to categorize this issue?
/area control-plane-migration /kind bug /platform azure
Anything else we need to know?:
The `migration.azure.provider.extensions.gardener.cloud/zone` annotation is put on the infrastructure resource via a mutating webhook here: https://github.com/gardener/gardener-extension-provider-azure/blob/b859d7beb856dcc3e461e36da8f6309ccd6115f5/pkg/webhook/infrastructure/layout.go#L132-L141

In this case, this mutating code did not get executed because of the following:

1. As part of normal reconciliation of the infrastructure resource, its `.status.providerStatus` field is saved in the `.status.state.providerStatus`.
2. During the `migrate` phase of CPM, `gardenlet` takes this `.status.state.savedProviderStatus` and saves it in the `ShootState`.
3. During the `restore` phase of CPM, `gardenlet` creates an infrastructure resource in the destination seed, then it copies the `.status.state.savedProviderStatus` from the `ShootState` and adds it to the infrastructure's `.status.state.savedProviderStatus` field.
4. Afterwards, `gardenlet` annotates the infrastructure resource with `gardener.cloud/operation: restore` to trigger restoration.

During the updates to the infrastructure resource in 3 and 4 the mutating webhook does not make any changes as it exits early due to these checks: https://github.com/gardener/gardener-extension-provider-azure/blob/b859d7beb856dcc3e461e36da8f6309ccd6115f5/pkg/webhook/infrastructure/layout.go#L117-L130

Even if the `status.providerState` is patched with the one from the `status.state.providerState`, the mutating webhook would still not perform any changes because the `status.providerState` would contain the following:

Hence nil is returned here: https://github.com/gardener/gardener-extension-provider-azure/blob/b859d7beb856dcc3e461e36da8f6309ccd6115f5/pkg/webhook/infrastructure/layout.go#L128-L130
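
To make the sequence easier to follow, here is a minimal, self-contained Go sketch of the restore flow described above. It is not the actual provider-azure webhook code: the `Infrastructure` and `ProviderStatus` types, the `MigratedZone` field, and the `mutate` and `simulateRestore` functions are all hypothetical stand-ins, and the early-exit condition is an assumption that only models "the webhook returns early while `.status.providerStatus` is not populated".

```go
package main

import "fmt"

const zoneAnnotation = "migration.azure.provider.extensions.gardener.cloud/zone"

// ProviderStatus is a hypothetical, stripped-down provider status.
// MigratedZone is an assumed field naming the zone that keeps the legacy subnet.
type ProviderStatus struct {
	MigratedZone string
}

// Infrastructure is a hypothetical stand-in for the extensions Infrastructure resource.
type Infrastructure struct {
	Annotations    map[string]string
	ProviderStatus *ProviderStatus // models .status.providerStatus
	SavedState     *ProviderStatus // models the provider status saved under .status.state
}

// mutate mimics a mutating webhook that only looks at .status.providerStatus
// and returns early (without changes) when it carries no zone information.
// It reports whether the annotation was set.
func mutate(infra *Infrastructure) bool {
	if infra.ProviderStatus == nil || infra.ProviderStatus.MigratedZone == "" {
		return false // early exit: nothing to derive the annotation from
	}
	if infra.Annotations == nil {
		infra.Annotations = map[string]string{}
	}
	infra.Annotations[zoneAnnotation] = infra.ProviderStatus.MigratedZone
	return true
}

// simulateRestore replays steps 3 and 4 of the restore phase on a fresh resource.
func simulateRestore() *Infrastructure {
	saved := &ProviderStatus{MigratedZone: "3"} // taken from the ShootState

	infra := &Infrastructure{} // created in the destination seed
	mutate(infra)

	infra.SavedState = saved // step 3: only the saved state is populated
	mutate(infra)

	infra.Annotations = map[string]string{"gardener.cloud/operation": "restore"} // step 4
	mutate(infra)

	return infra
}

func main() {
	infra := simulateRestore()
	_, ok := infra.Annotations[zoneAnnotation]
	fmt.Printf("zone annotation present after restore: %v\n", ok)
}
```

In this model the webhook is invoked on every update but never sees a populated `.status.providerStatus`, so the zone annotation is never added before restoration is triggered.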
What you expected to happen: Cluster to be restored successfully.