Open andrerun opened 1 year ago
@kon-angelo can you please add a short description of the specific mechanics which cause Terraform not to adopt the created NatGateway, and which make a fix in Gardener code impractical?
@andrerun Terraform templates have two directives for declaring resources: `resource` and `data`. The first instructs TF to create and manage a resource; the latter only references a resource that already exists. You can see that for our use case we use the `resource` directive.
In the scenario that you describe, Azure tries to provision a NGW but the provisioning fails. I am not 100% sure how the original Terraform run ended (error or timeout), but after the first run there is a NGW resource listed on Azure which is not imported into the TF state because the creation call was unsuccessful. All subsequent Terraform runs then fail, because TF complains about any existing resource that is not in its state, and you are essentially deadlocked.
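The deadlock can be illustrated with a toy model. This is only a minimal sketch, not Terraform's or the Azure provider's actual implementation: `FakeCloud`, `apply`, and the resource name `shoot-natgw` are hypothetical, and throttling is simulated with a simple flag.

```python
class AlreadyExists(Exception):
    """Raised when a create call hits a name that is already taken."""


class Throttled(Exception):
    """Raised when the simulated cloud API rejects a call due to throttling."""


class FakeCloud:
    """Hypothetical stand-in for the cloud API (not the real Azure SDK)."""

    def __init__(self):
        self.resources = {}  # name -> provisioning state
        self.throttle_next_create = False

    def create(self, name):
        if name in self.resources:
            raise AlreadyExists(name)
        if self.throttle_next_create:
            self.throttle_next_create = False
            # The object is left behind in a failed state although the call errors.
            self.resources[name] = "Failed"
            raise Throttled(name)
        self.resources[name] = "Succeeded"


def apply(cloud, state, name):
    """Simplified `resource` semantics: create what is declared but not in state."""
    if name in state:
        return  # already managed, nothing to do
    cloud.create(name)  # raises on failure, so the state update below is skipped
    state.add(name)


cloud, state = FakeCloud(), set()
cloud.throttle_next_create = True

outcomes = []
for attempt in range(3):
    try:
        apply(cloud, state, "shoot-natgw")
        outcomes.append("ok")
    except (Throttled, AlreadyExists) as err:
        outcomes.append(type(err).__name__)

# The first run is throttled; every later run hits the name clash.
print(outcomes)
```

In this model the state never learns about the failed object, so retrying alone cannot make progress.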
> which make a fix in Gardener code impractical?
The communication between TF and Azure is opaque from Gardener's perspective. We simply declare the target state and let TF perform the operations. As this case shows, there are edge cases that prevent those operations from completing.
In any case, while the throttling is active no operation such as update or delete can go through, hence there is no way to proceed. What would be ideal is for Gardener to have a way to break the deadlock post-incident. So what can we do?
The first suggestion would be to try and adopt the created resources.
The other option is to delete the resource directly in Azure, but again, like the previous point, this requires quite a bit of effort to orchestrate.
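Both options amount to reconciling the Terraform state with what actually exists in Azure. A rough sketch of the idea, again as a toy model: `adopt`, `delete_orphan`, and `plan` are hypothetical helpers and do not correspond to Gardener or Terraform APIs. In real Terraform, the closest equivalent of adoption would be `terraform import`.

```python
def plan(cloud_resources, state, declared):
    """Return what a simplified apply would do for each declared resource."""
    actions = {}
    for name in declared:
        if name in state:
            actions[name] = "managed"   # TF owns it and can update or repair it
        elif name in cloud_resources:
            actions[name] = "conflict"  # exists but unmanaged: the deadlock
        else:
            actions[name] = "create"
    return actions


def adopt(cloud_resources, state, name):
    """Option 1: record an existing cloud object in the state."""
    if name not in cloud_resources:
        raise KeyError(f"{name} does not exist in the cloud")
    state.add(name)


def delete_orphan(cloud_resources, state, name):
    """Option 2: delete an object that exists in the cloud but not in the state."""
    if name in state:
        raise ValueError(f"{name} is managed; refusing to delete it as an orphan")
    cloud_resources.pop(name, None)


declared = ["shoot-natgw"]

# Starting point: a failed, unmanaged gateway blocks every run.
cloud_a, state_a = {"shoot-natgw": "Failed"}, set()
before = plan(cloud_a, state_a, declared)

# Option 1: adopt the leftover object so later runs can manage it.
adopt(cloud_a, state_a, "shoot-natgw")
after_adopt = plan(cloud_a, state_a, declared)

# Option 2: remove the leftover object so the next run can create it cleanly.
cloud_b, state_b = {"shoot-natgw": "Failed"}, set()
delete_orphan(cloud_b, state_b, "shoot-natgw")
after_delete = plan(cloud_b, state_b, declared)

print(before, after_adopt, after_delete)
```

Note that an adopted gateway would still be in a failed state and need repair, and neither call could go through while throttling is still active.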
TL;DR: for the time being we rely heavily on Terraform. TF is useful for declaratively managing infra resources, but at the same time we do not have much ability to intervene and change its behavior. Introducing workarounds like the ones I mentioned above as a wrapper around the current Terraform setup is likely to cause more issues than it solves. Because the workarounds discussed here require a lot of effort to integrate into the extension, the likely solution is instead to proceed with our Terraform removal story, which would give us more control over such incidents.
How to categorize this issue?
/area robustness
/kind bug
/platform azure
This ticket tracks an issue for which a short-term technical solution is not possible. It has, however, caused both substantial pain and a perception of poor Gardener robustness for one or more customers. A customer experiences multiple persistent shoot creation failures and is forced to perform manual cleanup of infrastructure objects created by Gardener. The goal of this ticket is to communicate customer impact, and potentially drive/inform a longer-term change.
What happened: In the context of a shoot creation workflow, Azure reported a NatGateway creation failure due to throttling, and created a NatGateway object in a failed state. Terraform did not adopt the newly created gateway. The gateway object was abandoned as a zombie which Gardener would not delete, and whose clashing name disrupts further attempts by Gardener to create the NatGateway required as part of shoot creation. The outcome is a shoot with persistently failed creation, plus an infrastructure object which requires manual cleanup.
The presumed Azure throttling restriction is subscription-specific, so an occurrence affects a single Gardener customer, but in an automated scenario it is likely to result in multiple failed shoots for that customer.
The problem cannot be immediately resolved in Gardener, because the underlying cause, as currently understood, is a conflict between Azure's failure mode in that specific scenario, and Terraform. TBD: A more precise description of these underlying mechanics is to be added to this ticket shortly.