citrix / terraform-provider-citrixadc

Part of NetScaler Automation Toolkit | https://github.com/netscaler/automation-toolkit
https://registry.terraform.io/providers/citrix/citrixadc
Apache License 2.0

[BUG] terraform state gets corrupted without ability to restore #1082

Open im-dim opened 1 year ago

im-dim commented 1 year ago

Contact us

For any immediate issues or help, reach out to us at NetScaler-AutomationToolkit@cloud.com!

Bug Report

I get the error below if a configuration object was built by Terraform but was later deleted from the VPX (for any reason).

Error: [ERROR] FindResourceArrayWithParams: non zero errorcode 461

After that you can neither apply nor destroy anything, and the only way to continue appears to be deleting the Terraform state, rebuilding the VPX, and re-applying the Citrix configuration.

To Reproduce

Steps to reproduce the behavior:

  1. Bind an ECC curve to the VPN vservers, e.g.:

     resource "citrixadc_sslvserver_ecccurve_binding" "P_256" {
       for_each = { for vpnvserver in local.vservers : "${vpnvserver.product_name}-${vpnvserver.product_env}" => vpnvserver }

       ecccurvename = "P_256"
       vservername  = citrixadc_vpnvserver.vpnvserver["${each.value.product_name}-${each.value.product_env}"].name
     }

  2. Remove the ECC curve from the config directly on the NetScaler:

     unbind ssl vserver XXX -eccCurveName P_256
     unbind ssl vserver YYY -eccCurveName P_256

  3. Run terraform apply

  4. Observe the error shown above on the console

Expected behaviour

Config object should be re-created if it's not found on the device.
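For comparison, with a provider that handles drift gracefully, a normal refresh notices that the object is gone and simply schedules it for re-creation; a minimal sketch of that expected workflow, using only generic Terraform commands (nothing here is specific to this provider):

  # A refresh during plan should detect the missing object and plan to re-create it:
  terraform plan

  # Alternatively, reconcile the state with reality without changing any infrastructure:
  terraform apply -refresh-only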


ravager-dk commented 1 year ago

This is actually a misconception about how Terraform works. Yes, some providers can handle this situation, but for complex resource types this becomes problematic. The correct flow is to remove the deleted resources from the state with "terraform state rm" (https://developer.hashicorp.com/terraform/cli/commands/state/rm).

An alternative solution is to recreate the resource in the NetScaler directly and then import it into the Terraform state.

Using Terraform to continuously configure your infrastructure requires you to make changes to the resources only through Terraform.
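For example, using the binding from the original report (the for_each key "foo-prod" is hypothetical, and the import ID format is specific to each resource, so check the resource's documentation):

  # Drop the orphaned binding from state so Terraform stops tracking it:
  terraform state rm 'citrixadc_sslvserver_ecccurve_binding.P_256["foo-prod"]'

  # Or, after recreating the object on the NetScaler, import it back into state:
  terraform import 'citrixadc_sslvserver_ecccurve_binding.P_256["foo-prod"]' <import-id>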

im-dim commented 1 year ago

> This is actually a misconception about how Terraform works. Yes, some providers can handle this situation, but for complex resource types this becomes problematic. The correct flow is to remove the deleted resources from the state with "terraform state rm" (https://developer.hashicorp.com/terraform/cli/commands/state/rm).
>
> An alternative solution is to recreate the resource in the NetScaler directly and then import it into the Terraform state.
>
> Using Terraform to continuously configure your infrastructure requires you to make changes to the resources only through Terraform.

The reported issue is just an example, but we have had many corrupted tfstate files when running VPX failover tests between AZs, and the only way out was to a) delete the tfstate on S3, b) delete the lock from DynamoDB, and c) rebuild both VPXes.

In order to have a more stable environment we switched to template files, but the problem there is that you can't delete some resources when parameters change (e.g. removing a VIP).

im-dim commented 1 year ago

You should be able to reproduce this issue by, let's say, "corrupting" both VPXes (terminating the instances) and then trying to rebuild them by re-running Terraform...

The above should give you a bunch of the errors below; Terraform will fail, and you'll end up with a corrupted tfstate.

 Error: [ERROR] FindResourceArrayWithParams: non zero errorcode 344
 Error: [ERROR] FindResourceArrayWithParams: non zero errorcode 461

And the only way to restore the environment is to delete the tfstate on S3, delete the lock from DynamoDB, and re-init the environment.
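A rough sketch of that recovery (the bucket name and state key are placeholders for our backend settings; the lock ID is printed in the locking error):

  # Delete the corrupted state object from the S3 backend:
  aws s3 rm s3://my-tfstate-bucket/netscaler/terraform.tfstate

  # Release the stale state lock:
  terraform force-unlock <LOCK_ID>

  # Re-initialise and re-apply from scratch:
  terraform init
  terraform apply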

Is that expected?

Shouldn't the absence of a resource be detected and a new one created, as is done for any other AWS resource?

kaiAsmOne commented 1 year ago

I do believe there is some gold for you to discover in @ravager-dk's comment. I do LARGE deploys in Azure and I do not have this issue, but I have to do Terraform state work as @ravager-dk suggests.

My roadmap is to go canary or blue/green deploys only in the future. Whenever I make a change to a NetScaler, I deploy a fresh NetScaler from code. Once it's deployed, I switch the Azure LB in front of the NetScaler. Blue/green or canary deploys will make issues like this non-existent.