IBM-Cloud / terraform-provider-ibm

https://registry.terraform.io/providers/IBM-Cloud/ibm/latest/docs
Mozilla Public License 2.0

Add more retries to resource group deletion #5537

Closed ocofaigh closed 2 months ago

ocofaigh commented 3 months ago

A common use case is to provision a resource group and an OCP VPC cluster as part of the same Terraform script. When you provision an OCP VPC cluster, it automatically provisions a VPC load balancer. Terraform does not know about this load balancer (it's not in the state file), so when you run terraform destroy, it almost always fails on the first attempt with the error:

 2024/07/22 13:36:19 Terraform destroy |     "Result": {
 2024/07/22 13:36:19 Terraform destroy |         "errors": [
 2024/07/22 13:36:19 Terraform destroy |             {
 2024/07/22 13:36:19 Terraform destroy |                 "code": "NOT_EMPTY",
 2024/07/22 13:36:19 Terraform destroy |                 "message": "Resource groups with active instances can't be deleted. Use the CLI command \"ibmcloud resource service-instances --type all -g \u003cresource-group\u003e\" to check for remaining instances, then delete the instances and try again.",
 2024/07/22 13:36:19 Terraform destroy |                 "more_info": "n/a"
 2024/07/22 13:36:19 Terraform destroy |             }
 2024/07/22 13:36:19 Terraform destroy |         ],

By running the command ibmcloud resource service-instances --type all -g <resource-group> I can see that indeed the group still contains a VPC load balancer - for example:

[
  {
    "guid": "crn:v1:bluemix:public:containers-kubernetes:us-south:a/abac0df06b644a9cabc6e44f55b3880e:cqjqkvvd0c64fpr2h9j0:nlb:nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud",
    "id": "crn:v1:bluemix:public:containers-kubernetes:us-south:a/abac0df06b644a9cabc6e44f55b3880e:cqjqkvvd0c64fpr2h9j0:nlb:nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud",
    "url": "/v2/resource_instances/crn:v1:bluemix:public:containers-kubernetes:us-south:a%2Fabac0df06b644a9cabc6e44f55b3880e:cqjqkvvd0c64fpr2h9j0:nlb:nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud",
    "created_at": "2024-07-29T15:38:22Z",
    "updated_at": "2024-07-29T15:38:22Z",
    "deleted_at": null,
    "name": "nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud",
    "region_id": "us-south",
    "account_id": "abac0df06b644a9cabc6e44f55b3880e",
    "resource_plan_id": "containers.kubernetes.multizone.load.balancer",
    "resource_group_id": "0ed9fc69d01c48a092dd1600f63de2fa",
    "crn": "crn:v1:bluemix:public:containers-kubernetes:us-south:a/abac0df06b644a9cabc6e44f55b3880e:cqjqkvvd0c64fpr2h9j0:nlb:nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud",
    "create_time": 1722267502000,
    "created_by": "iam-ServiceId-1829dcf6-eb99-4760-81ad-6ca95cbab194",
    "state": "active",
    "type": "service_instance",
    "resource_id": "containers-kubernetes",
    "dashboard_url": null,
    "allow_cleanup": false,
    "locked": false,
    "last_operation": {
      "type": "create",
      "state": "succeeded",
      "description": "Instance provisioning is completed.",
      "updated_at": null,
      "cancelable": false
    },
    "account_url": "",
    "resource_plan_url": "",
    "resource_bindings_url": "/v2/resource_instances/crn:v1:bluemix:public:containers-kubernetes:us-south:a%2Fabac0df06b644a9cabc6e44f55b3880e:cqjqkvvd0c64fpr2h9j0:nlb:nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud/resource_bindings",
    "resource_aliases_url": "/v2/resource_instances/crn:v1:bluemix:public:containers-kubernetes:us-south:a%2Fabac0df06b644a9cabc6e44f55b3880e:cqjqkvvd0c64fpr2h9j0:nlb:nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud/resource_aliases",
    "siblings_url": "",
    "target_crn": "crn:v1:bluemix:public:globalcatalog::::deployment:containers.kubernetes.multizone.load.balancer%3Aus-south"
  }
]

If I wait some time, this eventually gets deleted and resource group deletion passes. I would like to propose that the Terraform provider be updated to add more retries when attempting to delete a resource group, to cover such a use case. An even nicer enhancement would be to output the contents remaining in the resource group that are preventing deletion.
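
To illustrate the "output what is remaining" idea, here is a minimal sketch that turns the JSON shown above into one readable line per leftover instance. It assumes the JSON was captured from the CLI command in the error message; the function name summarize_remaining is hypothetical, and only the name, resource_id and state fields from the example output are used.

```python
import json


def summarize_remaining(cli_json: str) -> list[str]:
    """Return one human-readable line per instance still in the resource group."""
    instances = json.loads(cli_json)
    return [
        f'{inst.get("name", "?")} ({inst.get("resource_id", "?")}, state={inst.get("state", "?")})'
        for inst in instances
    ]


# Trimmed-down stand-in for the CLI output; the values are placeholders.
sample = '[{"name": "nlb-example", "resource_id": "containers-kubernetes", "state": "active"}]'
for line in summarize_remaining(sample):
    print(line)
```

Something along these lines could be appended to the NOT_EMPTY error so the user sees which instance is blocking the delete without running the CLI by hand.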

hkantare commented 3 months ago

@ocofaigh As part of cluster delete we already have a check that waits for the load balancer to be deleted: https://github.com/IBM-Cloud/terraform-provider-ibm/blob/67305d7590bf0974badc7d141addde94390c7b75/ibm/service/kubernetes/resource_ibm_container_vpc_cluster.go#L1022 We need to analyze why, even after this wait, the resource group is still not able to disassociate from that particular instance.

hkantare commented 3 months ago

Second approach: as part of resource group delete, add some conditional logic to check for any existing instance association and wait for a certain time.
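
A rough sketch of what such a check-and-wait could look like, written in Python for illustration rather than in the provider's Go. The list_instances callable is a stand-in for whatever call returns the instances still associated with the group, and the timeout and interval defaults are assumptions, not values from the provider.

```python
import time
from typing import Callable, Sequence


def wait_until_empty(
    list_instances: Callable[[], Sequence[dict]],
    timeout_s: float = 600.0,   # assumed default, not taken from the provider
    interval_s: float = 15.0,   # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until list_instances() returns nothing; False if the timeout expires first."""
    deadline = clock() + timeout_s
    while True:
        if not list_instances():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The sleep and clock parameters are injectable only so the loop can be exercised without real waiting; the resource group delete would be attempted once this returns True.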

ocofaigh commented 3 months ago

@hkantare Thanks for the feedback. So it sounds like isWaitForLBDeleted is not working as expected, so that should probably be debugged. I'm able to reproduce it very easily using this code (which is the same as the Red Hat OpenShift Container Platform on VPC landing zone tile in the IBM Cloud catalog).

+1 for the second approach too though, as I have seen other resources with similar issues. PAG is another one, as it provisions an sdnlb that terraform state does not know about

ocofaigh commented 2 months ago

@hkantare Do you think this is something that could be prioritised?

As part of resource group delete add some conditional logic to check for any existing instance association and wait for certain time

It's something that consumers keep hitting, especially since most of the Deployable Architectures available in the IBM Cloud catalog support creating a resource group. When people do a destroy (especially when OCP clusters are destroyed), the resource group delete fails very frequently with:

 2024/08/27 11:40:06 Terraform destroy |       "Result": {
 2024/08/27 11:40:06 Terraform destroy |           "errors": [
 2024/08/27 11:40:06 Terraform destroy |               {
 2024/08/27 11:40:06 Terraform destroy |                   "code": "NOT_EMPTY",
 2024/08/27 11:40:06 Terraform destroy |                   "message": "Resource groups with active instances can't be deleted. Use the CLI command \"ibmcloud resource service-instances --type all -g \u003cresource-group\u003e\" to check for remaining instances, then delete the instances and try again.",
 2024/08/27 11:40:06 Terraform destroy |                   "more_info": "n/a"
 2024/08/27 11:40:06 Terraform destroy |               }
 2024/08/27 11:40:06 Terraform destroy |           ],
 2024/08/27 11:40:06 Terraform destroy |           "trace": "80e645c8-e323-4893-b0c1-b0d8a82ee0b6"
 2024/08/27 11:40:06 Terraform destroy |       },
 2024/08/27 11:40:06 Terraform destroy |       "RawResult": null
 2024/08/27 11:40:06 Terraform destroy |   }

hkantare commented 2 months ago

@ocofaigh We plan to add some retries for resource group delete. Can you share the status code associated with the above error?

ocofaigh commented 2 months ago

@hkantare "StatusCode": 500

Full output:

2024/08/27 11:40:06 Terraform destroy | Error: [ERROR] Error Deleting resource group: Resource groups with active instances can't be deleted. Use the CLI command "ibmcloud resource service-instances --type all -g <resource-group>" to check for remaining instances, then delete the instances and try again. with response code  {
 2024/08/27 11:40:06 Terraform destroy |     "StatusCode": 500,
 2024/08/27 11:40:06 Terraform destroy |     "Headers": {
 2024/08/27 11:40:06 Terraform destroy |         "Cache-Control": [
 2024/08/27 11:40:06 Terraform destroy |             "max-age=0, no-cache, no-store"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Content-Length": [
 2024/08/27 11:40:06 Terraform destroy |             "332"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Content-Type": [
 2024/08/27 11:40:06 Terraform destroy |             "application/json; charset=utf-8"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Date": [
 2024/08/27 11:40:06 Terraform destroy |             "Tue, 27 Aug 2024 11:40:06 GMT"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Etag": [
 2024/08/27 11:40:06 Terraform destroy |             "W/\"14c-POn/BpsPEJ94sjfRFJOtr4bZwxc\""
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Expires": [
 2024/08/27 11:40:06 Terraform destroy |             "Tue, 27 Aug 2024 11:40:06 GMT"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Pragma": [
 2024/08/27 11:40:06 Terraform destroy |             "no-cache"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Server": [
 2024/08/27 11:40:06 Terraform destroy |             "istio-envoy"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Strict-Transport-Security": [
 2024/08/27 11:40:06 Terraform destroy |             "max-age=31536000; includeSubDomains"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Transaction-Id": [
 2024/08/27 11:40:06 Terraform destroy |             "80e645c8-e323-4893-b0c1-b0d8a82ee0b6"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Vary": [
 2024/08/27 11:40:06 Terraform destroy |             "Accept-Encoding"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Content-Type-Options": [
 2024/08/27 11:40:06 Terraform destroy |             "nosniff"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Envoy-Upstream-Service-Time": [
 2024/08/27 11:40:06 Terraform destroy |             "169"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Ratelimit-Limit": [
 2024/08/27 11:40:06 Terraform destroy |             "60"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Ratelimit-Remaining": [
 2024/08/27 11:40:06 Terraform destroy |             "59"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Ratelimit-Reset": [
 2024/08/27 11:40:06 Terraform destroy |             "0"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Request-Id": [
 2024/08/27 11:40:06 Terraform destroy |             "80e645c8-e323-4893-b0c1-b0d8a82ee0b6"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Response-Time": [
 2024/08/27 11:40:06 Terraform destroy |             "166.360ms"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "_request_id": [
 2024/08/27 11:40:06 Terraform destroy |             "80e645c8-e323-4893-b0c1-b0d8a82ee0b6"
 2024/08/27 11:40:06 Terraform destroy |         ]
 2024/08/27 11:40:06 Terraform destroy |     },
 2024/08/27 11:40:06 Terraform destroy |     "Result": {
 2024/08/27 11:40:06 Terraform destroy |         "errors": [
 2024/08/27 11:40:06 Terraform destroy |             {
 2024/08/27 11:40:06 Terraform destroy |                 "code": "NOT_EMPTY",
 2024/08/27 11:40:06 Terraform destroy |                 "message": "Resource groups with active instances can't be deleted. Use the CLI command \"ibmcloud resource service-instances --type all -g \u003cresource-group\u003e\" to check for remaining instances, then delete the instances and try again.",
 2024/08/27 11:40:06 Terraform destroy |                 "more_info": "n/a"
 2024/08/27 11:40:06 Terraform destroy |             }
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "trace": "80e645c8-e323-4893-b0c1-b0d8a82ee0b6"
 2024/08/27 11:40:06 Terraform destroy |     },
 2024/08/27 11:40:06 Terraform destroy |     "RawResult": null
 2024/08/27 11:40:06 Terraform destroy | }
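
Since the status code here is a generic 500, a retry would probably need to key off the error code in the body rather than the HTTP status. A minimal sketch of that check, assuming the response has already been parsed into a dict shaped like the output above (the function name is hypothetical):

```python
def is_not_empty_error(response: dict) -> bool:
    """True when the error body carries the NOT_EMPTY code, whatever the HTTP status."""
    result = response.get("Result") or {}
    errors = result.get("errors") or []
    return any(err.get("code") == "NOT_EMPTY" for err in errors)


# Reduced version of the response above.
response = {
    "StatusCode": 500,
    "Result": {"errors": [{"code": "NOT_EMPTY", "more_info": "n/a"}]},
}
print(is_not_empty_error(response))
```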

hkantare commented 2 months ago

@ocofaigh Added retry logic for deletion of a resource group, with a default timeout of 20 minutes. This should mostly address the cluster ALB and PAG deletion cases.
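
For readers who want a picture of the behaviour, the following is an illustrative Python sketch of a retry-until-timeout loop, not the provider's actual Go code. Only the 20-minute default comes from this thread; NotEmptyError, delete_with_retry, and the 30-second interval are assumptions made for the example.

```python
import time
from typing import Callable


class NotEmptyError(Exception):
    """Hypothetical error raised when the API answers with code NOT_EMPTY."""


def delete_with_retry(
    delete_group: Callable[[], None],
    timeout_s: float = 20 * 60,  # 20-minute default mentioned in this thread
    interval_s: float = 30.0,    # assumed pause between attempts
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Retry delete_group on NotEmptyError until the deadline; return the attempt count."""
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            delete_group()
            return attempts
        except NotEmptyError:
            # Give up and surface the last error once the deadline has passed.
            if clock() >= deadline:
                raise
            sleep(interval_s)
```

Any other exception propagates immediately, so only the "group still has instances" case is retried.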

ocofaigh commented 2 months ago

Thanks, I see it was released in 1.69.0 so going to close this issue. If I see any issues, I'll let you know