IBM-Cloud / terraform-provider-ibm

https://registry.terraform.io/providers/IBM-Cloud/ibm/latest/docs
Mozilla Public License 2.0

When K8s cluster update issues occur during terraform plan/apply, need to be able to conduct another plan/apply to resume updates #2379

Closed: bemahone closed this issue 2 years ago

bemahone commented 3 years ago

Terraform Version

Terraform 0.13 with the v1.21.2 IBM Cloud provider (running under Schematics)

Affected Resource(s)

ibm_container_vpc_cluster

Details

This issue is a follow-up to #1978. Great progress has been made on steady-state patching support in the IBM Cloud provider: terraform plan/apply now updates one worker at a time, based on the desired patch_version of the workers, when wait_for_worker_update and update_all_workers are set to true (see the sketch below).
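
For reference, here is a minimal sketch of the kind of cluster configuration in play. The patch_version, wait_for_worker_update, and update_all_workers arguments are the ones discussed above; all other names and values are illustrative placeholders, not our actual configuration:

    resource "ibm_container_vpc_cluster" "cluster" {
      name              = "my-vpc-cluster"      # illustrative name
      vpc_id            = var.vpc_id            # assumed to be defined elsewhere
      flavor            = "bx2.4x16"
      worker_count      = 3
      resource_group_id = var.resource_group_id
      kube_version      = "1.20"                # illustrative Kubernetes version
      patch_version     = var.patch_version     # desired worker patch level

      # Update workers one at a time and block until each update completes.
      wait_for_worker_update = true
      update_all_workers     = true

      zones {
        subnet_id = var.subnet_id
        name      = "us-south-1"
      }
    }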

When cloud issues, or issues within Terraform/Schematics, arise during a patching/update activity and leave some of the cluster's worker nodes not yet updated, a second attempt at a terraform plan/apply is unable to patch the remaining workers that were not updated before the error occurred in Terraform.

Here are a few examples of issues that we have seen during a terraform plan/apply:

Example 1 (IAM token timeout):

 2021/03/17 23:02:43 Terraform apply | 
 2021/03/17 23:02:43 Terraform apply | Error: Error waiting for cluster (c0sh3agw0siks096a9dg) worker nodes kube version to be updated: Error retrieving worker of container vpc cluster: Authentication failed, Unable to refresh auth token: Post "https://iam.cloud.ibm.com/identity/token": dial tcp 104.93.76.208:443: i/o timeout. Try again later
 2021/03/17 23:02:43 Terraform apply | 
 2021/03/17 23:02:43 Terraform apply |   on cluster/cluster.tf line 30, in resource "ibm_container_vpc_cluster" "cluster":
 2021/03/17 23:02:43 Terraform apply |   30:  resource "ibm_container_vpc_cluster" "cluster" {
 2021/03/17 23:02:43 Terraform apply | 
 2021/03/17 23:02:43 Terraform apply | 
 2021/03/17 23:02:43 Terraform APPLY error: Terraform APPLY errorexit status 1
 2021/03/17 23:02:43 Could not execute action

Example 2 (IBM Cloud Containers API is down):

 2021/03/18 17:54:48 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[0]: Still modifying... [id=c15vbj6w0nhnoeq10p40, 1h31m20s elapsed]
 2021/03/18 17:54:58 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[0]: Still modifying... [id=c15vbj6w0nhnoeq10p40, 1h31m30s elapsed]
 2021/03/18 17:55:08 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[0]: Still modifying... [id=c15vbj6w0nhnoeq10p40, 1h31m40s elapsed]
 2021/03/18 17:55:10 Terraform apply | 
 2021/03/18 17:55:10 Terraform apply | Error: Error waiting for cluster (c15vbj6w0nhnoeq10p40) worker nodes kube version to be updated: Error retrieving worker of container vpc cluster: Request failed with status code: 521, ServerErrorResponse: <!DOCTYPE html>
 2021/03/18 17:55:10 Terraform apply | <!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
 2021/03/18 17:55:10 Terraform apply | <!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
 2021/03/18 17:55:10 Terraform apply | <!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
 2021/03/18 17:55:10 Terraform apply | <!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
 2021/03/18 17:55:10 Terraform apply | <head>
 2021/03/18 17:55:10 Terraform apply | <meta http-equiv="refresh" content="0">
 2021/03/18 17:55:10 Terraform apply | 
 2021/03/18 17:55:10 Terraform apply | <title>us-south.containers.cloud.ibm.com | 521: Web server is down</title>
 2021/03/18 17:55:10 Terraform apply | <meta charset="UTF-8" />
 2021/03/18 17:55:10 Terraform apply | <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
 2021/03/18 17:55:10 Terraform apply | <meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
 2021/03/18 17:55:10 Terraform apply | <meta name="robots" content="noindex, nofollow" />
 2021/03/18 17:55:10 Terraform apply | <meta name="viewport" content="width=device-width,initial-scale=1" />
 2021/03/18 17:55:10 Terraform apply | <link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/main.css" type="text/css" media="screen,projection" />
 2021/03/18 17:55:10 Terraform apply | 
 2021/03/18 17:55:10 Terraform apply | 
 2021/03/18 17:55:10 Terraform apply | </head>

Example 3 (Cluster state not equal to normal - timeout):

2021/03/16 22:54:42 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[0]: Still modifying... [id=c15v4taw0961ffejojh0, 1h33m20s elapsed]
2021/03/16 22:54:52 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[0]: Still modifying... [id=c15v4taw0961ffejojh0, 1h33m30s elapsed]
2021/03/16 22:55:02 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[0]: Still modifying... [id=c15v4taw0961ffejojh0, 1h33m40s elapsed]
2021/03/16 22:55:12 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[0]: Still modifying... [id=c15v4taw0961ffejojh0, 1h33m50s elapsed]
2021/03/16 22:55:22 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[0]: Still modifying... [id=c15v4taw0961ffejojh0, 1h34m0s elapsed]
2021/03/16 22:55:32 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[0]: Still modifying... [id=c15v4taw0961ffejojh0, 1h34m10s elapsed]
2021/03/16 22:55:35 Terraform apply |
2021/03/16 22:55:35 Terraform apply | Error: Error waiting for cluster (c15v4taw0961ffejojh0) worker nodes kube version to be updated: timeout while waiting for state to become 'normal' (last state: 'updating', timeout: 1h0m0s)
2021/03/16 22:55:35 Terraform apply |
2021/03/16 22:55:35 Terraform apply |   on cluster/cluster.tf line 30, in resource "ibm_container_vpc_cluster" "cluster":
2021/03/16 22:55:35 Terraform apply |   30:  resource "ibm_container_vpc_cluster" "cluster" {
2021/03/16 22:55:35 Terraform apply |
2021/03/16 22:55:35 Terraform apply |
2021/03/16 22:55:35 Terraform apply |
2021/03/16 22:55:35 Terraform apply | Error: Error waiting for cluster (c15v4urw0v8lok3e8k30) worker nodes kube version to be updated: timeout while waiting for state to become 'normal' (last state: 'updating', timeout: 1h0m0s)
2021/03/16 22:55:35 Terraform apply |
2021/03/16 22:55:35 Terraform apply |   on cluster/cluster.tf line 30, in resource "ibm_container_vpc_cluster" "cluster":
2021/03/16 22:55:35 Terraform apply |   30:  resource "ibm_container_vpc_cluster" "cluster" {
2021/03/16 22:55:35 Terraform apply |
2021/03/16 22:55:35 Terraform apply |
2021/03/16 22:55:35 Terraform apply |
2021/03/16 22:55:35 Terraform apply | Error: Error waiting for cluster (c15v4uiw06ov69i10p2g) worker nodes kube version to be updated: timeout while waiting for state to become 'normal' (last state: 'updating', timeout: 1h0m0s)
2021/03/16 22:55:35 Terraform apply |
2021/03/16 22:55:35 Terraform apply |   on cluster/cluster.tf line 30, in resource "ibm_container_vpc_cluster" "cluster":
2021/03/16 22:55:35 Terraform apply |   30:  resource "ibm_container_vpc_cluster" "cluster" {
2021/03/16 22:55:35 Terraform apply |
2021/03/16 22:55:35 Terraform apply |
2021/03/16 22:55:35 Terraform APPLY error: Terraform APPLY errorexit status 1
2021/03/16 22:55:35 Could not execute action

For example 3, we added a timeouts block with an update value to see whether the timeout described above still occurs with that setting in place:

    timeouts {
      create = "120m"
      update = "120m"
    }
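
For clarity, that timeouts block is nested inside the cluster resource itself; a minimal sketch of the placement (other arguments omitted):

    resource "ibm_container_vpc_cluster" "cluster" {
      # ... other cluster arguments as before ...

      # Allow up to two hours for create and update operations before
      # Terraform reports a timeout (example 3 above hit the 1h0m0s limit).
      timeouts {
        create = "120m"
        update = "120m"
      }
    }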

In the three examples described above, if a subsequent terraform plan/apply were able to continue the patching process and update the remaining worker nodes, we could easily work around any terraform plan/apply issues.

What we observed during subsequent terraform plan/applies is that Terraform does not believe there is anything left to do, since the patch_version was already set during the previous plan/apply. If subsequent terraform plan/applies could check the state of the clusters and workers, detect any workers that are outdated and do not match the patch_version, and treat that as work still to be done, that would give us a great way to deal with the unpredictable issues that can occur over time: we could simply reapply the Terraform.

TBradCreech commented 3 years ago

Tested this retry fix in 1.23.1 and it worked great -- except that, unfortunately, today was a day when several cloud users experienced 404 errors while running basic cluster and worker commands. As a result, when doing an apply to test the retry logic, the apply would inevitably fail too. In some cases the error occurred after a worker had been updated, in other cases even sooner. On the fourth attempt of the day, we saw that the provider seemingly tried to destroy and re-create the cluster -- which deletes all of the workers and disrupts the application that had been installed, configured, and running!

What could have caused the provider to attempt a destroy/re-create? Did the 404s trigger an execution path where the provider wanted to start fresh as a worst-case recovery mechanism? Regardless, deleting a cluster seems too harsh... especially since, in this case, the re-create failed, leaving a long window of time in which the outage might not be noticed -- not to mention the added time needed to re-create the cluster and re-install the application/services once someone does discover the problem.

Harini @hkantare, please refer to the logs I posted in Slack on this. Here is a relevant log snip:

 2021/04/12 21:02:12 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[2]: Refreshing state... [id=c1f7n7ow03ojld3tqn00]
 2021/04/12 21:02:12 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[1]: Refreshing state... [id=c1f7n88w02n9ac41cje0]
 2021/04/12 21:02:12 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[0]: Refreshing state... [id=c1f7n8kw0opgtm605fpg]
 2021/04/12 21:02:26 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[2]: Destroying... [id=c1f7n7ow03ojld3tqn00]
 2021/04/12 21:02:26 Terraform apply | module.subnets_and_acls.ibm_is_network_acl.kube_subnet_acl[0]: Modifying... [id=r014-72acad89-fda0-4dc9-bd4d-6595f3578033]
 2021/04/12 21:02:36 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[2]: Still destroying... [id=c1f7n7ow03ojld3tqn00, 10s elapsed]
 2021/04/12 21:02:36 Terraform apply | module.subnets_and_acls.ibm_is_network_acl.kube_subnet_acl[0]: Still modifying... [id=r014-72acad89-fda0-4dc9-bd4d-6595f3578033, 10s elapsed]
 2021/04/12 21:02:39 Terraform apply | module.subnets_and_acls.ibm_is_network_acl.kube_subnet_acl[0]: Modifications complete after 13s [id=r014-72acad89-fda0-4dc9-bd4d-6595f3578033]
 2021/04/12 21:02:46 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[2]: Still destroying... [id=c1f7n7ow03ojld3tqn00, 20s elapsed]
 2021/04/12 21:02:56 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[2]: Still destroying... [id=c1f7n7ow03ojld3tqn00, 30s elapsed]
 2021/04/12 21:03:06 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[2]: Still destroying... [id=c1f7n7ow03ojld3tqn00, 40s elapsed]
 2021/04/12 21:03:16 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[2]: Still destroying... [id=c1f7n7ow03ojld3tqn00, 50s elapsed]
 2021/04/12 21:03:26 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[2]: Still destroying... [id=c1f7n7ow03ojld3tqn00, 1m0s elapsed]
 2021/04/12 21:03:35 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[2]: Destruction complete after 1m9s
 2021/04/12 21:03:36 Terraform apply | module.iks_cluster.ibm_container_vpc_cluster.cluster[2]: Creating...
 2021/04/12 21:03:41 Terraform apply | 
 2021/04/12 21:03:41 Terraform apply | Warning: Attribute is deprecated
 2021/04/12 21:03:41 Terraform apply | 
 2021/04/12 21:03:41 Terraform apply | The generation field is deprecated and will be removed after couple of
 2021/04/12 21:03:41 Terraform apply | releases
 2021/04/12 21:03:41 Terraform apply | 
 2021/04/12 21:03:41 Terraform apply | 
 2021/04/12 21:03:41 Terraform apply | Error: Request failed with status code: 409, ServerErrorResponse: {"incidentID":"63ef66fa351fe043-DFW","code":"E0007","description":"A cluster with the same name already exists. Choose another name.","type":"Provisioning"}
 2021/04/12 21:03:41 Terraform apply | 
 2021/04/12 21:03:41 Terraform apply |   on cluster/cluster.tf line 30, in resource "ibm_container_vpc_cluster" "cluster":
 2021/04/12 21:03:41 Terraform apply |   30:  resource "ibm_container_vpc_cluster" "cluster" {
 2021/04/12 21:03:41 Terraform apply | 
 2021/04/12 21:03:41 Terraform apply | 
 2021/04/12 21:03:41 Terraform APPLY error: Terraform APPLY errorexit status 1
 2021/04/12 21:03:41 Could not execute action

hkantare commented 3 years ago

We added more validation, along with the 404 status code check, to confirm whether the cluster really exists, in order to eliminate these kinds of intermittent issues from IKS. This was fixed as part of the new release: https://github.com/IBM-Cloud/terraform-provider-ibm/releases/tag/v1.23.2

TBradCreech commented 3 years ago

Using our latest maintenance automation, which leverages the latest IBM Terraform provider, we ran a bulk update cycle on Tuesday across multiple environments and their vpc-gen2 clusters.

We had mixed results with some successes, and some failures on the Terraform apply. Fortunately, for the failures, a restart was able to resume the worker updates where things left off at the time of the failure.

In most cases, the apply failure was the 2h timeout, like so:

 2021/06/09 23:15:58 Terraform apply | Error: Error waiting for cluster (c0rdesvw07u9emdmomkg) worker nodes kube version to be updated: timeout while waiting for state to become 'normal' (last state: 'updating', timeout: 2h0m0s)
 2021/06/09 23:15:58 Terraform apply | 
 2021/06/09 23:15:58 Terraform apply |   on cluster/cluster.tf line 30, in resource "ibm_container_vpc_cluster" "cluster":
 2021/06/09 23:15:58 Terraform apply |   30:  resource "ibm_container_vpc_cluster" "cluster" {

Documenting it here because it is the same failure documented in the original write-up of this git issue (see Example 3 at the top: cluster state not equal to normal - timeout).

QUESTION: Is there perhaps a known bug (issue) already open for a potentially erroneous timeout? We suspect that the worker did update cleanly, and that there may be a bug in the logic that ensures only one worker is updated at a time. This can be tricky: our prior automation suffered from an interesting problem with vpc-gen2 workers, where our algorithm for looping through them one at a time used the total worker count as an iterator. However, the total number of workers drops by one at precisely the moment a worker has been deleted. That simple fact caused our loop to spin and eventually time out -- something very similar to what we see with the Terraform provider. This is just a guess, but it seems worth looking at as we seek an explanation for why we see so many timeouts while waiting for a single worker to be updated.

hkantare commented 2 years ago

Closing the issue since we fixed some of the upgrade issues: 1) support for the patch_version argument, and 2) showing a diff on patch_version if any worker node fails to update, so the remaining nodes are upgraded in the next terraform apply.

jjasghar commented 2 years ago

I believe I just ran into this issue using Schematics and IBM Cloud Terraform:

Error: Request failed with status code: 409, ServerErrorResponse: {"incidentID":"47fc3144-fab0-92be-8e4a-12fd3dbc885c,47fc3144-fab0-92be-8e4a-12fd3dbc885c","code":"E0007","description":"A cluster with the same name already exists. Choose another name.","type":"Provisioning"}

INTERNAL GITHUB: https://gist.github.ibm.com/jja/9e38d2fe616b93b053fb38271e71985e